googleapis / google-cloud-python

Google Cloud Client Library for Python
https://googleapis.github.io/google-cloud-python/
Apache License 2.0

Cloud storage API? #1959

Closed hohaichi closed 7 years ago

hohaichi commented 8 years ago

Hi,

Google Cloud Storage accepts a customer-supplied encryption key as a base64-encoded string in JSON and XML APIs (https://cloud.google.com/storage/docs/encryption#customer-supplied). Why does GCloud work differently, taking either raw bytes or raw strings? I found it confusing. Also, the error messages do not match the error in this case.

Thanks for taking a look.

daspecster commented 8 years ago

Hello @hohaichi! gcloud-python handles the base64 encoding before the request as a convenience.

Which errors are you referring to? It would be good to make those as clear as possible.

@tseaver & @dhermes do you guys have anything to add?
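For context, the convenience described here amounts to roughly the following: the raw 32-byte key is base64-encoded, together with a base64-encoded SHA256 of the key, and placed in the request headers. This is a simplified sketch based on the customer-supplied-encryption docs, not the actual gcloud-python code:

```python
import base64
import hashlib

def encryption_headers(raw_key):
    """Build the CSEK request headers from a raw 32-byte key (sketch)."""
    return {
        'X-Goog-Encryption-Algorithm': 'AES256',
        'X-Goog-Encryption-Key': base64.b64encode(raw_key).decode('ascii'),
        'X-Goog-Encryption-Key-Sha256': base64.b64encode(
            hashlib.sha256(raw_key).digest()).decode('ascii'),
    }

headers = encryption_headers(b'\x00' * 32)
print(headers['X-Goog-Encryption-Key'])  # 44-character base64 text
```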

hohaichi commented 8 years ago

Hi @daspecster,

This is the error:

>>> with open('/usr/local/google/home/hohaichi/helloworld.txt', 'rb') as f:
...     blob.upload_from_file(f, encryption_key='0QZa2d3ZYTghqnSn1Ca1OdSQXaN/tujOus7nVbxo6nE=')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gcloud/storage/blob.py", line 539, in upload_from_file
    self._check_response_error(request, http_response)
  File "/usr/local/lib/python2.7/dist-packages/gcloud/storage/blob.py", line 401, in _check_response_error
    error_info=request.url)
gcloud.exceptions.BadRequest: 400 Missing an encryption key, or it is not base64 encoded, or it does not meet the required length of the encryption algorithm. (https://www.googleapis.com/upload/storage/v1/b/chi-gcloud-trial/o?uploadType=media&name=foo)
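For reference, the failure above comes from double encoding: gcloud-python base64-encodes whatever it is given, so a key that is already base64-encoded must first be decoded back to its 32 raw bytes. A small sketch using the key from the session above (the commented-out upload call assumes the `f` and `blob` from that session):

```python
import base64

b64_key = '0QZa2d3ZYTghqnSn1Ca1OdSQXaN/tujOus7nVbxo6nE='

# Passing the base64 text directly makes the library re-encode it,
# so the server decodes a 44-byte "key" and rejects the request.
raw_key = base64.b64decode(b64_key)
print(len(raw_key))  # 32 -- the length the library actually expects

# blob.upload_from_file(f, encryption_key=raw_key)  # would then succeed
```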

daspecster commented 8 years ago

Per the link you mentioned before, that is the expected error.

You will receive an HTTP 400 error in the following cases:

  • You upload an object using a customer-supplied encryption key, and you attempt to perform another operation on the object (other than requesting or updating metadata) without providing the key.
  • You upload an object using a customer-supplied encryption key, and you attempt to perform another operation on the object with an incorrect key.
  • You upload an object without providing a customer-supplied encryption key, and you attempt to perform another operation on the object with a customer-supplied encryption key.
  • You specify an encryption algorithm, key, or SHA256 hash that is not valid.

https://cloud.google.com/storage/docs/encryption#response

Unless I'm missing something?

hohaichi commented 8 years ago

As the link you pointed to shows, the error comes directly from the underlying API, which expects a base64 string encoding 32 bytes. It is confusing when gcloud doesn't expect a base64 string but complains that the input is not base64.

That said, IMO consistency between gcloud and the API would make it easier to use and to move between different libraries. It is not confusing that gcloud accepts raw bytes for convenience. But it is confusing that gcloud also accepts strings, yet expects a different format.

daspecster commented 8 years ago

Ok, I think I see now. So this error is actually a bug as well.

Two things actually.

  1. Even though you pass a base64-encoded key, it should still accept that, base64-encode it again, and upload.
  2. If the key is base64-encoded and we encode it again, then you would see issues with other libraries, as you mention. So we'll have to figure out a plan for this, I think.

hohaichi commented 8 years ago

Yes, but for (1), I guess it did still accept the base64 string, encode it again, and upload. The problem in this case is the length: gcloud expects a string of 32 characters.

In general, strings come with encodings. When reading the gcloud docs (http://googlecloudplatform.github.io/gcloud-python/stable/storage-blobs.html), I was wondering whether the encryption_key string should be ASCII, hex, or unicode (and which one?). If my guess is right, gcloud expects ASCII (which may have problems with non-printable characters), although the examples look like hex ('aa426195405adee2c8081bb9e7e74b19').
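The ambiguity is easy to show: the docs' 32-character hex-looking string is a valid 32-byte ASCII key, but read as hex digits it denotes only 16 bytes. A small sketch using the example string from the docs:

```python
import binascii

example = 'aa426195405adee2c8081bb9e7e74b19'

as_ascii = example.encode('ascii')    # treat the text itself as the key
as_hex = binascii.unhexlify(example)  # treat it as hex digits

print(len(as_ascii))  # 32 -- a valid key length
print(len(as_hex))    # 16 -- too short for AES-256
```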

daspecster commented 8 years ago

The example there shows a plain string of 32 characters used as the encryption key, not a hex value. https://googlecloudplatform.github.io/gcloud-python/stable/storage-blobs.html#gcloud.storage.blob.Blob.upload_from_file

We could change it somewhat to make it more clear.

>>> client = storage.Client(project='my-project')
>>> bucket = client.get_bucket('my-bucket')
>>> encryption_key = '####my 32 byte encryption key###'
>>> blob = Blob('secure-data', bucket)

hohaichi commented 8 years ago

Sorry, I meant that 'aa426195405adee2c8081bb9e7e74b19' in the examples looks like a hex number. :)

jgeewax commented 8 years ago

Do we have a precedent here for whether we should do the base64 encoding or not? I can see the argument that we shouldn't, if it means that people might upload using the CLI or the API directly and then want to access via gcloud-python, only to find that the key doesn't seem to be valid...

Is there a way we can make it explicit whether you're passing in a base64 or non-base64 key?

ie:

blob = Blob(... encryption_key='raw encryption key base64encoded')

# or

blob = Blob(... encryption_key=client.EncryptionKey('non-base64 encoded'))

?

I guess this comes down to the big question: are we supporting a "password" or an opaque key? If a password, then we should accept regular bytes (str) and base64-encode it for convenience. If we're accepting an opaque key (akin to, say, a bitcoin private key...), then base64 seems like the right way to accept data, and we should pass it along as-is.

Thoughts?

tseaver commented 8 years ago

ISTM that library users should not be base64-encoding the encryption key: that is a requirement of the transport (raw-bytes-over-JSON), not of the API surface. We should just expect users to give us exactly 32 bytes of whatever, and we handle passing them to the server correctly. E.g., if in the future we had a gRPC variant for Cloud Storage, we would not be base64-encoding the key before assigning it to the protobuf field for the key.
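Under that model a user would just generate and hold 32 opaque bytes, never touching base64 themselves; a sketch of the suggested usage (the commented-out upload line assumes an existing `blob` and file object `f`):

```python
import os

# 32 bytes of "whatever" -- the caller never base64-encodes anything.
encryption_key = os.urandom(32)
print(len(encryption_key))  # 32

# blob.upload_from_file(f, encryption_key=encryption_key)
# The library would base64-encode for the JSON transport, while a
# future gRPC transport could send the same 32 bytes unencoded.
```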

hohaichi commented 8 years ago

I don't know whether "password" keys or opaque keys will be more popular. And honestly, I did find that the gcloud-python library is very easy to use as it is. My concerns are about consistency, though.

And a minor point: the example key would be clearer as bytes than as a string of hex digits.

jgeewax commented 8 years ago

  1. I agree the docs shouldn't hint that the key is hex digits if it's not actually limited to hex digits.
  2. With an opaque key of bytes (rather than, say, Datastore's keys or GCE object names) I suspect we might want to tie the format more closely to how people traditionally represent bytes as text (in this case, either hex or base64).
  3. I worry that if someone uses the gsutil / gcloud CLI and then tries to pass the key in through gcloud-python, they're going to be mad if their base64 key doesn't work (it'll be rejected as the wrong key) and they have to base64-decode it so that gcloud-python can re-base64-encode it...

if in the future we had a gRPC variant for CloudStorage, we would not be base64-encoding the key before assigning it to the protobuf field for the key.

I'd argue that the API surface would still want to accept the base64 encoded version, which we then decode and send as raw bytes in the RPC call...

tseaver commented 8 years ago

I suppose we could sniff at the supplied key: if it is of type bytes and its length is exactly 32, then we base64-encode it before jamming it into the headers (ugh!); otherwise, we expect that it is already base64-encoded and just jam it in directly.

daspecster commented 8 years ago

Detecting whether it is, in fact, base64 could be tricky and not very reliable, and I think false positives would be frustrating, especially when it comes to security.

What if we had a flag to denote whether the key is already base64-encoded? The downside is that a flag might force users to understand more layers of the system.

blob = Blob(... encryption_key='non-base64 encoded')

And then either:

blob = Blob(... encryption_key_b64='bm9uLWJhc2U2NCBlbmNvZGVk')

or:

blob = Blob(... encryption_key='bm9uLWJhc2U2NCBlbmNvZGVk', b64encoded=True)
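A minimal sketch of the flag variant; the names `prepare_key` and `b64encoded` here are hypothetical, not an actual gcloud-python API:

```python
import base64

def prepare_key(encryption_key, b64encoded=False):
    """Return the base64 text to put in the request header (sketch)."""
    if b64encoded:
        # Trust the flag, but validate: it must decode to 32 raw bytes.
        if len(base64.b64decode(encryption_key)) != 32:
            raise ValueError('base64 key must decode to exactly 32 bytes')
        return encryption_key
    return base64.b64encode(encryption_key)

encoded = base64.b64encode(b'x' * 32)
assert prepare_key(b'x' * 32) == encoded
assert prepare_key(encoded, b64encoded=True) == encoded
```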

tseaver commented 8 years ago

I don't think we have to detect that the user has passed us base64-encoded bytes. If the passed value is exactly 32 bytes, it can't be a valid base64-encoded key: the base64 encoding of a 32-byte key is exactly 44 bytes long. So in that case, we can just base64-encode it and go on. For any other case, we just assume the user did it already and pass it through untouched: if it happens that they somehow passed a 33-byte random byte array, the back-end will blow up. E.g.:

    if isinstance(encryption_key, bytes) and len(encryption_key) == 32:
        encryption_key = base64.b64encode(encryption_key)
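Both branches of that length check can be exercised in a few lines (a sketch of the proposed logic, not library code):

```python
import base64

def maybe_encode(encryption_key):
    # Exactly 32 raw bytes cannot themselves be a base64-encoded
    # 32-byte key (that encoding is always 44 bytes), so encode here.
    if isinstance(encryption_key, bytes) and len(encryption_key) == 32:
        return base64.b64encode(encryption_key)
    return encryption_key  # assume the caller already encoded it

raw = b'\x01' * 32
already = base64.b64encode(raw)

print(maybe_encode(raw) == already)      # True: raw bytes get encoded
print(maybe_encode(already) == already)  # True: encoded input passes through
```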