kothar / go-backblaze

A golang client for Backblaze's B2 storage
MIT License

Get a separate upload URL for each upload request #11

Closed by kothar 8 years ago

kothar commented 8 years ago

When using the provided UploadFile methods, upload URLs are now returned to a pool for re-use, avoiding an extra API call per file when uploading many files.
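Roughly, the pooling behaves like the sketch below; uploadAuth and authPool are illustrative names, not necessarily the exact types used in the code:

```go
package b2sketch

import "sync"

// uploadAuth is a hypothetical stand-in for the upload URL and
// authorization token pair returned by b2_get_upload_url.
type uploadAuth struct {
	URL   string
	Token string
}

// authPool caches authorizations between uploads so that a run of
// many files pays for one b2_get_upload_url call per concurrent
// uploader rather than one per file.
type authPool struct {
	mu    sync.Mutex
	auths []*uploadAuth
}

// get pops a cached authorization, or returns nil when the caller
// should request a fresh one from the API.
func (p *authPool) get() *uploadAuth {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.auths); n > 0 {
		a := p.auths[n-1]
		p.auths = p.auths[:n-1]
		return a
	}
	return nil
}

// put returns an authorization to the pool after a successful upload.
func (p *authPool) put(a *uploadAuth) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.auths = append(p.auths, a)
}
```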

Error code recognition for fatal vs. non-fatal errors has been updated to include more non-fatal cases according to the upload documentation.

This closes #9

ErhanAbi commented 8 years ago

@kothar Thanks for the work on fixing #9. So basically, you're now reusing upload tokens by storing them in a pool every time an upload finishes with one. But I couldn't find the expiry time of an upload token in their documentation.

What I'm trying to say is that it really depends on how long a token takes to expire. If it expires immediately after uploading a file, then the pool of tokens is useless, because when you try to reuse a token the server will respond with 401 token_expired, and you'll end up creating another token to use for the upload anyway.

So instead of making a call every time to get a fresh token, you're relying on an old one and risking an upload attempt that has a high chance of failure. (From their docs: on each upload, the client should acquire an upload token first.)

I also have another use case in mind. Say you have files A, B and C and you upload them at moment 0. All three uploads are successful, and you end up with a pool of 3 upload tokens you can reuse. After 10 minutes (enough time for all your tokens to expire), you want to upload file D.

Following your UploadFile function, you upload using an old token. The upload fails with 401 token_expired, so you invalidate the token, remove it from the pool and retry the upload. On the second try you again use an expired token, but this time you don't retry any more; you just return the error, i.e. 401 token_expired. And you're still left with an expired token in your pool.

Also, since you specify that the upload will retry once more if the error was non-fatal, you're consuming that single chance by using an expired token on the first try.

Those are just my thoughts at the moment, after skimming through the source code. Or am I missing something?

Thanks!

kothar commented 8 years ago

Hi @pfzero, thanks for such a detailed use-case explanation; it's definitely got me thinking.

I was basing the token pooling on the documentation here: https://www.backblaze.com/b2/docs/uploading.html

The URL and authorization token that you get from b2_get_upload_url can be used by only one thread at a time. If you want multiple threads running, each one needs to get its own URL and auth token. It can keep using that URL and auth token for multiple uploads, until it gets a returned status indicating that it should get a new upload URL.
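As an illustration of that per-thread pattern, here's a minimal sketch; getUploadURL, uploadOne and authRejected are hypothetical stand-ins for the B2 calls, not this library's API:

```go
package b2sketch

import "sync"

// Hypothetical stand-ins for the B2 calls; not the library's API.
type uploadAuth struct{ URL, Token string }

func getUploadURL() (*uploadAuth, error)      { return &uploadAuth{}, nil } // b2_get_upload_url
func uploadOne(a *uploadAuth, f string) error { return nil }                // the upload POST
func authRejected(err error) bool             { return false }              // e.g. 401 token_expired

// uploadAll gives each worker goroutine its own upload URL, which it
// keeps reusing until the service rejects it, exactly as the quoted
// documentation describes.
func uploadAll(files []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			auth, _ := getUploadURL() // one URL per goroutine
			for f := range jobs {
				if err := uploadOne(auth, f); authRejected(err) {
					auth, _ = getUploadURL() // fetch a fresh URL, retry once
					_ = uploadOne(auth, f)
				}
			}
		}()
	}
	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
}
```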

My first thought was also to remove all caching of upload tokens, but it occurred to me that if you are uploading thousands of files, you can avoid an extra API call by re-using existing tokens until they expire.

Since B2 has other potential error modes (internal server error, for example), the user of this API library must already check for those and retry where possible, so introducing these expiry errors shouldn't add any extra complexity to the calling code.

In the case where you have 5 pooled tokens which have all expired, I agree there may be a problem with retrying them all.

I see three possible changes:

  1. As you suggest, remove all caching. I've just noticed that get_upload_url requests are 'Class A', which means they are not billed. If the client is using multiple threads to upload, the additional latency of the extra API call should be hidden.
  2. Make the bucketAuthorizationState type public, allowing the user to re-use the upload URL as desired. This should probably be a new UploadHandle type, which could internally renew the upload token in the event of an upload_token_expired status being returned by the API.
  3. Leave the token pooling as it currently stands, but if one token is invalidated then invalidate the whole pool, preventing multiple invalid tokens from being returned (see the sketch after this list).
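A minimal sketch of option 3's invalidation rule, using a hypothetical tokenPool type:

```go
package b2sketch

import "sync"

// tokenPool is a hypothetical pooled store of upload tokens.
type tokenPool struct {
	mu     sync.Mutex
	tokens []string
}

// invalidateAll implements option 3: the first rejected token empties
// the whole pool, since the rest were acquired around the same time
// and are likely to be expired as well.
func (p *tokenPool) invalidateAll() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.tokens = nil
}
```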

Do you have a preference among these? All three will leave the existing API intact, but option 2 will force the user to write new code if they want to re-use tokens.

A related issue is what to do about retrying. The single-retry behaviour is convenient most of the time, but it doesn't guarantee success, so the client may still need its own retry logic. The documentation suggests retrying 5 times with exponential backoff, so that should probably be the behaviour implemented in UploadFile. In UploadHashedFile it's not possible to auto-retry, because a stream is passed in, although it might become possible if the input were changed to a seekable stream. I'd welcome any thoughts on this if you're using the library in a way that makes auto-retry useful.
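For illustration, a sketch of that retry behaviour with a seekable input; doUpload is a hypothetical stand-in for the actual upload call:

```go
package b2sketch

import (
	"io"
	"time"
)

// uploadWithRetry retries up to five times with doubling delays, as
// the documentation suggests. The input must be seekable so each
// attempt can re-read the data from the start, which is why a plain
// stream rules out auto-retry in UploadHashedFile today.
func uploadWithRetry(body io.ReadSeeker, doUpload func(io.Reader) error) error {
	delay := time.Second
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if attempt > 0 {
			time.Sleep(delay)
			delay *= 2 // exponential backoff
		}
		if _, err = body.Seek(0, io.SeekStart); err != nil {
			return err // cannot rewind: give up
		}
		if err = doUpload(body); err == nil {
			return nil
		}
	}
	return err
}
```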

ErhanAbi commented 8 years ago

Hi @kothar, sorry for the late response.

I don't think we have enough information to choose a good option from the list of solutions that involve caching tokens. We're quite in a "Jon Snow" situation, as we "know nothing" about the expiration of the upload token. From the docs: An uploadUrl and authorizationToken are valid for 24 hours **or** until the endpoint rejects the upload. So we're not safe to assume the token will stay valid for 24 hours; if it did, the pool would have made sense.

The best thing would be to open a ticket on their website and raise this issue, because not giving API consumers a way to check whether a token has expired surely impacts their backend performance.

Not having any way to determine whether a token has expired, other than blindly attempting an upload and watching it fail, is going to lead us to some complicated solutions for handling the pooling of tokens.

The "best" way that I can think of is to have a stack of tokens -> where the last token is the newest acquired -> and everytime an upload occurs, use tokens from the newest acquired to the oldest. If the newest one expired (or a newer one), invalidate all of the older tokens. You might also add an automatic removal/expiration of the token after 24 hours. Also, if the devs from backblaze implement an endpoint to check the validity of a token, this can be further implemented in the datastructure so when acquiring a token, the getter function should fetch the newest cached token -> check if it's valid and after that either invalidate the whole stack of older tokens, or, simply return it in case it's still valid. Other than the above solution, I think the cache-free solution would be the other, simpler option.

The auto-retry option is a really nice feature that I'm sure users of the library will find useful. You might control it via a SetAutoRetry(flag bool) method, a retry-after duration that defaults to something sensible, e.g. 30 seconds, doubling (or tripling, etc.) on each retry, and maybe a NumberOfRetries setting with a default of 3 retries. Of course this can be implemented over time, because the way retrying is implemented at the moment is good enough IMO.
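Something like this, as a sketch of the proposed surface (all names hypothetical, not the library's current API):

```go
package b2sketch

import "time"

// B2 here is a hypothetical client type carrying the proposed
// retry settings.
type B2 struct {
	autoRetry  bool
	retryAfter time.Duration // initial delay, doubled on each retry
	maxRetries int
}

// SetAutoRetry turns automatic retrying on or off.
func (b *B2) SetAutoRetry(flag bool) { b.autoRetry = flag }

// SetRetryAfter sets the initial delay before the first retry.
func (b *B2) SetRetryAfter(d time.Duration) { b.retryAfter = d }

// SetNumberOfRetries caps how many times an upload is retried.
func (b *B2) SetNumberOfRetries(n int) { b.maxRetries = n }

// NewB2 applies the suggested defaults: retries on, a 30-second
// initial delay, three retries.
func NewB2() *B2 {
	return &B2{autoRetry: true, retryAfter: 30 * time.Second, maxRetries: 3}
}
```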

Let me know if you have any other questions; maybe I can also help you with some of the above implementations. Thanks!

kothar commented 8 years ago

I've been thinking about this on and off today, and I think you're right. We don't know enough about the expiry to provide a sensible caching strategy. Rather than doing complex checks and pools, we can just go back to the simplest possible implementation - getting a new URL each time.

Regarding the token validity endpoint, that still wouldn't help: if you're going to make that call anyway, you might as well call the get_upload_url endpoint instead.

If you fancy taking a stab at implementing auto-backoff retries, I would be happy to accept a PR. I have some other projects to look in on before I could get to it myself. The only caveat I've seen is that the documentation says not to back off when tokens expire; everything else should use some kind of exponential doubling, as you described.
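As a sketch of that rule (isExpiredToken, isTransient and the callbacks are hypothetical stand-ins):

```go
package b2sketch

import "time"

// Hypothetical error classifiers; real code would inspect the B2
// status and error code from the response.
func isExpiredToken(err error) bool { return false } // e.g. 401 token_expired
func isTransient(err error) bool    { return false } // e.g. 408, 429, 5xx

// uploadWithBackoff retries an expired token immediately after
// renewing the upload URL (waiting will not revive it), but backs
// off exponentially for transient server trouble.
func uploadWithBackoff(upload func() error, renewURL func() error, maxRetries int) error {
	delay := time.Second
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = upload(); err == nil {
			return nil
		}
		switch {
		case isExpiredToken(err):
			if err = renewURL(); err != nil {
				return err // could not get a fresh URL
			}
		case isTransient(err):
			time.Sleep(delay)
			delay *= 2 // exponential doubling between attempts
		default:
			return err // fatal error: give up straight away
		}
	}
	return err
}
```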

Thanks again for your input!

Mike.

ErhanAbi commented 8 years ago

I'm glad we were able to reach consensus on this issue. For the moment, it's good enough to fetch a token each time the consumer performs an upload; any optimisations here should come from the Backblaze devs, and then we can rely on newer implementations of their API.

The token validity endpoint would help in the context of a stack of tokens: if one token has expired, it produces a domino effect that clears the older tokens as well. Also, I would prefer a simple GET request with a single query parameter to check the token, rather than POSTing (or at least attempting to POST) the whole file upload. I know the file wouldn't actually be sent, given the nature of the POST request, but there is a lot of additional metadata sent in the head of the request envelope. And checking a token's validity at its final point of use, by performing an upload, seems like bad API practice to me.

Maybe we could ping the devs from Backblaze on this issue. We might be missing something, or maybe they'd start focusing more on client usage of the API, because right now, with this issue at hand, the API seems quite server-oriented.

ErhanAbi commented 8 years ago

I forgot to tell you: I might have some time available in the near future, and I'll try to implement the auto-backoff policy for upload requests. I would really love to do it, and afterwards I'll ping you with a PR.

Erhan.

kothar commented 8 years ago

Great, I will leave it to you in that case!