Pre-calculate file entries when extracting a version

bobsilverberg commented 4 years ago

This is a spin-off from mozilla/addons#7463

As discussed there, the file entries for a given browse or compare API request are calculated on the fly, based on the filesystem. Some add-ons have thousands of files, and this calculation can be quite expensive and time consuming.

We could address this by pre-calculating the file entries for a version when the version is extracted, and storing them on disk in a json file. This info could then be used via the API which returns information about a version.

One issue to consider is that, for the compare API, the file entries are based on a combination of two versions, and we wouldn't want to do a pre-calculation and store data for every possible combination of versions. One possible way to address this would be to limit the pre-calculation to a set of versions that are close to the current version. This would need to be coupled with the ability to generate the file entries on the fly if the pre-calculation for a particular compare request did not exist.

┆Issue is synchronized with this Jira Task

bobsilverberg commented 4 years ago

After https://github.com/mozilla/addons-server/pull/14163 lands, and after we update code-manager to use the new params, I think that this is the next thing we should tackle (in terms of API performance). As discussed in a number of places, the fact that entries are calculated on-the-fly for the first API request for a version will continue to cause API performance issues. I believe the idea was that we might do this pre-calculation as part of the process that does the git extraction for a version. Is that accurate @diox? Are we waiting for a finalized version of the new git extraction scheme before tackling this?

cc @willdurand

willdurand commented 4 years ago

I believe the idea was that we might do this pre-calculation as part of the process that does the git extraction for a version.

mmm what exactly should be pre-computed? And how? And stored where?

Are we waiting for a finalized version of the new git extraction scheme before tackling this?

In any case, I think we'll have to wait until the new git extraction scheme is stable and we are confident that it improves the current situation in production (I would like to have some weeks/months to monitor the new approach).

Some additional notes:

Adding some new processing during the git extraction of a version sounds a bit risky to me but I am not saying that we shouldn't do it. That could be done in a different Celery task, probably only created if the extraction has been successful.

On the other hand, maybe the very first API request could build the cache. That's not ideal but that would be a one-time penalty and that would avoid to mess with the git extraction process (or to handle the case where the pre-computation has failed after git extraction but we still need to serve an API response).

I believe @diox will have more concrete info than me, though.

diox commented 4 years ago

I believe the idea was that we might do this pre-calculation as part of the process that does the git extraction for a version. mmm what exactly should be pre-computed? And how? And stored where?

https://github.com/mozilla/addons/issues/7463 point 1. Essentially, this: https://github.com/mozilla/addons-server/blob/4a78845a8457aa14633cc094d619755193b166e6/src/olympia/reviewers/serializers.py#L146-L191

In any case, I think we'll have to wait until the new git extraction scheme is stable and we are confident that it improves the current situation in production (I would like to have some weeks/months to monitor the new approach).

I agree.

On the other hand, maybe the very first API request could build the cache. That's not ideal but that would be a one-time penalty and that would avoid to mess with the git extraction process (or to handle the case where the pre-computation has failed after git extraction but we still need to serve an API response).

API caching on first request is sort of how we do it right now, and it slows down the first request considerably for no good reason: at extraction time, we should have the data, we just extracted it.

willdurand commented 4 years ago

mozilla/addons#7463 point 1. Essentially, this:

Ha, ok. That seems less risky now. So we would have to warm-up the (memcached) cache right after the extraction, right?

API caching on first request is sort of how we do it right now, and it slows down the first request considerably for no good reason: at extraction time, we should have the data, we just extracted it.

Ha, oops. I thought the cache was scoped to the request or had a very short lifetime for some reasons.. So please forget everything about that part 😓

diox commented 4 years ago

I would prefer we do it with a more permanent storage than memcached. Ultimately the file entries list won't change for a given commit, and we definitely want it, so why not store it somewhere once and for all.

bobsilverberg commented 4 years ago

@diox && @willdurand all of the other issues related to code manager and APIs have been resolved, so I was wondering if now would be a good time to address this issue, which is still causing slowness the first time an API is called for a version which has a lot of files. Do you think the timing is right now, and if so, is this something I should work on, or would it make more sense for @willdurand to do it if it's going to be part of the Git extraction code?

willdurand commented 4 years ago

would it make more sense for @willdurand to do it if it's going to be part of the Git extraction code?

That, I am not sure, the backfill is going to be very tricky.

I suppose we could build a different task to build this cache and store it on disk somewhere else (not in git). Depending on how the data is exposed, we might also want to evaluate whether reading from disk is generally faster than only having a single slow API call.

KevinMind commented 6 months ago

Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDSRV-37

mozilla / addons

Pre-calculate file entries when extracting a version #1829

Some additional notes: