inaturalist / inaturalist

The Rails app behind iNaturalist.org
http://www.inaturalist.org
MIT License
654 stars 184 forks

Follow user's locale and common names place preference when downloading checklist #2688

Open tiwane opened 4 years ago

tiwane commented 4 years ago

When downloading a checklist (eg https://www.inaturalist.org/lists/110481-Cat-Ba-Island-and-Surrounding-Areas-Check-List), the taxon_common_name column behaves inconsistently, sometimes using all English common names, sometimes all Spanish common names, etc., regardless of the user's locale or common name preferences.

It would be best to make the names in this column consistent, so I propose that we should use whatever common name (if any) the user would see when they viewed that taxon. So if the checklist includes Anas platyrhynchos and my locale is Dutch, the checklist should use "Wilde Eend" for that taxon in the taxon_common_name column.
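That proposal can be sketched as a simple locale lookup with fallbacks. This is only an illustration, assuming a taxon's common names are available keyed by locale; the method and data shapes here are hypothetical, not iNaturalist's actual model API:

```ruby
# Pick the common name a user would see for a taxon, falling back to a
# default locale and finally the scientific name. All names and data
# here are hypothetical illustrations, not the app's real API.
def display_common_name(common_names_by_locale, user_locale, scientific_name)
  common_names_by_locale[user_locale] ||
    common_names_by_locale["en"] ||
    scientific_name
end

names = { "en" => "Mallard", "nl" => "Wilde Eend", "es" => "Ánade Azulón" }
display_common_name(names, "nl", "Anas platyrhynchos") # => "Wilde Eend"
display_common_name(names, "fr", "Anas platyrhynchos") # => "Mallard" (fallback)
```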

Optilete commented 4 years ago

In my opinion a checklist behaves consistently when it has 600 items or fewer: the common names follow the user's locale. The checklist is inconsistent when it has 2000 items or more: it often has English common names or something else, but not according to the user's locale.

So it is a bug, not an improvement.

https://www.inaturalist.org/check_lists/1579594-Naamlijst-van-de-vaatplanten-in-Nederland

todtb commented 4 years ago

Given that, along with a very cursory glance at the code, I strongly suspect that CSVs exceeding 1000 items are being cached. That would explain the inconsistent behavior observed: if one user generates the CSV, the file is cached and served to the next user who downloads it until that cache expires (an hour?). This means you get the locale of whoever ran it last.

One quick approach would be to cache locale specific CSVs for each list, though that could potentially grow in size quite rapidly. Another approach would be to rework at what level the caching occurs, but that could be complicated and eliminate some of the performance benefits of said caching.

@kueda If we were to cache locale specific versions at app/controllers/shared/lists_module.rb:128-140 would that overwhelm the cache store?

kueda commented 4 years ago

would that overwhelm the cache store?

Maybe, and I guess we could try it and see, but ideally we should not be making more use of caches_page (it writes to disk, adding more dependence on a somewhat-tenuous NFS-based setup). List CSVs are not very frequently accessed, so they don't really benefit from caching. The only real benefit caching may be providing at present is light protection against the unlikely use of list CSV generation in a particularly uninspired DOS attack. A better approach would be to ditch that caching and have these large CSV exports generate in the background like we do with observation exports, but then they will start causing the same problems we have there, i.e. large exports become yet more long-running DelayedJob processes that prevent us from deploying for fear of interrupting them.

@todtb, if you're thinking of working on this, my approach might be to switch large CSV generation over to the (also-dated) Riparian / FlowTask system we're using for obs exports, where each export is a FlowTask that has a FlowTaskOutput with the CSV as an attachment. That said, Riparian is kind of an old, crazy mess that I built. It kind of makes sense to me, but these days, @pleary might have a better way of performing these kinds of tasks.

todtb commented 4 years ago

Looking into this a little more @kueda, I'm reconsidering the effects of caching at a per-locale or even per-user level (as each user may have varying locale/common name settings).

It looks like the only thing stored in memcache is a key, generate_csv_taxonomic_#{id} or generate_csv_#{id}, to indicate a job is running. Additionally, this is set to expire in an hour.

I think we could handle generate_csv_taxonomic_#{id}_#{user_id} and generate_csv_#{id}_#{user_id} unless these lists are being exported to csv at an incredibly high rate.
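As a sketch, the per-user key scheme might look like the following helper. The method name and signature are made up for illustration; the real keys are built in the lists controller:

```ruby
# Build the memcache key that marks a CSV-generation job as running.
# Scoping the key by user_id keeps one user's in-flight export from
# blocking another user's. Method name and signature are illustrative.
def generate_csv_cache_key(list_id, user_id, taxonomic: false)
  base = taxonomic ? "generate_csv_taxonomic" : "generate_csv"
  "#{base}_#{list_id}_#{user_id}"
end

generate_csv_cache_key(110481, 42)                  # => "generate_csv_110481_42"
generate_csv_cache_key(110481, 42, taxonomic: true) # => "generate_csv_taxonomic_110481_42"
```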

Do we have any baseline metrics for how often this occurs?

Also FWIW, Riparian/FlowTasks don't look that crazy or a mess to me even if a bit old :)

todtb commented 4 years ago

Hey @kueda, just checking in if this issue is good to close or if there was still something holding it open such as monitoring cache store?

kueda commented 4 years ago

@todtb, apologies, I didn't do a good job reviewing your PR. While I merged it in, I don't think it actually addresses the original problem here, which is localizing the common names in the CSV exports from lists. I just tried going to https://www.inaturalist.org/lists/110481-Cat-Ba-Island-and-Surrounding-Areas-Check-List, setting my language to French, then downloading the regular CSV and taxonomic CSV files, and both had English common names. I think that in addition to the cache key change you made, List#generate_csv needs to be updated to take the logged-in user's name preferences into consideration.

todtb commented 4 years ago

No, I'm sorry, that's my bad. I had assumed that aspect was already in place and that we were looking at a cached version of a report generated for, say, an EN user being sent to a FR user. I'll take another look at this tomorrow.

todtb commented 4 years ago

Hey @kueda, after looking into this a bit more I've noticed a few things on how this is working.

First and foremost, it appears that once the CSV is generated, it is placed here as a static asset: "public/lists/#{@list.to_param}.taxonomic.csv".

Not even clobbering assets, clearing the cache, etc. will help: unless this file is removed, a GET request for /lists/{LIST_NAME}.taxonomic.csv will no longer hit app/controllers/shared/lists_module.rb#show at all, and therefore, without any outside processes, this is THE copy of the CSV.
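The shadowing can be seen from the path construction alone. Here's a toy helper mirroring the path quoted above; the helper itself is hypothetical, not the app's code:

```ruby
# Once a file exists at this path under public/, the web server serves
# it directly and the lists controller's show action is never reached.
# Helper name and signature are hypothetical, for illustration only.
def static_csv_path(list_param, taxonomic: false)
  suffix = taxonomic ? ".taxonomic.csv" : ".csv"
  File.join("public", "lists", "#{list_param}#{suffix}")
end

static_csv_path("110481-cat-ba-island", taxonomic: true)
# => "public/lists/110481-cat-ba-island.taxonomic.csv"
```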

So, that makes me wonder:

- Is there some process running in production that periodically cleans these up (to allow for exporting, say, a newer version of the list)?
- Is this handled differently across different environments?
- Has this issue been around for some time?

The good news:

I think before continuing it might warrant a bit more technical discussion or your advice regarding the questions above; I certainly don't think I understand everything that's going on from a systems level.

pleary commented 4 years ago

Is there some process running in production that periodically cleans these up (to allow for exporting, say, a newer version of the list)?

No - other than tmp files, Rails needs to manage any files it creates

Is this handled differently across different environments?

The handling of public/static files differs by environment. Our development web server is Thin, which will prioritize public files over routes with the same path unless you tell it not to serve any static files by adding config.serve_static_files = false to the environment. When it's false, the controller will always be hit and there's no benefit to having the file in the public directory.
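For reference, the toggle described here is a standard Rails setting; in a development environment config it would look like:

```ruby
# config/environments/development.rb
# With this set to false, Thin stops serving files out of public/ and
# every request for /lists/:id.csv reaches the controller instead.
Rails.application.configure do
  config.serve_static_files = false
end
```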

In our production environments we use nginx and Passenger to serve the Rails app. The actual config file isn't public anymore, but the basic setup is that nginx has its root directory set to the /public dir and Passenger handles requests to the app. If nginx finds a file in public matching the request, it'll serve that directly and bypass Passenger.

One more consideration for our production environment is we have many different app servers load balanced behind www.inaturalist.org and for large files and long-running tasks like this we want the output to go to something shared across all the apps, like an NFS-shared directory or a backend like the database or memcached. The public/lists directory is one that is already shared, and we can make others shared as needed.

Has this issue been around for some time?

I suspect it's been happening for years, if not forever.

So it seems like this approach of having Rails respond to the route lists/:id.csv while custom code saves files into public/lists/:id.csv won't work well with the current setup. I was wondering if we could use caches_action for this at all, but that won't work for a couple of reasons: it would end up using Memcached, and some CSV files will be too large. The delayed job component also means we'd have to write the cache file programmatically for delayed jobs. I also looked to see if there's any nginx config to tell it to ignore public files in certain subdirectories, but that's kind of odd and seems like bad practice.

I'm thinking a better approach would be to change the path where we write these CSV files - creating different files as needed for things like locale, or even locale plus the user's preferred place - and if we get a new request and its corresponding CSV file exists and is done processing, serve that with something like render file: /path. Ideally it would be written to a directory shared across all production apps. I suppose it could technically remain in public as long as the path didn't conflict with an existing route. Since the requests would be going through Rails, it could handle any cache invalidation. We could also decide to regularly delete older cached CSVs.
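A rough sketch of that serve-or-generate flow using plain file operations. The directory layout, file-name scheme, and method name are assumptions, not the actual implementation, and a real version would enqueue a background job rather than generate inline:

```ruby
require "fileutils"
require "tmpdir"

# Return the path of a locale-specific cached CSV, generating it via
# the block only when no cached copy exists yet. The layout
# ("#{id}-#{locale}.csv") and method name are stand-ins for real code.
def csv_path_for(cache_dir, list_id, locale)
  path = File.join(cache_dir, "#{list_id}-#{locale}.csv")
  unless File.exist?(path)
    FileUtils.mkdir_p(cache_dir)
    File.write(path, yield(locale)) # the real app would enqueue a job here
  end
  path
end

dir = File.join(Dir.tmpdir, "list_csv_cache_demo")
csv_path_for(dir, 110481, "fr") { |_loc| "taxon_common_name\nCanard colvert\n" }
```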

todtb commented 4 years ago

Thanks for the insight, @pleary.

So if I understand correctly, as suspected, the static assets generated under /public/lists are being served in production instead of a new CSV being generated. The caching we have right now isn't really playing much of a role in this; it simply prevents multiple jobs from firing simultaneously (a good thing in and of itself, but not pertinent to this issue).

In the original PR I had naively updated the generate_csv_cache_key method based on user_id, but as mentioned above that was just for the cache key controlling jobs. Perhaps a similar approach could be taken for the naming of the generated file instead, such as 1234-some-list-#{user_id}.taxonomic.csv. Or maybe even just the locale: 1234-some-list-en.taxonomic.csv.

I agree that storing the output in the cache would probably not work due to combined file sizes overwhelming it, so that leaves us with storing it on the NFS mount. That raises the question of clearing out those assets. I'm sure we can find an expiry time that makes sense (e.g. purge them 1/8/24/whatever hours after file creation), but with a shared NFS mount this becomes marginally trickier, though I don't think concerns over file locks would be too bad.

One approach would be a scheduled job through Rails (basically a cron wrapper), or just a straight-up crontab job. However, with such an approach I would typically offload that to another, non-web worker that also has the mount. I don't know if you have something like that available already, so that might be a not-insignificant ask for a CSV locale feature.

Whatever route is taken, I think at this point the next steps may extend beyond the capabilities of external (community OSS) implementation due to some very systems-specific questions, but I'd be happy to help wherever possible!