inaturalist / inaturalist

The Rails app behind iNaturalist.org
http://www.inaturalist.org
MIT License
664 stars, 190 forks

Fix MushroomObserver Import (504 Gateway Timeouts) #3331

Closed joshuamcginnis closed 2 years ago

joshuamcginnis commented 2 years ago

Users are reporting a 504 Gateway Timeout when using the MushroomObserver import tool. There is an aggregated list of user reports concerning this issue here.

MushroomObserver maintainer @JoeCohen has kindly pointed out the potential sources of these issues. I'll try to condense them here, but you can see the full explanation at the iNat forum link above.

1. The MushroomObserver Observation XML API Response is returning invalid observation urls

This bug has been reported in MushroomObserver/mushroom-observer/issues/815.

Once this is fixed, the MO import tool can drop its workaround, trust the response urls, and stop calling HEAD on each url, which may be why the page times out for people with lots of observations.

2. The MushroomObserver import code is using the wrong url format for resolving image sources in https://github.com/inaturalist/inaturalist/blob/88c24041fe0df7f81fe1d04914ecdba29c319f43/app/models/mushroom_observer_import_flow_task.rb#L303

It is currently in the format:

image_url = "https://images.mushroomobserver.org/orig/#{image[:id]}.jpg"

The correct way to reference image urls is:

https://mushroomobserver.org/images/<size>/<id>.<ext>

@JoeCohen It probably wouldn't be a bad idea to add the image urls to the api2 response so that clients don't need to worry about url construction.
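As a rough illustration of the url format described in item 2, a helper along these lines could build image urls without assuming a host-specific path or a fixed extension. The method name `mo_image_url` and its keyword arguments are hypothetical, and `"orig"` as a size value is an assumption carried over from the old url; none of this is taken from the iNaturalist codebase.

```ruby
# Hypothetical helper, not part of the iNaturalist codebase.
# Builds a url in the https://mushroomobserver.org/images/<size>/<id>.<ext> format.
# Assumption: "orig" is an accepted size value, as in the old url.
def mo_image_url(id:, size: "orig", ext: "jpg")
  "https://mushroomobserver.org/images/#{size}/#{id}.#{ext}"
end

puts mo_image_url(id: 12345)
puts mo_image_url(id: 12345, ext: "png")
```

Passing the extension explicitly also avoids assuming every source image is a jpg.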

3. MushroomObserver recently moved from DigitalOcean to Google Cloud Platform.

This means this url needs to be updated as well: https://github.com/inaturalist/inaturalist/blob/88c24041fe0df7f81fe1d04914ecdba29c319f43/app/models/mushroom_observer_import_flow_task.rb#L293

Although iNat's importer doesn't appear to look at anything specific in error response bodies, future additions might parse the response body and show the end user a better error.

4. The import code assumes all source images are jpg. This is not always the case, and it may cause 404s or other errors when downloading images.

Proposed Fix

joshuamcginnis commented 2 years ago

The MushroomObserver team has fixed the urls in the api and added an <original_url> node to the list of images returned, so that iNat doesn't need to hardcode image paths. These two fixes should simplify the iNat MO import tool and make it more resilient.

kueda commented 2 years ago

I'm just going to remove this tool. I built it to facilitate people moving from MO to iNat, not to facilitate duplication of records between the platforms. We intend iNat as a source of data, not a sink. Since a few of the handful of people who use this tool seem to use it to work against that intent (and the tool tends to break), I don't think it's worth maintaining.

joshuamcginnis commented 2 years ago

I can appreciate your position, but I don't agree with how you've handled this. Just closing the issue based on a generalization about how people use this tool is unwise. You sought no opinion from the community, in which there are clearly many users who do in fact need and use this tool for a variety of reasons. That the tool breaks so often is a reflection of how it was implemented, not of the users who use it. I would propose that your closing the tool will in fact encourage less collaboration between iNat and MO, not more.

heelsplitter commented 2 years ago

I agree with @joshuamcginnis here. I'm a PhD student working on fungal systematics research and I've been using this tool for the past couple of years to duplicate my records. I appreciate that this tool was developed to facilitate users moving from MO to iNat, but this is not how it was perceived by MO users at the time, or by myself or other users using the tool for FunDiS (formerly North American Mycoflora Project) related projects. Some BioBlitz projects, permits or grants have required observations to be uploaded to iNat. I use MO primarily to hold my collection data and images (original size/quality), as well as for printing collection labels for herbarium submission (a feature not yet implemented in iNat). My desktop collection image / notes database is also organized by MO number. This feature has allowed me and others to comply with the aforementioned requirements while not sacrificing the personal herbarium management functionality of MO. It also acts as a failsafe in case MO goes down permanently, which seems more of a possibility than the reverse. I do hope you reopen this @kueda. It's a very useful tool even if it isn't being used the way it was originally intended.

AlanRockefeller commented 2 years ago

I really hope this tool gets fixed, I have a whole lot of really high quality photos on MO that would be really nice to have on iNat.

kueda commented 2 years ago

@heelsplitter, I sympathize with your use case, and I'm willing to repair and maintain it for a limited time period to help you and anyone else using it comply with existing permits or grants until you can figure out workarounds, like writing a script that you maintain that does the same thing. What would be a reasonable period? 6 months?

AlanRockefeller commented 2 years ago

Any amount of time that it works would be really useful, and a script I could run locally to keep my data in sync would be great too.

I would tweak it to help keep things in sync like microscopy photos that I have added since the import ran and DNA sequences. I would have it compare the number of photos on each observation and compare the Mushroom Observer sequence to the DNA Barcode ITS field.

seiryoku-zenyo commented 2 years ago

As of today, this import tool is not working. I understand it requires some time and effort to keep.

But I'm not sure iNat is fully realizing the benefits of giving users the possibility to bridge records from MO. By discontinuing such a tool, every side loses. The way of the future is interoperability between databases: GBIF, iNat, MO and others should all sync with each other. MO is a mycological database of reference, with high standards and valuable high quality observations. Keeping these entries alienated from iNat is bad for the community and especially for iNat. Such a tool wouldn't make iNat a 'sink', it would make iNat a better mycological database with better average fungi records than now.

I agree with others on keeping the tool online at least for some period of time, so as to give active mycologists a chance to bridge their work into iNat, contributing to the professional and amateur scientific communities. I believe that is the central goal of all these platforms and all of us.

kueda commented 2 years ago

@seiryoku-zenyo if it's not working for you, can you paste in the error messages you're getting? And/or tell me your inat username so I can try to replicate? It should be working until June.

seiryoku-zenyo commented 2 years ago

Hi, Thanks for the reply!

Weird, I thought it had been down all this time...

I simply get:

504 Gateway Time-out nginx

AlanRockefeller commented 2 years ago

I am also getting "504 Gateway Timeout" when previewing my import.

kueda commented 2 years ago

@seiryoku-zenyo and @AlanRockefeller, can you give it another try? MO changed the default response size for their API calls so the preview tool was trying to show 100 observations and all their photos in the preview, which was taking a long time and timing out, hence the errors. Now it should only show 10 in the preview.

seiryoku-zenyo commented 2 years ago

Hi, it seems to preview well now. But it seems it's about to import observations already previously imported, possibly making duplicates(?), at least this has happened in the past.. Should I submit and confirm it will duplicate previous records?

kueda commented 2 years ago

It should not import duplicates... as long as you haven't changed or removed the "Mushroom Observer URL" observation field. The preview just shows you that the mapping from MO attributes to iNat attributes is working.
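The duplicate check described above can be sketched roughly as follows: observations whose MO url already appears in a "Mushroom Observer URL" observation field are skipped. The data shapes here (hashes with `:id` and `:url` keys, a plain array of existing field values) and the example urls are assumptions for illustration, not the importer's actual code.

```ruby
require "set"

# Hypothetical data shapes: each MO observation is a hash with a :url key, and
# existing_urls holds the "Mushroom Observer URL" field values already on iNat.
def new_observations(mo_observations, existing_urls)
  seen = existing_urls.to_set
  # Keep only observations whose MO url has not been imported before.
  mo_observations.reject { |obs| seen.include?(obs[:url]) }
end

existing = ["https://mushroomobserver.org/1001"]
candidates = [
  { id: 1001, url: "https://mushroomobserver.org/1001" },
  { id: 1002, url: "https://mushroomobserver.org/1002" }
]

to_import = new_observations(candidates, existing)
puts to_import.map { |obs| obs[:id] }.inspect
```

This also shows why editing or removing the field value would defeat the check: the url would no longer match, and the observation would look new.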

seiryoku-zenyo commented 2 years ago

yes, thanks a lot, the tool works fine now. There is some other issue though, could you check your message box in iNat?

Cheers

AlanRockefeller commented 2 years ago

I ran the import tool and it imported about 20 observations, but there's a couple hundred more that it didn't get to.

I got the following error message:

"Your Mushroom Observer import is finished. You can see the results at http://www.inaturalist.org/observations/alan_rockefeller

There was also this one big error that caused the whole process to stop:

NameError :: undefined local variable or method `mo_url' for # Did you mean? mo_user_id"

I ran the import tool again and got the same error message, and it didn't import any additional observations.

kueda commented 2 years ago

@AlanRockefeller, found another bug, ironically one that only halts the entire process when it fails to import a single observation. I think it should be fixed if you want to give it another go.
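For readers wondering what the intended behavior looks like, here is a minimal sketch of a per-observation rescue, where one failed record is logged and the rest of the run continues. `import_all` and the block passed to it are hypothetical stand-ins, not the actual fix that was applied.

```ruby
# Hypothetical sketch: the block stands in for whatever imports a single observation.
def import_all(observations)
  imported = []
  errors = {}
  observations.each do |obs|
    begin
      imported << yield(obs)
    rescue StandardError => e
      # Record the failure and keep going instead of aborting the whole run.
      errors[obs[:id]] = "#{e.class}: #{e.message}"
    end
  end
  [imported, errors]
end

observations = [{ id: 1 }, { id: 2 }, { id: 3 }]
imported, errors = import_all(observations) do |obs|
  raise ArgumentError, "bad record" if obs[:id] == 2
  obs[:id]
end

puts imported.inspect
puts errors.inspect
```

The important detail is that the error-handling path itself must not raise, since an exception there stops everything, which is what the `mo_url` NameError did.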

AlanRockefeller commented 2 years ago

It works great now, thanks for fixing that!

jwidness commented 2 years ago

Report from the forum of "Error: API key is not valid".

kueda commented 2 years ago

Looks like Mushroom Observer changed their API and now our tool isn't working again. Not much we can do on our end without some additional information from them, since I'm not finding any documentation of this change. In brief, we were doing something like this to verify that your API key works and to get your MO user ID so we can retrieve your observations:

curl --user-agent iNaturalist \
  -XGET \
  "https://mushroomobserver.org/api2/api_keys.xml?api_key=YOUR_API_KEY"  

That used to return info about the user who owns that key, but now it returns

<error id="1">
  <code>API2::NoMethodForAction</code>
  <details>Invalid request method "GET" for "api_keys".</details>
  <fatal>true</fatal>
</error>

If someone familiar with the MO API wants to suggest another way to get the user ID associated with an API key, let me know.
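Since an earlier comment raised the idea of parsing the response body to give users a better error, here is a small sketch of pulling the code and details out of an error body shaped like the one above. The helper name `mo_error_message` is hypothetical, and the sketch assumes every error response follows this same `<error>` structure, which the thread does not confirm.

```ruby
require "rexml/document"

# Hypothetical helper: extracts a readable message from an MO API error body.
# Assumes errors look like the <error> element quoted in this thread.
def mo_error_message(xml)
  doc = REXML::Document.new(xml)
  error = REXML::XPath.first(doc, "//error")
  return nil unless error

  code = error.elements["code"]&.text
  details = error.elements["details"]&.text
  [code, details].compact.join(": ")
end

body = <<~XML
  <error id="1">
    <code>API2::NoMethodForAction</code>
    <details>Invalid request method "GET" for "api_keys".</details>
    <fatal>true</fatal>
  </error>
XML

puts mo_error_message(body)
```

A response with no `<error>` element returns `nil`, so the caller can tell a failure apart from a normal response.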

joshuamcginnis commented 2 years ago

I believe this api call is used to obtain the user's MO user_id. The solution would be to just ask for the user's MO user_id in addition to their api key, and then you could skip this call altogether. It's probably trivial to get MO to add user_id to the existing api keys page.