inaturalist / iNaturalistAPI

Node.js API for iNaturalist.org
https://api.inaturalist.org/

Create observation with obs field values in one API call #199

Closed tomsaleeba closed 4 years ago

tomsaleeba commented 4 years ago

This is a feature request (or opening a discussion) not a bug.

The summary

For observations a user creates with our web app, we need to create 56 observation field values, which means 56 separate requests to the API (plus some others for photos, etc). The API provides all the endpoints we need but it's very granular so we're looking for an endpoint that has a higher abstraction to match our normal business logic.

60+ HTTP requests for each observation is a lot. This issue is to discuss ways to reduce that number.

More detail

Our web app, https://app.wildorchidwatch.org/, which uploads observations directly to iNat, recently went live :tada: yay! Feel free to check it out. You could also use the beta environment (https://beta.app.wildorchidwatch.org/), which points to our own copy of iNat, and feel free to upload any rubbish (photos of your feet, etc) to that.

Our users are finding that it's taking a long time for observations to upload. The cause of this is the fact that we're using quite a few observation field values. We have a simple mode that only creates 3 observation field values and a detailed mode that creates 56 field values. Yes, 56. I'm not sure if you ever intended observations to have that many obs fields but we need to capture all the extra data for our citizen science project.

Most of the upload time for an observation is spent on the overhead of HTTP requests. That in itself isn't too bad as we can still do all the uploads in about 2 minutes. The real problem is that we're doing this from phones. Users don't leave their screen on to watch the upload; they turn the screen off and put the phone in their pocket. We're using background sync in service workers to keep uploading in the background but our thread still gets killed off by the browser before we can make all the requests. This is presumably due to battery saving.

There's also the API etiquette that we don't want to run afoul of as we are making a lot of requests. Or at least, each user of our app is making 60+ API calls (including the observation itself, photos and linking to the project) per observation they make.

The solution is to do fewer requests. The ideal situation is we only do one request for the observation, all photos, all obs fields and the project link. That's a pretty big change though so even if we could achieve creating all the obs fields in a single request, that would see huge benefits.
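To make the request arithmetic concrete, here is a rough Python sketch of the three options described above. The strategy names are made up for illustration, and the photo count of 3 in the usage lines is an assumption (the thread only says "photos"); the 56 field values, the single observation request, and the single project link come from the description above.

```python
def requests_per_observation(field_count: int, photo_count: int, strategy: str) -> int:
    """Count the HTTP requests needed to upload one observation.

    Strategy names are hypothetical labels for the options discussed above:
      - "per_field":   one request per obs field value (the current situation)
      - "bulk_fields": all obs field values sent in a single request
      - "all_in_one":  observation, photos, fields and project link in one request
    """
    if strategy == "all_in_one":
        return 1
    # One request for the observation, one per photo, one for the project link.
    base = 1 + photo_count + 1
    if strategy == "per_field":
        return base + field_count
    if strategy == "bulk_fields":
        return base + 1
    raise ValueError(f"unknown strategy: {strategy}")


# Assumed example: detailed mode (56 fields) with 3 photos.
current = requests_per_observation(56, 3, "per_field")
bulk = requests_per_observation(56, 3, "bulk_fields")
ideal = requests_per_observation(56, 3, "all_in_one")
print(current, bulk, ideal)
```

Under that assumption, sending the field values in bulk removes 55 of the requests even if photos and the project link stay separate.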

I can see there's a V2 of the API coming. As part of that work, it looks like creating an obs with obs fields in one call might be coming: https://github.com/inaturalist/iNaturalistAPI/blob/20fb291/openapi/schema/request/observations_create.js#L4. If I'm guessing right here, this is what we're looking for :heart_eyes: Would it be possible to include linking with a project in here too?

Uploading photos as part of the same HTTP request is a nice-to-have lower priority. I can see that the Rails app uses a workflow where photos are uploaded before the obs to inaturalist.org/photos and then the photos are referenced as local_photos in the obs upload. This stuff here: https://github.com/inaturalist/inaturalist/blob/59c6658/app/controllers/observations_controller.rb#L599. As far as I can tell, this still requires separate HTTP requests (and it's not available via the NodeJS API). So it's not the solution we're looking for.

One solution I'd thought of is to create a bulk obs field values endpoint on the Node API. It would accept an array of obs field values and process them all. The naive approach is to just delegate to the existing, singular obs field value endpoint, which works but isn't very atomic as failures mean some values are still uploaded. The clobbering-on-POST nature of obs field values means this isn't really that big of an issue. However, we could do better by validating all field values first, then doing the create. I feel that this endpoint would still be valuable even if the V2 API allows creating obs field values at the same time as the obs.
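As a rough illustration of the "validate everything first, then create" idea, here is a minimal Python sketch. It is not the Node API's code: `validate_field_value`, `bulk_create`, and the `create_one` callback are hypothetical names, and the validation rules are placeholders for whatever checks the real endpoint would apply.

```python
from typing import Any, Callable

FieldValue = dict[str, Any]


def validate_field_value(fv: FieldValue) -> list[str]:
    """Return a list of problems with one obs field value (placeholder rules)."""
    errors = []
    if not isinstance(fv.get("observation_field_id"), int):
        errors.append("observation_field_id must be an integer")
    if fv.get("value") in (None, ""):
        errors.append("value must not be blank")
    return errors


def bulk_create(
    field_values: list[FieldValue],
    create_one: Callable[[FieldValue], Any],
) -> dict[str, Any]:
    """Validate every value first; only call create_one if all of them pass."""
    problems = {}
    for index, fv in enumerate(field_values):
        errors = validate_field_value(fv)
        if errors:
            problems[index] = errors
    if problems:
        # Nothing has been created yet, so the caller can fix and retry.
        return {"created": [], "errors": problems}
    # create_one stands in for the existing singular obs field value endpoint.
    return {"created": [create_one(fv) for fv in field_values], "errors": {}}
```

This only avoids partial uploads caused by invalid input; a network failure halfway through the create loop would still leave some values uploaded, which is the same limitation as the naive approach.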

I'm prepared to do the work to get this feature, but I'll need guidance from the iNat team on how best to implement.

FYI @tokmakoff

tomsaleeba commented 4 years ago

So, that "naive approach" looks like this: https://github.com/ternandsparrow/iNaturalistAPI/compare/master...ternandsparrow:naive-bulk-ofvs

After getting my hands dirty with the code, I can appreciate that changes on the Ruby side will be required to do something more featured like validating all the values before actually performing the create actions. Unless we can use the pgClient to access the DB but I can't see any other write uses of the DB connection and it's not very D-R-Y.

kueda commented 4 years ago

Hey Tom. Ok, this raises a couple concerns, and as I write them out I realize they're not included in our docs and are more like unwritten opinions I'm assuming (perhaps without reason) most of the iNat team shares, so I guess I'll list them as reasons I don't think we should do this work, and ask that you not take them as official iNat gospel (yet, at least).

  1. iNat is not the place to store application state that is only relevant to your application. I think we provide write operations in our API so third parties can help their users engage with iNat, not to provide a data store for any application doing something with observation-like records. I don't think what your app is doing is totally out of line, but I don't think we're going to build functionality to support that kind of usage. If your app needs to store state that's only relevant to your app, I think you should be storing that data yourself.
  2. Observation fields were not conceived or implemented very well and we're unwilling to do more to support them. We thought they would provide flexibility for citizen science projects that wanted a bit more than the data model we were providing, and allow more bottom-up crowdsourcing of useful information beyond what we'd modeled. Instead, they've become a giant, confusing, uncontrolled mess where there are many cases in which 10+ fields have exactly the same meaning. IMO, they're a legacy burden we bear, not something we want to build off of.
  3. Creating multiple resources in a single request is a recipe for network and sync problems. This is actually how we used to create observations in our own apps: a single request would have the obs and the media. This led to a lot of problems in mobile devices with limited bandwidth where the request would time out b/c of the photo and the obs would never get created. So, supporting this kind of behavior is not something I think we should do in v2 of the API.

So, those are all reasons why I don't personally think we should do or integrate work toward the stated goal. However, with zero additional work, you can do this:

```
curl -X POST \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header 'Authorization: YOUR_JWT' \
  -d '{
    "observation": {
      "description": "multiple obs field value test",
      "observation_field_values_attributes": {
        "0": {
          "observation_field_id": 1,
          "value": 21
        },
        "1": {
          "observation_field_id": 2,
          "value": "foraging"
        }
      }
    }
  }' \
  https://api.inaturalist.org/v1/observations
```

The caveat is that as undocumented functionality this might be removed at any time without notice, so while it's doable, I would not advise building critical functionality around it.
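For a client that holds its field values as a list, the nested body in the curl example could be generated along these lines. This is a Python sketch only: `build_observation_payload` and `build_request` are hypothetical helper names, and the request is built but not sent. The payload shape, headers, and URL are copied from the curl example above, so the same caveat about undocumented functionality applies.

```python
import json
import urllib.request
from typing import Any


def build_observation_payload(
    description: str, field_values: list[tuple[int, Any]]
) -> dict[str, Any]:
    """Nest obs field values under the observation, keyed "0", "1", ... as in the curl example."""
    attributes = {
        str(index): {"observation_field_id": field_id, "value": value}
        for index, (field_id, value) in enumerate(field_values)
    }
    return {
        "observation": {
            "description": description,
            "observation_field_values_attributes": attributes,
        }
    }


def build_request(
    payload: dict[str, Any],
    jwt: str,
    url: str = "https://api.inaturalist.org/v1/observations",
) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl command."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": jwt,
        },
        method="POST",
    )


payload = build_observation_payload(
    "multiple obs field value test", [(1, 21), (2, "foraging")]
)
request = build_request(payload, "YOUR_JWT")
# Passing `request` to urllib.request.urlopen would perform the upload.
```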

tomsaleeba commented 4 years ago

> provide flexibility for citizen science projects that wanted a bit more than the data model we were providing

You've hit the nail on the head as to who we are and why we chose to use obs fields.

> they've become a giant, confusing, uncontrolled mess where there are many cases in which 10+ fields have exactly the same meaning

We found sifting through the existing obs fields to be a bit chaotic but then we are guilty of adding more duplication as we've created all our own fields. Sorry about that. We try to use the core iNat model as much as possible and only add obs fields where the core model doesn't cover our need.

It's a shame that obs fields are deprecated as gathering this extra data is the point of difference for our (and most?) citizen science projects. We want to create more than just a basic observation and ensure there's rigour in the data collection so the data is useful to scientists at the other end. Having useful data is the reason the data is collected.

I'd suggest that this extra data we're collecting is not specific to our app. It's not data only used by our app, like user preferences, but part of the observation and should be useful to anyone interested in that observation.

We considered storing the data in a separate system but that has some drawbacks:

I understand how difficult it is to make a system that generically handles anything, like the obs field feature. And I get that you're not building a platform for other projects to run on but as part of the community, we really like obs fields. We're not trying to run a completely separate project that uses iNat as a datastore (which is not something I'd consider anyway). Our goal is to provide something that reduces friction for creating rich orchid observations in iNat, that are useful to scientists. This basically boils down to implementing business logic for collecting coherent data (like only asking for the "host tree species" when the orchid is an epiphyte) and providing an orchid lens to things like species autocomplete.
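The "host tree species" rule is the kind of business logic meant here. A minimal Python sketch, assuming made-up field keys (`growth_form`, `host_tree_species`) rather than the app's real obs field names:

```python
def fields_to_ask(answers: dict[str, str]) -> list[str]:
    """Decide which questions to show, given the answers collected so far.

    Field keys are hypothetical; they are not the project's real obs fields.
    """
    fields = ["growth_form"]
    # Only ask for the host tree when the orchid grows on another plant.
    if answers.get("growth_form") == "epiphyte":
        fields.append("host_tree_species")
    return fields


print(fields_to_ask({"growth_form": "epiphyte"}))
print(fields_to_ask({"growth_form": "terrestrial"}))
```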

We wanted to use iNat because a lot of our userbase is already on the system and if we did not build on iNat, a lot of people would ask us why not. People want to use iNat, which is great! We hope that bringing more users and observations to iNat will only benefit everyone involved.

The obs fields we're collecting are picked by credentialed scientists based on published science. If there's an opportunity to get them more integrated into the iNat model, we'd be open to that. Perhaps something like annotations? But then as you mentioned here, that's something we'd need to run by the wider community.

The fact that obs fields are deprecated is a risk to our project. As a software engineer, I want to make something that lasts, and that includes not losing data, so relying on obs fields when they might disappear makes me uneasy. I'm not sure if you've thought this far ahead but if obs fields are removed, would they be removed in newer API versions but honoured in older versions, or completely removed everywhere? Also, would the data in obs fields be lost or locked as read-only?

By the way, that undocumented nesting of obs fields in the obs request is exactly what we need :heart_eyes: are you sure we can't use it :pleading_face: :pray:

kueda commented 4 years ago

Thanks for the insight into your project, Tom, definitely a useful perspective to keep in mind if/when we think about obs fields in the future. I don't think obs fields are at risk of being deleted, but, as you point out, we might choose to stop supporting obs field CRUD in future versions of the API, though given they are currently the only recourse for projects who want to store some extra pieces of data, I guess I have to grudgingly admit even that seems unlikely.

A few replies:

> the data is harder to access as you need to either go to two locations or always use the non-iNat system which can act as a facade for the data stored in iNat

IMO, using obs fields makes the data harder to access for a different reason: since the obs fields you use are not generic, a potential consumer must know they exist a priori in order to utilize them, unless they just happen to stumble upon a record from your project. For example, if we were to build a better export tool and a data consumer wanted to get flowering phenology data from obs fields, they might type "phenology" and get 37 results, some of which are your project's pheno fields, some of which are not, all of which seem to have different structures and semantics. If the consumer already knows about your project and they want data from it, getting it from a source that you control allows you to give them what they want without the distraction of all the other zany things that are on iNat.

> If there's an opportunity to get them more integrated into the iNat model, we'd be open to that. Perhaps something like annotations?

This is definitely the route we'd prefer for future metadata, but it comes with the very high cost of all standards: shared semantics, interoperable schemas, etc. And I don't think it's a great fit for the majority of your project's fields, e.g. WOW Rock cover size - Boulders 600 - 2000 mm, which seems very project-specific.

> By the way, that undocumented nesting of obs fields in the obs request is exactly what we need 😍 are you sure we can't use it 🥺 🙏

You can use it, but you have to use it at your own risk, b/c as an undocumented part of the API we make no commitment to maintaining it and it could disappear at any time without notice. I doubt we will remove it unless people start abusing it, but that's where it stands and I don't personally think we should make a commitment to support it for the reasons I've stated.

tomsaleeba commented 4 years ago

You make some good points re: how useful the data from our fields would be when taken as part of the whole system.

We appreciate you allowing us to continue using obs fields. This project is almost complete and in the process of hand-over to the client, so large changes aren't really on the table right now.

I've just deployed our usage of this everything-at-once API call to production on Friday and it's working well, so I'll close this issue.

Thanks for the help :+1: