movebank / movebank-api-doc

Description of download interface to build calls to the Movebank database using HTTP/CSV or JSON/JavaScript requests

When is deployment information included in downloading event data? #11

Open bart1 opened 2 years ago

bart1 commented 2 years ago

When downloading from the API, it's not always clear when deployment information is relevant. By default, if you download event data, the deployment_id is not included (e.g. https://github.com/movebank/movebank-api-doc/blob/master/movebank-api.md#get-event-data-from-a-study). The 'tag_id' and 'individual_id' alone will not conclusively identify a deployment (e.g. a tag might be deployed twice on one individual). Is deployment_id reported in those cases? Or should I always explicitly request it? It might be an idea to include it by default.

benscarlson commented 2 years ago

In my experience, when downloading from the API you will always need deployment_id. In general, the event data is quite messy and you need to do a lot of filtering to extract the valid data. In many cases, you will have a large number (sometimes millions!) of events that are associated with the project but are not actually associated with any individual. My understanding is that this occurs because base stations sometimes grab data from any available tag, even one on an unknown individual not associated with the project. One way to filter out these events is to examine the deployment_id: these records will have deployment_id = null. You also want to filter on the dates in the deployment table, since the API will also send events that fall outside the deployment window.
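The two filters described above (drop events with a null deployment_id, then drop events outside the deployment window) can be sketched roughly as follows. This is only an illustration, not the logic of the linked cleaning script: the function name and the dictionary keys `deploy_on` and `deploy_off` are hypothetical, and events and deployments are assumed to be already parsed into Python dicts with `datetime` timestamps.

```python
from datetime import datetime


def filter_deployed_events(events, deployments):
    """Keep only events that belong to a deployment and fall inside its window.

    Assumed (hypothetical) record shapes:
      event:      {"deployment_id": int or None, "timestamp": datetime, ...}
      deployment: {"deployment_id": int, "deploy_on": datetime or None,
                   "deploy_off": datetime or None}
    A missing deploy_on/deploy_off is treated as an open-ended window.
    """
    windows = {
        d["deployment_id"]: (d.get("deploy_on"), d.get("deploy_off"))
        for d in deployments
    }
    kept = []
    for ev in events:
        dep_id = ev.get("deployment_id")
        # Undeployed events carry no deployment id.
        if dep_id is None or dep_id not in windows:
            continue
        start, end = windows[dep_id]
        ts = ev["timestamp"]
        # Drop events recorded before the deployment started or after it ended.
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        kept.append(ev)
    return kept


# Small made-up example: only the second event survives both filters.
deployments = [
    {"deployment_id": 1,
     "deploy_on": datetime(2020, 1, 1),
     "deploy_off": datetime(2020, 2, 1)},
]
events = [
    {"deployment_id": None, "timestamp": datetime(2020, 1, 10)},
    {"deployment_id": 1, "timestamp": datetime(2020, 1, 15)},
    {"deployment_id": 1, "timestamp": datetime(2020, 3, 1)},
]
valid = filter_deployed_events(events, deployments)
```

Whether the window bounds should be inclusive, and how to treat deployments with no end date, are choices to check against your own study.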

There are lots of other quirks associated with the API data. I have a script that cleans the data and addresses all of the issues that I'm aware of here: https://github.com/benscarlson/mosey_db/blob/master/db/clean_study.r

I'm also trying to document the issues in the format of a guide, but haven't had a lot of time to update that. If you are interested, you can see what I have here. Mostly this goes over another tricky issue regarding timestamps and pseudo-duplicates. https://github.com/benscarlson/mosey_get/blob/master/guide_api_data.md

After I do all of this filtering, the total number of events remaining is nearly identical to the number reported for "Number of deployed locations (GPS)" on the Movebank Study Details page. This means that somewhere Movebank is doing the same sort of filtering. I've always wondered what that script looks like.

sarahcd commented 2 years ago

Currently the deployment id (local or internal) will never be reported unless you request it. Where available the deployment_local_identifier is preferred as it will match the user's naming. There are lots of reasons why studies contain undeployed events. Keep in mind many researchers are sending data to Movebank as they are being collected, rather than a pre-cleaned dataset: they may send data while testing equipment prior to deployment, and tags can collect and transmit data after a tag has fallen off or an animal has died. A crucial job of data owners is to define when tag data are associated with an animal. Data not associated with an animal should have no deployment id, no animal id, and no associated taxon. The minimal event data request that I recommend is here: https://github.com/movebank/movebank-api-doc/blob/master/movebank-api.md#get-event-data-with-select-additional-event-level-attributes
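Since the deployment id is only returned when requested, a request has to name it explicitly. The sketch below only builds such a request URL and does not send it. The endpoint path and the `entity_type`, `study_id`, and `attributes` parameter names are assumptions to verify against the linked API documentation; the helper name, the study id, and the particular attribute list are placeholders, not the recommended minimal request.

```python
from urllib.parse import urlencode

# Assumed base endpoint; confirm against the Movebank API documentation.
BASE_URL = "https://www.movebank.org/movebank/service/direct-read"


def build_event_url(study_id, attributes):
    """Build an event-data URL that lists the wanted attributes explicitly."""
    params = {
        "entity_type": "event",
        "study_id": study_id,
        "attributes": ",".join(attributes),
    }
    # safe="," keeps the comma-separated attribute list readable.
    return BASE_URL + "?" + urlencode(params, safe=",")


# 123 is a placeholder study id, not a real study.
url = build_event_url(
    123,
    ["timestamp", "tag_id", "individual_id", "deployment_id",
     "deployment_local_identifier"],
)
```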