Open bart1 opened 2 years ago
In my experience, when downloading from the API you will always need deployment_id. In general, the event data is quite messy and you need to do a lot of filtering to extract the valid data. In many cases, you will have a large number (sometimes millions!) of events that are associated with the project but not with any individual. My understanding is that this occurs because base stations sometimes grab data from any available tag, even one on an unknown individual that is not part of the project. One way to filter out these events is to examine the deployment_id: these records will have deployment_id = null. You also want to filter on the dates in the deployment table, since the API will also send events that fall outside the deployment window.
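The two filters described above (drop events with a null deployment_id, then drop events outside the deployment window) can be sketched roughly as follows. This is only an illustration in plain Python, not the logic of the cleaning script linked in this thread. It assumes events and deployments have already been loaded as lists of dicts, and the field names `deploy_on` and `deploy_off` are hypothetical placeholders for whatever start and end columns your deployment table uses.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-like timestamp string; return None for missing values."""
    return datetime.fromisoformat(value) if value else None


def filter_deployed_events(events, deployments):
    """Keep only events tied to a deployment and inside its time window.

    `deploy_on` / `deploy_off` are hypothetical field names; a missing
    bound is treated as open-ended.
    """
    windows = {
        d["deployment_id"]: (parse_ts(d.get("deploy_on")), parse_ts(d.get("deploy_off")))
        for d in deployments
    }
    kept = []
    for ev in events:
        dep_id = ev.get("deployment_id")
        # Undeployed events carry no deployment id (null or empty).
        if dep_id in (None, "") or dep_id not in windows:
            continue
        start, end = windows[dep_id]
        ts = parse_ts(ev["timestamp"])
        # Drop events recorded before or after the deployment window.
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        kept.append(ev)
    return kept


deployments = [
    {"deployment_id": 1, "deploy_on": "2020-01-10 00:00:00", "deploy_off": "2020-02-10 00:00:00"},
]
events = [
    {"event_id": "a", "deployment_id": None, "timestamp": "2020-01-15 12:00:00"},
    {"event_id": "b", "deployment_id": 1, "timestamp": "2020-01-05 12:00:00"},
    {"event_id": "c", "deployment_id": 1, "timestamp": "2020-01-15 12:00:00"},
    {"event_id": "d", "deployment_id": 1, "timestamp": "2020-03-01 12:00:00"},
]
valid = filter_deployed_events(events, deployments)
```

Here only event `c` survives: `a` has no deployment, and `b` and `d` fall outside the window. Treating a missing `deploy_off` as open-ended is a design choice of this sketch, meant for tags that are still deployed.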
There are lots of other quirks associated with the API data. I have a script that cleans the data and addresses all of the issues that I'm aware of here: https://github.com/benscarlson/mosey_db/blob/master/db/clean_study.r
I'm also trying to document the issues in the format of a guide, but haven't had a lot of time to update that. If you are interested, you can see what I have here. Mostly this goes over another tricky issue regarding timestamps and pseudo-duplicates. https://github.com/benscarlson/mosey_get/blob/master/guide_api_data.md
After I do all of this filtering, the total number of events remaining is nearly identical to the number reported for "Number of deployed locations (GPS)" on the Movebank Study Details page. This means that somewhere Movebank is doing the same sort of filtering. I've always wondered what that script looked like.
Currently the deployment id (local or internal) will never be reported unless you request it. Where available the deployment_local_identifier is preferred as it will match the user's naming. There are lots of reasons why studies contain undeployed events. Keep in mind many researchers are sending data to Movebank as they are being collected, rather than a pre-cleaned dataset: they may send data while testing equipment prior to deployment, and tags can collect and transmit data after a tag has fallen off or an animal has died. A crucial job of data owners is to define when tag data are associated with an animal. Data not associated with an animal should have no deployment id, no animal id, and no associated taxon. The minimal event data request that I recommend is here: https://github.com/movebank/movebank-api-doc/blob/master/movebank-api.md#get-event-data-with-select-additional-event-level-attributes
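Since deployment identifiers are only returned when requested, a request has to name them explicitly. The sketch below builds such a request URL in Python. The endpoint and the `entity_type`, `study_id`, and `attributes` query parameters are assumptions that should be checked against the linked API documentation, the attribute list is illustrative rather than the recommended minimal set, and the study id is a made-up placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Movebank API documentation.
BASE_URL = "https://www.movebank.org/movebank/service/direct-read"


def build_event_url(study_id, attributes):
    """Build an event-data request URL that names each attribute explicitly."""
    query = {
        "entity_type": "event",
        "study_id": study_id,
        "attributes": ",".join(attributes),
    }
    # Keep commas unescaped so the attribute list stays readable.
    return BASE_URL + "?" + urlencode(query, safe=",")


# Illustrative attribute list that includes the deployment identifiers.
url = build_event_url(
    123456,  # hypothetical study id
    ["timestamp", "location_long", "location_lat",
     "individual_id", "tag_id", "deployment_id",
     "deployment_local_identifier"],
)
print(url)
```

The sketch only constructs the URL. Authentication and license acceptance, where a study requires them, are left out.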
When downloading from the API it's not always clear when deployment information is relevant. By default, if you download `event` data the `deployment_id` is not included (e.g. https://github.com/movebank/movebank-api-doc/blob/master/movebank-api.md#get-event-data-from-a-study). The 'tag_id' and 'individual_id' will not conclusively give information about deployments (e.g. a tag might be deployed twice on one individual). Is `deployment_id` reported in those cases? Or should I always explicitly request it? It might be an idea to include it by default.