tdwg / camtrap-dp

Camera Trap Data Package (Camtrap DP)
https://camtrap-dp.tdwg.org
MIT License

Is there a usecase for the _id columns? #179

Closed (peterdesmet closed this issue 1 year ago)

peterdesmet commented 3 years ago

@kbubnicki do you have a use case for the _id columns in the csv files? I think that if internal system identifiers are important enough to be included (often very useful for reference), then these should be (part of) the primary keys (deploymentID, mediaID, observationID).

Especially now that all columns need to be included, it becomes a bit confusing what to populate _id with if the internal identifiers are already in the primary keys: repeat them in _id or leave that field empty? I'd rather remove that ambiguity and not have _id, unless there is a strong use case for them.

Note that I do use the project._id in the metadata. I would actually prefer if that one got bumped up to id rather than _id.

kbubnicki commented 3 years ago

@kbubnicki do you have a use case for the _id columns in the csv files? I think that if internal system identifiers are important enough to be included (often very useful for reference), then these should be (part of) the primary keys (deploymentID, mediaID, observationID).

I agree with you that these internal system identifiers are important to include, and that's why I initially proposed these attributes. Please note that e.g. deploymentID can differ from _id, as it can be more verbose or more human-readable (e.g. "dep2-loc1"), whereas _id is always supposed to be a (numeric) primary key from the internal database (e.g. 3456792) of the system that produced the data package. So the use case is to support both types of information when available.
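As a rough illustration of the two kinds of identifier described above, the sketch below parses a small, made-up deployments table in which deploymentID is human-readable and _id carries the numeric database key. The second row and the helper name `split_identifiers` are hypothetical and are not taken from Trapper or the Camtrap DP specification.

```python
import csv
import io

# Hypothetical sample data: deploymentID is user-defined,
# _id is the primary key from the producing system's database.
SAMPLE = """deploymentID,_id
dep2-loc1,3456792
dep3-loc1,3456793
"""


def split_identifiers(text):
    """Map each public deploymentID to its internal _id (as an int)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["deploymentID"]: int(row["_id"]) for row in reader}


mapping = split_identifiers(SAMPLE)
```

With both columns present, a provider can go from the published identifier back to the database row, which is the flexibility being argued for here.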

peterdesmet commented 3 years ago

  1. I think that supporting both types of information can be confusing for creators of Camtrap Data Packages, and that it is easier to recommend that e.g. deploymentID be, or at least include, the original identifier. How are you planning to do this for Trapper?
  2. Would you be fine upgrading project._id to project.id?

kbubnicki commented 3 years ago

  1. Ok, potentially this can be slightly confusing, but all _id columns are internal attributes and they are not required. I would propose to keep this flexibility so that both cases described before can be supported. In Trapper, deploymentID is always user-defined, whereas _id is just a db primary key.
  2. To be consistent with the other _ids I would not change it :)

peterdesmet commented 3 years ago

  1. 👍 I guess I won't populate it in the example then, since it's the same as the deploymentID anyway.
  2. I would rather make it consistent with package id or resource id. It is a property I refer to quite often and, in contrast with _id in the csv, it is not an alternative for something else. It is currently also the only underscore property left in the metadata (since we bumped up platform to a proper property).

kbubnicki commented 3 years ago

  1. Ok, fair enough.

peterdesmet commented 2 years ago

Reopening this issue, because it has downstream consequences and might still be confusing to end users. If the use case is to support:

In Trapper, deploymentID is always user-defined where _id is just a db primary key.

Then maybe we should support that use case specifically, rather than having private properties for every data file.

kbubnicki commented 1 year ago

I see that this issue has been already solved on GBIF side: https://github.com/gbif/ipt/commit/527d32f70121f2d20d0988388a65c54c6b48448b

Then I would prefer to leave it, as it makes it easier to track original objects in a data package provider's database.

peterdesmet commented 1 year ago

Out of curiosity: for what objects (dep, media, obs) would you populate deploymentID, mediaID or observationID with something different than the original id in the provider's database?

kbubnicki commented 1 year ago

Good question ;) Actually, in our case, for all dep, media and obs we populate different ids than those in the db. But this could easily be changed; the point is rather that you do not know whether deploymentID, mediaID or observationID in a package ARE db ids. They can be db ids, but they may not be; we have no such requirement. With _id you can be sure that the values provided are db ids. The other question is how useful this information is for end users (humans or other platforms), but that's what I was trying to defend above.

peterdesmet commented 1 year ago

Well, I've been working in biodiversity informatics for many years and it's the first time I have seen this request come up 😊. Which is why I'm hesitant to add it, because these extra fields are not without consequences: they raise questions like "should I populate this?", "should I populate it with something different?", "should I come up with my own deploymentIDs then?", etc. I'd rather avoid all that and advise people to use internal identifiers as stable identifiers in deploymentID, mediaID and observationID instead, especially since those identifiers are used for relationships between files.
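To make the point about relationships between files concrete, here is a minimal sketch of a referential check: every deploymentID used in the media file should also exist in the deployments file. The rows and the function name `missing_references` are hypothetical; a real package would read them from its CSV files.

```python
def missing_references(parent_rows, child_rows, key):
    """Return the sorted key values used by child_rows but absent from parent_rows."""
    known = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in known})


# Hypothetical rows standing in for the deployments and media files.
deployments = [{"deploymentID": "dep1"}, {"deploymentID": "dep2"}]
media = [
    {"mediaID": "m1", "deploymentID": "dep1"},
    {"mediaID": "m2", "deploymentID": "dep9"},  # points at a deployment that does not exist
]

missing = missing_references(deployments, media, "deploymentID")
```

A check like this only works if the identifier in deploymentID is stable, which is the reason for keeping a single identifier per concept.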

Would be interested to hear what others think @tucotuco @ben-norton @yliefting

tucotuco commented 1 year ago

While I am not vested in any particular outcome, experience makes me wonder at having two identifiers unless they have distinct practical purposes. What value does an internal db identifier have in a shared data scenario if it is not required and the other identifier is? The local db-interested party could always figure out the local id from the shared required one.

peterdesmet commented 1 year ago

Discussed with @kbubnicki, he is fine with removing the _id columns.

ben-norton commented 1 year ago

With regard to identifiers, I use auto-incremental integers for primary and foreign keys. The numeric ids are only unique within the context of a single table in my backend relational database. Therefore, I typically avoid publishing them. Instead, I associate a GUID with each numeric key, then publish the GUIDs. I eventually need to simplify this by using GUIDs as primary and foreign keys, but that's beside the point. On average, humans can store 7 +/- 2 digits in working memory. GUIDs contain 36 characters. Numeric ids are often shorter than 7 digits. Back to your question: I would publish GUIDs instead of the numeric ids produced by my database. I'm not sure how common that is.
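One possible way to pair a GUID with a table-local numeric key is sketched below. It derives a name-based UUID from the table name and the key, so the same row always gets the same GUID. This is an assumption for illustration, not necessarily how the database described above does it; the namespace and the function name `guid_for` are hypothetical.

```python
import uuid

# Hypothetical namespace for illustration; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def guid_for(table, numeric_id):
    """Derive a stable GUID from a table name and a table-local numeric key."""
    return str(uuid.uuid5(NAMESPACE, f"{table}:{numeric_id}"))


# The same numeric key in two tables yields two different GUIDs.
dep_guid = guid_for("deployments", 42)
media_guid = guid_for("media", 42)
```

Including the table name in the input addresses the problem that the numeric ids are only unique within one table.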

peterdesmet commented 1 year ago

@ben-norton thanks! Personally, I don't think there is any issue with numerical identifiers, as long as they are stable and unique within the dataset. Global uniqueness can be achieved by data aggregators, by combining the identifier with e.g. the dataset identifier. I also like that your numerical identifiers make it possible to find the record in the source dataset (very useful if you get feedback on a record). So, in my opinion, it is not necessary for you to generate GUIDs, but you are of course welcome to do so.

Numerical vs GUIDs has no influence on Camtrap DP: identifiers can be any (unique) string. And we dropped internal _id to simplify things and nudge people to only use one identifier to refer to a concept.
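A small sketch of the two ideas above: identifiers are compared as strings and only need to be unique within the dataset, and an aggregator can make them globally unique by prefixing the dataset identifier. The function names, the ":" separator and the dataset identifiers are hypothetical choices, not part of Camtrap DP.

```python
def has_unique_ids(ids):
    """Check uniqueness after converting every identifier to a string."""
    as_strings = [str(i) for i in ids]
    return len(as_strings) == len(set(as_strings))


def global_id(dataset_id, local_id):
    """Combine a dataset identifier with a dataset-local identifier."""
    return f"{dataset_id}:{local_id}"


# The same local identifier in two datasets no longer collides.
a = global_id("dataset-a", 1)
b = global_id("dataset-b", 1)
```

Because everything is treated as a string, the numeric key 1 and the string "1" count as the same identifier within a dataset.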