Closed: peterdesmet closed this issue 1 year ago
@kbubnicki do you have a use case for the `_id` columns in the csv files? I think that if internal system identifiers are important enough to be included (often very useful for reference), then these should be (part of) the primary keys (`deploymentID`, `mediaID`, `observationID`).
I agree with you that these internal system identifiers are important enough to be included, and that's why I initially proposed these attributes. Please note that e.g. `deploymentID` can be different from `_id`, as it can be more verbose or more human-readable (e.g. "dep2-loc1"), whereas `_id` is always supposed to be a (numeric) primary key (e.g. 3456792) from the internal database of the system that produced the data package. So the use case is to support both types of information when available.
`deploymentID` to be or at least include the original identifier. How are you planning to do this for Trapper?
`project._id` to `project.id`?
`_id` columns are internal attributes and they are not required. I would propose keeping this flexibility, so there is an option to support both cases as described before. In Trapper, `deploymentID` is always user-defined, whereas `_id` is just a db primary key.
`_id`s I would not change it :)
`deploymentID` anyway.
`id` or resource `id`. It is a property I refer to quite often and, in contrast with `_id` in the csv, it is not an alternative for something else. It is currently also the only underscore property left in the metadata (since we bumped up `platform` to a proper property).

Reopening this issue, because it has downstream consequences and might still be confusing to end users. If the use case is to support:

> In Trapper, `deploymentID` is always user-defined, whereas `_id` is just a db primary key.

Then maybe we should support that use case specifically, rather than having private properties for every data file.
I see that this issue has already been solved on the GBIF side: https://github.com/gbif/ipt/commit/527d32f70121f2d20d0988388a65c54c6b48448b
In that case I would prefer to leave it, as it makes it easier to track original objects in a data package provider's database.
Out of curiosity: for what objects (dep, media, obs) would you populate deploymentID, mediaID or observationID with something different than the original id in the provider's database?
Good question ;) Actually, in our case we populate ids different from those in the db for all dep, media and obs. But this could easily be changed. The point is rather that you do not know whether deploymentID, mediaID or observationID in a package ARE db ids. They can be db ids but they may not be; we do not have such a requirement. With `_id`, on the other hand, you can be sure that the values provided are db ids. The other question is how useful this information is for end users (humans or other platforms), but that's what I was trying to defend above.
Well, I've been working in biodiversity informatics for many years and it's the first time I see this request come up. Which is why I'm hesitant to add it, because these extra fields are not without consequences: they raise questions like "should I populate this?", "should I populate it with something different?", "should I come up with my own deploymentIDs then?", etc. I'd rather avoid all that and advise people to use internal identifiers as stable identifiers in deploymentID, mediaID and observationID instead, especially since those identifiers are used for relationships between files.
Would be interested to hear what others think @tucotuco @ben-norton @yliefting
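To illustrate the point above that these identifiers carry the relationships between files, here is a minimal sketch of a referential check. The column names `deploymentID` and `mediaID` come from this thread; the in-memory CSV content, the sample values and the helper names `read_rows` and `orphan_media` are assumptions made up for the example, not part of Camtrap DP or any tool.

```python
import csv
import io

# Hypothetical file contents; only the column names are taken from the discussion.
DEPLOYMENTS_CSV = "deploymentID\ndep1-loc1\ndep2-loc1\n"
MEDIA_CSV = "mediaID,deploymentID\nm1,dep1-loc1\nm2,dep2-loc1\nm3,dep9-loc9\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def orphan_media(deployments, media):
    """Return the mediaID of every media row whose deploymentID is not
    present in the deployments table."""
    known = {row["deploymentID"] for row in deployments}
    return [row["mediaID"] for row in media if row["deploymentID"] not in known]


orphans = orphan_media(read_rows(DEPLOYMENTS_CSV), read_rows(MEDIA_CSV))
print(orphans)  # ['m3']
```

The check only works if `deploymentID` holds the same stable value in both files, which is the argument for using one identifier per concept.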
While I am not vested in any particular outcome, experience makes me wonder at having two identifiers unless they have distinct practical purposes. What value does an internal db identifier have in a shared data scenario if it is not required and the other identifier is? The local db-interested party could always figure out the local id from the shared required one.
Discussed with @kbubnicki, he is fine with removing the `_id` columns.
Regarding identifiers, I use auto-incrementing integers for primary and foreign keys. The numeric ids are only unique within the context of a single table in my backend relational database. Therefore, I typically avoid publishing them. Instead, I associate a GUID with each numeric key, then publish the GUIDs. I eventually need to simplify this by using GUIDs as primary and foreign keys, but that's beside the point. On average, humans can store 7 +/- 2 digits in working memory. GUIDs contain 36 characters. Numeric ids are often less than 7 digits. Back to your question: I would publish GUIDs instead of the numeric ids produced by my database. I'm not sure how common that is.
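As a rough sketch of the approach described above (one GUID associated with each table-local numeric key), something like the following could work. The class name `GuidRegistry`, the method `guid_for` and the table names are hypothetical; a real setup would persist the mapping in the database rather than in memory.

```python
import uuid


class GuidRegistry:
    """Keeps one stable GUID per (table, numeric id) pair.

    In-memory only for illustration; a real system would store the
    mapping so that published GUIDs survive restarts.
    """

    def __init__(self):
        self._by_key = {}

    def guid_for(self, table, numeric_id):
        # Numeric ids are only unique within a table, so the table name
        # is part of the lookup key.
        key = (table, numeric_id)
        if key not in self._by_key:
            self._by_key[key] = str(uuid.uuid4())
        return self._by_key[key]


registry = GuidRegistry()
dep_guid = registry.guid_for("deployments", 1)
media_guid = registry.guid_for("media", 1)
print(dep_guid != media_guid)  # True: same numeric id, different tables
print(len(dep_guid))           # 36
```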
@ben-norton thanks! Personally, I don't think there is any issue with numerical identifiers, as long as they are stable and unique within the dataset. Global uniqueness can be achieved by data aggregators, by combining the identifier with e.g. the dataset identifier. I also like that your numerical identifiers make it possible to find the record in the source dataset (very useful if you get feedback on a record). So, in my opinion, it is not necessary for you to generate GUIDs, but you are of course welcome to do so.
Numerical vs GUIDs has no influence on Camtrap DP: identifiers can be any (unique) string. And we dropped the internal `_id` to simplify things and nudge people to only use one identifier to refer to a concept.
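The aggregator-side approach mentioned above (combining a record identifier with the dataset identifier) can be sketched as follows. The function name `global_id`, the `:` separator and the dataset names are assumptions for illustration only; the thread does not say how any particular aggregator builds such keys.

```python
def global_id(dataset_id, local_id):
    """Combine a dataset identifier with an identifier that is only
    unique within that dataset (hypothetical scheme, ':' separator)."""
    return f"{dataset_id}:{local_id}"


# Two hypothetical datasets that happen to reuse the same numeric id.
records_a = [3456792, 3456793]
records_b = [3456792]

combined = {global_id("dataset-a", r) for r in records_a} | {
    global_id("dataset-b", r) for r in records_b
}
print(len(combined))  # 3: no collision once the dataset id is included
```

Since identifiers can be any unique string, the numeric value is treated as text here and stays recoverable from the combined key.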
Especially now that all columns need to be included, it becomes a bit confusing what to populate `_id` with if the internal identifiers are already in the primary keys: repeat them in `_id` or leave that field empty? I'd rather remove that ambiguity and not have `_id`, unless there is a strong use case for them.

Note that I do use the `project._id` in the metadata. I would actually prefer if that one got bumped up to `id` rather than `_id`.