tdwg / dwc-qa

Public question and answer site for discussions about Darwin Core
Apache License 2.0

dwc:datasetName use to group datasets, yes, no, options please? #199

Open debpaul opened 1 year ago

debpaul commented 1 year ago

Question about the use of the dwc:datasetName field.

Scenario: different groups, as part of a formal national initiative, will each observe the same taxonomic group in their respective region. Some of the observed taxa will be collected and vouchered in various collections across the nation.

General Puzzlement One: Each group publishes their own dataset to GBIF (observations and specimen records). How do these disparate datasets find each other? How can they be grouped after the publishing step?

Specific Puzzlement Two: What if each group gave their own dataset the same name? Example: dwc:datasetName = Our [taxonomic group] Data. Would this work? Say, for publishing to GBIF, does it matter if two datasets have the same datasetName?
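
To make Puzzlement Two concrete, here is a minimal sketch (all identifiers hypothetical) of grouping records on datasetName alone; any dataset that happens to reuse the same string lands in the same bucket, related or not:

```python
# Minimal sketch: grouping Darwin Core records by dwc:datasetName alone.
# All identifiers below are hypothetical.
from collections import defaultdict

records = [
    {"occurrenceID": "urn:uuid:0001", "datasetName": "Our [taxonomic group] Data",
     "datasetID": "https://doi.org/10.xxxx/group-a"},  # group A's dataset
    {"occurrenceID": "urn:uuid:0002", "datasetName": "Our [taxonomic group] Data",
     "datasetID": "https://doi.org/10.xxxx/group-b"},  # group B's dataset
]

by_name = defaultdict(list)
for rec in records:
    by_name[rec["datasetName"]].append(rec["occurrenceID"])

# Both records land in one bucket, whether or not they truly belong together:
print(dict(by_name))
# {'Our [taxonomic group] Data': ['urn:uuid:0001', 'urn:uuid:0002']}
```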

Last Puzzlement: Is there a better strategy? What are the options (standard terms? extensions?) for ensuring, or at least improving the chances, that these data can be aggregated and understood to be part of the same initiative? Would this be a use case for a term that conveys a funding number (a grant number)? That way, all datasets (and, for that matter, individual data records) could be gathered by that number?

Insights, discussion, and options welcome. I'm guessing others have already solved this, or at least grok the current options, as well as what's needed to make this a reasonable approach to a distributed national-level project.

ben-norton commented 1 year ago

This is a significant issue in camera trapping. Most of the major projects (e.g., eMammal, Snapshot USA, Wildlife Insights) are collaborations between multiple institutions. These are referred to as 'initiatives' since they are larger than any one 'organization' or 'institution'. Within those initiatives, providers may submit their own datasets as part of the effort.

Scenario 1: A researcher publishes their camera trap dataset to GBIF. They manage their data in Wildlife Insights. Following the iNaturalist model, Wildlife Insights also publishes its data to GBIF. This results in duplicate datasets on GBIF. The Wildlife Insights dataset will be significantly larger, but that doesn't negate the duplication issue.

Scenario 2: A researcher publishes their camera trap dataset to GBIF. They manage their data in Wildlife Insights, but Wildlife Insights doesn't publish to GBIF. The researcher and Wildlife Insights would like to connect the dataset to other Wildlife Insights datasets on GBIF. Here, instead of Wildlife Insights publishing in bulk, a collection of datasets is connected, which as a whole represents the initiative.

debpaul commented 1 year ago

@ben-norton Scenario 2 is what I'm thinking about (although in the situation I have in mind, your first scenario is also very likely to happen).

It makes me think hard from a Latimer Core perspective. We're really talking about whole/parts relationships and the many variables around which we might pivot or group data.

So, for a given distributed project:

a) they would agree to submit to GBIF only once;
b) there would be a field (grant number?) around which their data could be grouped;
c) or maybe each group in this consortium has a group ID that goes with the grant number?

Or is it possible that we (and GBIF) give up worrying about duplicates and instead use the data/metadata to find the dupes, as Nicky's algorithms are doing?

d) Even in that case, the original project would still like to see, grab through the API, and visualize their aggregated data.
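
A minimal sketch of (d), assuming the project keeps a list of its GBIF dataset keys (the UUIDs below are placeholders); the endpoint itself is the public GBIF occurrence search API:

```python
# Minimal sketch of pulling a project's records back out of GBIF by dataset
# key. The dataset keys are placeholders. (Search paging is capped at
# 100,000 records; the occurrence download API is the right tool for full
# exports.)
import requests

PROJECT_DATASET_KEYS = [
    "00000000-0000-0000-0000-000000000001",  # placeholder UUID for group A
    "00000000-0000-0000-0000-000000000002",  # placeholder UUID for group B
]

def project_occurrences(dataset_keys, page_size=300):
    """Yield every occurrence record across the project's datasets."""
    for key in dataset_keys:
        offset = 0
        while True:
            resp = requests.get(
                "https://api.gbif.org/v1/occurrence/search",
                params={"datasetKey": key, "limit": page_size, "offset": offset},
            )
            resp.raise_for_status()
            page = resp.json()
            yield from page["results"]
            if page.get("endOfRecords", True):
                break
            offset += page_size

count = sum(1 for _ in project_occurrences(PROJECT_DATASET_KEYS))
print(f"records aggregated across the project: {count}")
```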

Where does this leave us? Are there standards in place to help us do this? Or do we have a gap?

Jegelewicz commented 1 year ago

> Or is it possible that we (and GBIF) give up worrying about duplicates and instead use the data/metadata to find the dupes, as Nicky's algorithms are doing?

Yes, please. There is no real way to prevent duplicates. Institutions don't know, and should not control, what other institutions do with their data. Institutions also share pieces of a single thing, and one institution should not be prevented from publishing just because another already has.

debpaul commented 1 year ago

> > Or is it possible that we (and GBIF) give up worrying about duplicates and instead use the data/metadata to find the dupes, as Nicky's algorithms are doing?
>
> Yes, please. There is no real way to prevent duplicates. Institutions don't know, and should not control, what other institutions do with their data. Institutions also share pieces of a single thing, and one institution should not be prevented from publishing just because another already has.

Gotcha @Jegelewicz, although in this case I'm really talking about projects, not institutions. We've needed, for a long time, a way to say a specific dataset was mobilized via a specific grant number. Something like that would be useful. When you start to tease it apart, many specimens will be touched / imaged / sampled / etc. in connection with different grants. So it's also a one-to-many thing. We need a way to group the objects around that grant number...
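
The one-to-many shape is easy to see in miniature. A hypothetical sketch (all identifiers and grant numbers invented) of inverting a specimen-to-grants mapping so the objects can be gathered per grant:

```python
# Hypothetical sketch: one specimen touched by several grants, inverted so
# that every object tied to a given grant number can be gathered together.
# All identifiers and grant numbers below are invented.
from collections import defaultdict

# occurrenceID -> grant numbers under which the specimen was worked on
grants_per_specimen = {
    "urn:uuid:0001": ["GRANT-001"],               # digitized
    "urn:uuid:0002": ["GRANT-001", "GRANT-002"],  # digitized, then georeferenced
}

specimens_per_grant = defaultdict(list)
for occ_id, grants in grants_per_specimen.items():
    for grant in grants:
        specimens_per_grant[grant].append(occ_id)

print(dict(specimens_per_grant))
# {'GRANT-001': ['urn:uuid:0001', 'urn:uuid:0002'], 'GRANT-002': ['urn:uuid:0002']}
```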

Jegelewicz commented 1 year ago

> We've needed, for a long time, a way to say a specific dataset was mobilized via a specific grant number.

I could see this being covered by the Identifier class in Latimer Core.

debpaul commented 1 year ago

> > We've needed, for a long time, a way to say a specific dataset was mobilized via a specific grant number.
>
> I could see this being covered by the Identifier class in Latimer Core.

Thanks! It's definitely parallel to the idea of pivoting different parts of the same collection in different ways.

ymgan commented 1 year ago

> We've needed, for a long time, a way to say a specific dataset was mobilized via a specific grant number.

I am not sure if this is relevant, but we use the project id in the EML, assuming, of course, that one dataset has only one project. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR%2F154%2FA1%2FRECTO
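
For example, a minimal sketch of the same grouping via the GBIF registry API, assuming /v1/dataset/search accepts a projectId filter that mirrors the portal's project_id parameter:

```python
# Minimal sketch of grouping datasets by their EML project id via the GBIF
# registry API (assuming a projectId filter mirroring the portal's
# project_id parameter).
import requests

resp = requests.get(
    "https://api.gbif.org/v1/dataset/search",
    params={"projectId": "BR/154/A1/RECTO", "limit": 20},
)
resp.raise_for_status()
for dataset in resp.json()["results"]:
    print(dataset["key"], dataset["title"])
```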

For datasets with multiple projects, an issue is open here: https://github.com/gbif/ipt/issues/1780

debpaul commented 1 year ago

> > We've needed, for a long time, a way to say a specific dataset was mobilized via a specific grant number.
>
> I am not sure if this is relevant, but we use the project id in the EML, assuming, of course, that one dataset has only one project. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR%2F154%2FA1%2FRECTO
>
> For datasets with multiple projects, an issue is open here: gbif/ipt#1780

@ymgan in the scenario I'm describing, various groups would be collecting data (observations and specimens) on their own, each in their own area of the USA, using a standard protocol. The goal would be for all these distributed datasets to come together via a particular data point. Perhaps this project ID in the EML could do that. Does this sound parallel to what you are describing?

ymgan commented 1 year ago

Yes, that is parallel to what I am describing. However, that works at the dataset level. At the record level, datasetName and datasetID do indeed seem to be intended for this purpose:

@dagendresen made a good remark here: https://github.com/gbif/pipelines/issues/665#issuecomment-1261672298

> One important reason or rationale is to group records produced or updated from different project funding. Similar to how the GBIF BID, BIFA, and CESP projects list datasets produced by this project funding. However, often we see project funding for georeferencing, or taxonomic validation, and desire to "tag" the data records (or actually ultimately rather desire to "tag" the actual real-world collection specimens) that were georeferenced from a specific project funding --> to credit the funder and track fulfillment of the promise to the funder of e.g. georeferencing 10 000 collection specimens...
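
That wish to tag individual records could look something like the sketch below. It uses dwc:dynamicProperties (a real Darwin Core term) with an invented key, since no standard term for a funding number exists yet:

```python
# Hypothetical sketch of tagging individual records with the funding behind a
# curation step. dwc:dynamicProperties is a real Darwin Core term; the
# "georeferencedByGrant" key inside it is invented for this example.
import json

records = [
    {"occurrenceID": "urn:uuid:2001",
     "dynamicProperties": '{"georeferencedByGrant": "GRANT-001"}'},
    {"occurrenceID": "urn:uuid:2002",
     "dynamicProperties": "{}"},
]

def funded_by(records, grant):
    """Return occurrenceIDs whose georeferencing was tagged with this grant."""
    hits = []
    for rec in records:
        props = json.loads(rec.get("dynamicProperties") or "{}")
        if props.get("georeferencedByGrant") == grant:
            hits.append(rec["occurrenceID"])
    return hits

print(funded_by(records, "GRANT-001"))  # ['urn:uuid:2001']
```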

MattBlissett commented 1 year ago

> Or is it possible that we (and GBIF) give up worrying about duplicates and instead use the data/metadata to find the dupes, as Nicky's algorithms are doing?

See https://data-blog.gbif.org/post/clustering-occurrences/, which describes what we're already doing and references Nicky's work.
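
For anyone who wants to use that clustering programmatically, a minimal sketch against the occurrence API; isInCluster is a documented search filter, while the related-records endpoint is experimental and may change:

```python
# Minimal sketch of exploring GBIF's existing occurrence clustering.
# The dataset key is a placeholder UUID; swap in a real one to try this.
import requests

DATASET_KEY = "00000000-0000-0000-0000-000000000001"

# Find occurrences in the dataset that GBIF has clustered with records elsewhere.
resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"datasetKey": DATASET_KEY, "isInCluster": "true", "limit": 5},
)
resp.raise_for_status()
for occ in resp.json()["results"]:
    gbif_id = occ["key"]  # GBIF's internal occurrence id
    # Experimental endpoint returning the records clustered with this one.
    related = requests.get(
        f"https://api.gbif.org/v1/occurrence/{gbif_id}/experimental/related"
    )
    related.raise_for_status()
    print(gbif_id, "related records:",
          len(related.json().get("relatedOccurrences", [])))
```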