tdwg / dwc-qa

Public question and answer site for discussions about Darwin Core
Apache License 2.0

Technical information needed (or useful) in Darwin Core? #170

Open DimEvil opened 3 years ago

DimEvil commented 3 years ago

We publish our data following a set of well-defined workflows, drawing on well-defined data systems, databases, or projects/applications. I think this could be valuable information when publishing a dataset, especially when it comes to filtering data on a data aggregator website (e.g. GBIF). For example, we publish several datasets coming from a database we call NBN, a project named MICA, another database named VIS, and plenty of other systems. But at the moment there is no simple way for me to get all the data originating from one of these databases at the record level from an aggregator. It would be very nice if I could query GBIF for all the occurrences coming from the MICA project or all the records originating from the NBN database. Which DwC term can be used for this information? collectionCode looks promising, but I'm not convinced (collectionCode: The name, acronym, coden, or initialism identifying the collection or data set from which the record was derived.). Or the explanation is not clear...

pzermoglio commented 3 years ago

For projects, and only if you want to search on GBIF, you can use the project ID available in the IPT metadata tabs. The GBIF portal now allows searching by project ID, although so far only for a single ID, not a list.
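A minimal sketch of how such a project lookup could be scripted, assuming the public GBIF occurrence search API at `https://api.gbif.org/v1/occurrence/search` and its `projectId` parameter. The project IDs used here are placeholders, not the identifiers actually entered in any IPT metadata. Because only single IDs are searchable, the sketch builds one query per ID:

```python
from urllib.parse import urlencode

# Assumed base URL of the public GBIF occurrence search API.
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def project_query_urls(project_ids, limit=20):
    """Return a dict mapping each project ID to its own occurrence-search URL."""
    urls = {}
    for project_id in project_ids:
        # One request per project ID, since a list of IDs is not assumed to work.
        query = urlencode({"projectId": project_id, "limit": limit})
        urls[project_id] = f"{GBIF_OCCURRENCE_SEARCH}?{query}"
    return urls


# "MICA" and "EXAMPLE-PROJECT" are hypothetical project IDs.
urls = project_query_urls(["MICA", "EXAMPLE-PROJECT"])
for project_id, url in urls.items():
    print(project_id, url)
```

The results of the separate queries would still have to be merged by the caller.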

ManonGros commented 3 years ago

Note that on GBIF, the collectionCode is now used to link occurrences to GRSciColl entries (the same goes for institutionCode). This means that if you use a code that already exists in GRSciColl, the occurrences will be displayed on the corresponding page (see this example). More information on how this matching is done here. But if you would like to have a specific GRSciColl entry linked to some occurrences, this is the way to go.

qgroom commented 3 years ago

I wonder if some of the Darwin Core extensions might help.

For example, the Literature References extension could link an observation to any citable resource.

Also Web links (apparently under development).

I will also mention the issue within the Agent Extension Task Group. Perhaps a project can be considered an "agent" and an appropriate action would be used to link the observation.

DimEvil commented 3 years ago

Hi, we do use the projectID, but this does not fulfill my needs. I think that how the data is published (the workflow) or the name of the original database (which can lead to a lot of different datasets) can be seen as a property of the data. GRSciColl is more about scientific collections (I'm thinking of specimens and the name of a collection). I'm thinking about the name of a database, for example. I'm not 100% sure if this would be a big win of information for worldwide users, but it is definitely a win for an institution itself.

For example: I want all the occurrences coming from the VIS database (We published 4 VIS-datasets):

In GBIF I can do this: https://www.gbif.org/dataset/search?q=VIS and it gives me a list of datasets with VIS in the title. I can search for occurrences and look for where 'VIS' is available in the title, mark them and I.

If the acronym 'VIS' were a property of the data, I could search occurrences, specify 'collectionCode' VIS, and immediately see all the occurrences originating from the VIS database.

And thanks for all the answers so far!

tucotuco commented 3 years ago

I wonder, @ManonGros, if the network capabilities of the GBIF registry might be a good solution for @DimEvil?

DimEvil commented 3 years ago

Hi, I have now used VIS as a 'virtual' collectionCode and it gives me exactly what I want: https://www.gbif.org/occurrence/search?collection_code=VIS&occurrence_status=present provides all records from the VIS database across several datasets...

But I still think that technical information about the datasets would be useful in DwC.
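For anyone who wants to script the same lookup, here is a rough sketch. It assumes the public GBIF API endpoint `https://api.gbif.org/v1/occurrence/search` with the `collectionCode` and `occurrenceStatus` parameters, and that `limit=0` returns only the paging envelope with a `count` field. The helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public GBIF occurrence search API.
API = "https://api.gbif.org/v1/occurrence/search"


def collection_code_query(collection_code, occurrence_status="PRESENT", limit=0):
    """Build an API URL roughly equivalent to the portal search above."""
    params = {
        "collectionCode": collection_code,
        "occurrenceStatus": occurrence_status,
        "limit": limit,  # assumption: 0 asks for the count only, no records
    }
    return f"{API}?{urlencode(params)}"


def count_occurrences(url):
    """Fetch the URL and return the reported record count (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)["count"]


url = collection_code_query("VIS")
print(url)
# count_occurrences(url) would perform the actual request.
```

Because the filter works at the record level, one request covers every dataset that uses the same code.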

peterdesmet commented 3 years ago

@DimEvil what technical information do you want to express in addition to the source collection/database system?

@tucotuco networks are indeed a good way to collect datasets related to a project/community and have the advantage that one dataset can belong to multiple networks. See also the suggestion to make registration with a network easier: https://github.com/gbif/ipt/issues/986#issuecomment-828390951

DimEvil commented 3 years ago

@peterdesmet I'm providing technical information in collectionCode now, which is not 100% correct; I would rather do this correctly. Undoubtedly there is more valuable information possible, but this needs some thinking, I suppose.

peterdesmet commented 3 years ago

I think your use of the term is correct: collectionCode (... identifying the collection or data set from which the record was derived) is imo a suitable term to indicate the source database for non-specimen records. Additional technical information regarding provenance or data standardization steps is imo best expressed in the "Method steps" in the metadata, e.g. https://www.gbif.org/dataset/8a5cbaec-2839-4471-9e1d-98df301095dd#methodology

DimEvil commented 3 years ago

I think collectionCode was originally intended to define the physical collection a specimen belongs to, not the digital collection. This works when you have defined only a physical collection or only a 'digital' collection. Take the RBINS museum as an example: what if the specimen is in the vertebrate collection (collectionCode = VERTEBRATEN) and digitally in the DARWIN database?
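To make the conflict concrete, here is a small sketch with made-up records. Only the values VERTEBRATEN, DARWIN and VIS come from this thread; the occurrenceID values are invented, and the `sourceDatabase` key is purely hypothetical and is not a Darwin Core term.

```python
# Hypothetical records for illustration only.
# "sourceDatabase" is NOT a Darwin Core term; it stands in for the missing field.
records = [
    {"occurrenceID": "example:1", "collectionCode": "VERTEBRATEN", "sourceDatabase": "DARWIN"},
    {"occurrenceID": "example:2", "collectionCode": "VERTEBRATEN", "sourceDatabase": "DARWIN"},
    {"occurrenceID": "example:3", "collectionCode": "VIS", "sourceDatabase": "VIS"},
]


def select(records, field, value):
    """Return the occurrenceIDs of the records whose `field` equals `value`."""
    return [r["occurrenceID"] for r in records if r.get(field) == value]


# collectionCode already holds the physical collection for the specimens,
# so the database name cannot be found through that field.
by_code = select(records, "collectionCode", "DARWIN")
by_database = select(records, "sourceDatabase", "DARWIN")
print(by_code)
print(by_database)
```

With a single collectionCode value per record, the specimen records can be found by their physical collection or by their source database, but not by both.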