tdwg / dwc-qa

Public question and answer site for discussions about Darwin Core

what's the difference between DwC and DwC-A? #95

Closed ekrimmel closed 6 years ago

ekrimmel commented 7 years ago

discussion from DwC Hour #0: Introduction to Darwin Core Hour...

Brian: Pretty basic, but could you differentiate DwC from DwC Archive?

Erica Krimmel: @Brian, DwC are the rules and a DwC archive is data/metadata generated according to those rules.

Laura Russell: Brian, DwC is the Standard, DwCA is the text output mapping the standard with the actual data. See http://rs.tdwg.org/dwc/terms/guides/text/index.htm

debpaul commented 7 years ago

Hi @ekrimmel, here's an idea for documentation on this question.

Proposed wiki page title: DwC data standard vs DwC Archive data sharing package

Question: What is the difference between the Darwin Core Standard and a Darwin Core Archive?

Quick answer: the Darwin Core Standard is a list of terms, while the Darwin Core Archive is a package -- an agreed-upon format for all the compiled data and metadata.

Answer: "Darwin Core is a data standard for publishing and integrating biodiversity information" (Wieczorek et al. 2012). Loosely speaking, it is a bag of terms the broader community agrees to use when sharing biodiversity data with the outside world. Some examples of DwC standard terms include: dwc:geodeticDatum, dwc:habitat, dwc:eventDate. Each term in the standard has a definition that explains the concept behind the term, which helps us decide which field in our data is the best match. Perhaps your local database has a field called collectingDate that is expected to be verbatim (that is, as written on the specimen label). When you export your data to a file using Darwin Core standard terms, you would rename this field verbatimEventDate. The process of figuring out the match between your local data source and the most appropriate Darwin Core term is called mapping. It may be obvious now that you've read this, but mapping is just the first step that must be done before everyone can add their data to a common data pile such as you might find in VertNet, GBIF, iDigBio, Canadensys, ALA, etc.
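As a rough illustration of the mapping step, here is a minimal Python sketch. Only collectingDate and the Darwin Core term names come from the explanation above; the other local column names (habitatNotes, datum, shelf), the sample row, and the helper map_to_dwc are hypothetical, and a real mapping would need a person to check each field against the term definitions.

```python
import csv
import io

# Hypothetical local column names mapped to Darwin Core term names.
# Only "collectingDate" appears in the discussion above; the rest are made up.
FIELD_MAP = {
    "collectingDate": "verbatimEventDate",
    "habitatNotes": "habitat",
    "datum": "geodeticDatum",
}


def map_to_dwc(local_csv: str, field_map: dict) -> str:
    """Rename mapped columns to Darwin Core terms and drop unmapped ones."""
    reader = csv.DictReader(io.StringIO(local_csv))
    # Keep only the local columns that have a Darwin Core match.
    kept = [name for name in (reader.fieldnames or []) if name in field_map]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([field_map[name] for name in kept])
    for row in reader:
        writer.writerow([row[name] for name in kept])
    return out.getvalue()


# Made-up export from a local database; "shelf" has no Darwin Core match here.
local = (
    "collectingDate,shelf,habitatNotes,datum\n"
    "12 May 1903,B4,oak woodland,WGS84\n"
)
mapped = map_to_dwc(local, FIELD_MAP)
print(mapped)
```

Dropping unmapped columns is just one choice for this sketch; you might prefer to flag them for review instead.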

Once your biodiversity data is mapped, it can be shared or published and successfully aggregated and integrated with other such data. But now we have a challenge analogous to the one above. If we are to compile (aggregate) data from different sources, we must figure out what the data package we are going to share looks like. Both humans and computers need to be able to understand the data. A (sort-of okay) example: when we address an envelope, we know that line 1 is the name of the person or organization, line 2 is the street address, and line 3 is the city, state, and all-important ZIP code. And we always provide a return address so that humans and computers can figure out the provenance of the package. This standardization of the package label allows machine sorting and routing. Today, for mailing some packages, we are also required to declare what is inside the package.

Similarly, before we can publish (share) our mapped data, we need a standard data sharing format that tells humans and computers where the data comes from and what is inside the data package. The Biodiversity Data Standards (TDWG) community, in concert with GBIF, designed and defined such a data package and named it Darwin Core Archive (DwC-A).

Question: So, what is inside a DwC-A?

Quick Answer: Multiple files. You would find at least: 1) your data file containing information about your museum specimens or field observations, 2) an index file that tells the computer what columns (standard terms) are present in your data file, and 3) a metadata file that contains information about who is providing this data (your organization and contact information, for example).
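To make the role of the index file more concrete, here is a small Python sketch that reads a meta.xml-style descriptor and reports which data file it points to and which term sits in which column. The descriptor content is made up for illustration (including the file names occurrence.csv and eml.xml), and its layout follows my understanding of the Darwin Core text guide, so check it against the guide before relying on it. The describe helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative descriptor; file names and listed terms are assumptions.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/verbatimEventDate"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/habitat"/>
  </core>
</archive>"""

# Namespace used by the descriptor elements.
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}


def describe(meta_xml: str) -> dict:
    """Summarize which files and columns a descriptor declares."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwca:core", NS)
    data_file = core.find("dwca:files/dwca:location", NS).text
    # Map column index -> short term name (last segment of the term URI).
    columns = {
        int(field.get("index")): field.get("term").rsplit("/", 1)[-1]
        for field in core.findall("dwca:field", NS)
    }
    return {
        "metadata_file": root.get("metadata"),
        "data_file": data_file,
        "columns": columns,
    }


summary = describe(META_XML)
print(summary)
```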

Answer: (adapted from Darwin Core Archive Assistant) Darwin Core Archive (DwC-A) is a biodiversity informatics data packaging standard that makes use of the Darwin Core terms to produce a single, self-contained dataset for species occurrence or taxonomic (species) data. It is the preferred format for publishing data to the Global Biodiversity Information Facility (GBIF). You export your data as a set of one or more text (CSV) files. A simple XML descriptor file (called meta.xml) is required to inform others how your files are organized.
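The packaging idea can be sketched in a few lines of Python: put a data file, a meta.xml descriptor, and a metadata file into one zip. Everything specific here is an assumption for illustration: the file names occurrence.csv and eml.xml, the sample record, and the helper build_archive. The metadata file is only a placeholder and is not valid metadata in any real schema. Tools such as GBIF's publishing software normally generate these files for you.

```python
import io
import zipfile

# Made-up data file: one record with an identifier and two Darwin Core terms.
OCCURRENCE_CSV = (
    "occurrenceID,verbatimEventDate,habitat\n"
    "urn:example:1,12 May 1903,oak woodland\n"
)

# Descriptor telling readers how occurrence.csv is organized (illustrative).
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/verbatimEventDate"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/habitat"/>
  </core>
</archive>
"""

# Placeholder only: a real archive would carry a proper metadata document
# describing the provider, contacts, and dataset.
METADATA_PLACEHOLDER = "<metadata><title>Example dataset</title></metadata>\n"


def build_archive(files: dict) -> bytes:
    """Zip the given {file name: text content} pairs into one in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


archive_bytes = build_archive({
    "occurrence.csv": OCCURRENCE_CSV,
    "meta.xml": META_XML,
    "eml.xml": METADATA_PLACEHOLDER,
})

# Reopen the archive to confirm what the package contains.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    names = sorted(archive.namelist())
    data_back = archive.read("occurrence.csv").decode("utf-8")
print(names)
```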

To @ekrimmel - this certainly needs work and graphics, but it's a start. There are other explanations, presentations, graphics. Would be good to link to good ones here...

Links to other explanations: Darwin Core Archive Assistant; Darwin Core Archive on Wikipedia

Citation: Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, et al. (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. https://doi.org/10.1371/journal.pone.0029715

ekrimmel commented 6 years ago

documented at https://github.com/tdwg/dwc-qa/wiki/DwC-vs-DwC-A