Literature-based data including species trait data (3.5.4)

Sharif and I added some content to the Blueprint document regarding literature-based data sources. I'll add it here as well for convenience:

D3.1 DTO-BioFlow data flow blueprint

3.5.1 Existing data flows

Literature-based data is sourced directly from several data integrators (Plazi, OpenBiodiv, and SIBiLS, among other actors). An overview of the connections between the data integrators and repositories in this area is shown below:

[Image: BiCIKL-GBIF1] Ecosystem of widely integrated services that create and provide access to data in literature. Image credit: BiCIKL project.

[Image: BiCIKL-ENA-Treatment-Bank] Sequence data linkages. Image credit: BiCIKL project.

3.5.2 Data not in EMODnet Biology

Literature-based data from the data integrators mentioned above (Plazi, OpenBiodiv, and SIBiLS) is not currently in EMODnet Biology. This type of data is probably not suitable for submission to EMODnet Biology.

3.5.3 Sustainable Ingestion procedures towards EMODnet Biology and the DTO

Because of the particularities of literature-based data, it probably does not make sense to submit it to EMODnet Biology (it also does not fit its requirements), and the same applies to the DTO data lake. This is because these data consist of journal articles, text files or structured metadata that match a given search query (e.g. one involving a specific taxon), as well as other types of information (e.g. knowledge graphs and ontologies). However, it could be interesting to integrate these resources indirectly into the rest of the technical architecture. This is something to discuss with WP5.

3.5.4 Extraction and processing into harmonised and fit-for-purpose science-based data products

By exploring various data links between literature and biodiversity resources, the BiCIKL project has enhanced discovery and knowledge creation. These links, for example, provide information on biotic interactions and connect species names in literature to taxonomic databases, DNA sequences, and type specimen references. Structured and persistent access to such linked data is crucial, as it facilitates efficient information harvesting through project outputs. Notably, some of these links are already available via SPARQL and REST endpoints, large data dumps, and nanopublications.
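
To illustrate what harvesting through a SPARQL endpoint could look like, here is a minimal Python sketch that prepares a query for resources labelled with a given taxon name. It is only an illustration: the endpoint URL is a placeholder, the function names are made up for this example, and matching on `rdfs:label` is an assumption, since each service (e.g. OpenBiodiv) has its own graph model that the query would need to follow. The request is built following the SPARQL 1.1 Protocol (GET with a `query` parameter) but is not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder (assumption): replace with the SPARQL endpoint of the chosen service.
ENDPOINT = "https://example.org/sparql"


def build_taxon_query(taxon_name: str, limit: int = 10) -> str:
    """Return a SPARQL query for resources labelled with a taxon name.

    Matching on rdfs:label is an illustrative assumption; a real service
    may model taxon names with different classes and properties.
    """
    # Escape backslashes and double quotes so the name is a valid string literal.
    safe = taxon_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?resource WHERE {\n"
        f'  ?resource rdfs:label "{safe}" .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request(endpoint: str, query: str) -> Request:
    """Prepare (but do not send) a GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, build_taxon_query("Mytilus edulis"))
```

Passing `request` to `urllib.request.urlopen` would run the query against a real endpoint; the JSON response could then be parsed with the standard `json` module.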

3.5.4.1 Implementation of standards, quality assurance, data models and communication protocols

Literature-based data is closely connected to the area of standards and data models. For example, one of the data sources (OpenBiodiv) involves an ontology about biodiversity (OpenBiodiv-O). Similarly, SIBiLS makes use of different vocabularies to process its results, while TreatmentBank (from Plazi) has a strong emphasis on implementing the FAIR principles.
