globalbioticinteractions / refuted-biotic-interactions-by-eol

Biotic interaction data that failed Encyclopedia of Life data validation

suggest to add columns to include verbatim refuted record data #2

Closed by jhpoelen 4 years ago

jhpoelen commented 4 years ago

In order to understand which interaction record is being refuted by EOL validation rules, we need to reference the refuted data record somehow.

Because existing datasets use various ways (or none at all) to identify shared records (e.g., occurrence id), the most pragmatic way to reference interaction data records is to include most/all of the data they contain.

Currently, refuted data records shared by EOL include some reference to the refuted data:

argumentTypeId = https://en.wiktionary.org/wiki/refute
argumentTypeName = refute
interactionTypeId = http://purl.obolibrary.org/obo/RO_0002623
interactionTypeName = has flowers visited by
sourceTaxonId = EOL:1046642
sourceTaxonName = Andrena imitatrix
sourceTaxonRank = species
sourceTaxonKingdomName = Animalia
targetTaxonId = GBIF:3251867
targetTaxonName = Grossularia
targetTaxonRank = genus
targetTaxonKingdomName = Animalia

However, to make the refuted data records more explicit, I'd like to propose adding all/most data from the refuted records, with each original (dwc) column name prefixed with refuted:. This would be in addition to the existing fields like sourceTaxonName, interactionTypeId, targetTaxonName etc.

This might look something like:

argumentTypeId = https://en.wiktionary.org/wiki/refute
argumentTypeName = refute
sourceTaxonName = Andrena imitatrix
sourceTaxonRank = species
sourceTaxonKingdomName = Animalia
interactionTypeId = http://purl.obolibrary.org/obo/RO_0002623
interactionTypeName = has flowers visited by
targetTaxonName = Grossularia
targetTaxonRank = genus
targetTaxonKingdomName = Animalia
refuted:http://eol.org/schema/Association:http://eol.org/schema/associationID = EOLrefute_globi:assoc:23953779-EOL:1046642-VISITS_FLOWERS_OF-GBIF:3251867
refuted:http://eol.org/schema/Association:http://eol.org/schema/associationType = http://purl.obolibrary.org/obo/RO_0002623
refuted:http://eol.org/schema/Association:http://rs.tdwg.org/dwc/terms/occurrenceID = ...
... = ...

Please note that I excluded citation info columns for the sake of simplicity.

Also, note that the proposed notation for the columns, refuted:[row type]:[term uri] (e.g., refuted:http://eol.org/schema/Association:http://rs.tdwg.org/dwc/terms/occurrenceID), can probably be expressed some other way. In addition, I leave it up to you whether to include all or just a subset of the available dwc terms provided via the GloBI DwC.

@KatjaSchulz as discussed, I made an attempt to document the slight change in the refutation table format that you and Eli provide. Please let me know if this makes sense.

KatjaSchulz commented 4 years ago

Looks good. I'll let you know when we have something to share.

KatjaSchulz commented 4 years ago

Looking in some more detail at the implementation of this, I've come up against the following issue. For each refuted record, we need to reference both source and target occurrences/taxa. Using the proposed notation for the columns (refuted:[row type]:[term uri]), there is no indication which occurrence/taxon value refers to the source or target of the association. This can of course be reconstructed from the refuted:http://eol.org/schema/Association:http://rs.tdwg.org/dwc/terms/occurrenceID and refuted:http://eol.org/schema/Association:http://eol.org/schema/targetOccurrenceID values, but that's not a particularly human-readable solution. Also, you end up with two versions of each occurrence and taxon column, e.g., refuted:http://rs.tdwg.org/dwc/terms/Occurrence:http://rs.tdwg.org/dwc/terms/lifeStage without a clear way to associate each column with the rest of its occurrence/taxon record, except by reference to its position in the column sequence which is not ideal.

So I think we need a way to clearly indicate the source and target components of the association record. We could do this by modifying the notation for the columns like this: refuted:[record component]:[row type]:[term uri] where record component is one of the following: reference, association, occurrence, targetOccurence, taxon, targetTaxon.
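
To make the proposed notation concrete, here is a minimal Python sketch of how such column names could be assembled. The helper name refuted_column and the validation step are assumptions for illustration only; they are not part of any EOL or GloBI tool. The component names are spelled exactly as listed above.

```python
# Hypothetical helper: builds a column name in the proposed
# refuted:[record component]:[row type]:[term uri] notation.
RECORD_COMPONENTS = (
    "reference",
    "association",
    "occurrence",
    "targetOccurence",  # spelled as in the proposal above
    "taxon",
    "targetTaxon",
)


def refuted_column(component: str, row_type: str, term_uri: str) -> str:
    """Return the column name for one term of one record component."""
    if component not in RECORD_COMPONENTS:
        raise ValueError(f"unknown record component: {component}")
    return f"refuted:{component}:{row_type}:{term_uri}"


DWC = "http://rs.tdwg.org/dwc/terms/"

# The same row type and term now yield two distinct column names,
# one for the source occurrence and one for the target occurrence.
source_life_stage = refuted_column("occurrence", DWC + "Occurrence", DWC + "lifeStage")
target_life_stage = refuted_column("targetOccurence", DWC + "Occurrence", DWC + "lifeStage")
```

With the component in the name, the lifeStage column of the source no longer collides with that of the target, and neither depends on its position in the column sequence.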

What do you think? If you can come up with a better solution to this problem, I would be happy to discuss it.

seltmann commented 4 years ago

@jhpoelen @KatjaSchulz I wonder if the information could also include information to help correct the data? I checked and I have 125 records for Andrena imitatrix, none with the interaction "has flowers visited by" as an interaction term. Could the interactionID be maintained? For collection data, it would be important to maintain the catalogNumber.

KatjaSchulz commented 4 years ago

The problem is that there is no such thing as an interactionID in the GloBI schema. That's why it can often be difficult to identify the GloBI record that is the target of a particular refutation. In the scenario proposed above, we are trying to address this by including all data from the original record in the refutation record. The catalogNumber from the original record would then be preserved in refuted:http://rs.tdwg.org/dwc/terms/Occurrence:http://rs.tdwg.org/dwc/terms/catalogNumber

jhpoelen commented 4 years ago

> I wonder if the information could also include information to help correct the data?

Once we figure out some scheme to allow for linking to the refuted data, I imagine that we can add new features that use that information to suggest ways to correct the data.

> I checked and I have 125 records for Andrena imitatrix, none with the interaction "has flowers visited by" as an interaction term. Could the interactionID be maintained?

Currently, GloBI is not minting any persistent identifiers. This interactionID is a temporary way to help populate the DwC-A star schema. Minting a persistent identifier for interaction records is not technically hard, but maintaining persistent identifiers and their links to interaction records is quite a commitment and would make maintaining GloBI more time-consuming.

So instead, we are trying to figure out a way to use identifying information in the existing dataset records to help identify a specific interaction record.

Please note that I am working on a way to document the provenance of data records in detail. This would allow for a way to point to a specific location in a reliably referenced data archive from which the data was extracted.

> For collection data, it would be important to maintain the catalogNumber.

Yes! In a discussion with @KatjaSchulz, we decided to include all possible data of a species interaction data record along with the refuted record. For collection records, this would include (if available) catalogNumber for both source and target specimen records.

jhpoelen commented 4 years ago

Also, in reply to @KatjaSchulz's

> So I think we need a way to clearly indicate the source and target components of the association record. We could do this by modifying the notation for the columns like this: refuted:[record component]:[row type]:[term uri] where record component is one of the following: reference, association, occurrence, targetOccurence, taxon, targetTaxon.

Your proposal of including record components in the refutation would help to identify the interaction parts. I was wondering whether to use the column name used in interactions.tsv on https://globalbioticinteractions.org/data . This would connect less to DarwinCore land, but would make the GloBI data products a little more consistent.

So, another way of including the source/target idea would be to reuse the column names of the referenced dataset, like: refuted:sourceTaxonName, refuted:targetTaxonName, etc. or perhaps (keeping the CamelCase): refutedSourceTaxonName, refutedTargetTaxonName, refutedInteractionType, refutedSourceCatalogNumber.

Or using a colon delimiter: refuted:source:taxon:name, refuted:source:taxon:id, refuted:target:taxon:name, refuted:interaction_type:id, etc.
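
For comparison, a small Python sketch of the three naming options applied to an interactions.tsv column name. The function names are made up for illustration, and the camelCase splitting is an assumption that only covers simple names like sourceTaxonName (it would not reproduce an underscore form such as interaction_type).

```python
import re


def prefixed(column: str) -> str:
    """Option 1: keep the interactions.tsv name, add a refuted: prefix."""
    return f"refuted:{column}"


def camel_case(column: str) -> str:
    """Option 2: fold the prefix into the CamelCase name."""
    return "refuted" + column[0].upper() + column[1:]


def colon_delimited(column: str) -> str:
    """Option 3: split the camelCase words and join them with colons."""
    words = re.findall(r"[A-Z]?[a-z]+", column)
    return ":".join(["refuted"] + [word.lower() for word in words])


for name in ("sourceTaxonName", "sourceTaxonId", "targetTaxonName"):
    print(prefixed(name), camel_case(name), colon_delimited(name))
```

All three carry the same information; the difference is mostly how closely the refuted columns mirror the names already used in interactions.tsv.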

@seltmann @KatjaSchulz curious to hear your thoughts on how to refer to an interaction data record without introducing some (artificial) unique identifier for an interaction record.

seltmann commented 4 years ago

@jhpoelen I might not be understanding completely, but why does the interactionID from a dataset have to be considered any other way except an identifier only useful in a dataset? I think what I am asking is, can some of the data from the originally submitted interactions.tsv be included in the report to help identify the record? This could be in an easily ignored text block that includes the interaction term as it was provided and whatever arbitrary identifiers.

KatjaSchulz commented 4 years ago

Correction: I checked and the value for http://rs.tdwg.org/dwc/terms/catalogNumber is empty in the GloBI DwC-A export. So it looks like catalog numbers are not preserved. In fact, all the fields in the occurrence extension are empty except for occurrenceID and taxonID.

@jhpoelen Yes, we could use the column names from interactions.tsv whenever they can be mapped from the DwC-A. The only column we would probably want to preserve from the DwC-A is associationID because it will make it easier for us to keep track of which DwC-A record exactly we are refuting.

So the full complement of columns would then be like this:

identifier
argumentTypeId
argumentTypeName
argumentReasonID
argumentReasonName
interactionTypeId
interactionTypeName
sourceCitation
sourceArchiveURI
sourceTaxonId
sourceTaxonName
sourceTaxonRank
sourceTaxonKingdomName
targetTaxonId
targetTaxonName
targetTaxonRank
targetTaxonKingdomName
refuted:associationID (from DwC-A: association:associationID)
refuted:interactionTypeId (from DwC-A: association:associationType)
refuted:referenceCitation (from DwC-A: reference:full_reference)
refuted:referenceDoi (from DwC-A: reference:referenceDoi)
refuted:referenceUrl (from DwC-A: reference:referenceUrl)
refuted:sourceCitation (from DwC-A: association:source)
refuted:sourceOccurrenceId (from DwC-A: occurrence:occurrenceID)
refuted:sourceTaxonId (from DwC-A: taxon:taxonID)
refuted:sourceTaxonName (from DwC-A: taxon:scientificName)
refuted:sourceTaxonRank (from DwC-A: taxon:taxonRank)
refuted:sourceTaxonGenusName (from DwC-A: taxon:genus)
refuted:sourceTaxonFamilyName (from DwC-A: taxon:family)
refuted:sourceTaxonOrderName (from DwC-A: taxon:order)
refuted:sourceTaxonClassName (from DwC-A: taxon:class)
refuted:sourceTaxonPhylumName (from DwC-A: taxon:phylum)
refuted:sourceTaxonKingdomName (from DwC-A: taxon:kingdom)
refuted:targetOccurrenceId (from DwC-A: occurrence:occurrenceID)
refuted:targetTaxonId (from DwC-A: taxon:taxonID)
refuted:targetTaxonName (from DwC-A: taxon:scientificName)
refuted:targetTaxonRank (from DwC-A: taxon:taxonRank)
refuted:targetTaxonGenusName (from DwC-A: taxon:genus)
refuted:targetTaxonFamilyName (from DwC-A: taxon:family)
refuted:targetTaxonOrderName (from DwC-A: taxon:order)
refuted:targetTaxonClassName (from DwC-A: taxon:class)
refuted:targetTaxonPhylumName (from DwC-A: taxon:phylum)
refuted:targetTaxonKingdomName (from DwC-A: taxon:kingdom)
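
As a rough illustration of the mapping, here is a Python sketch that fills the refuted: columns from DwC-A rows. It assumes the association, reference, occurrence and taxon rows have already been joined and are available as plain dicts keyed by the short term names used in the list; the function names are hypothetical and this is not how Eli's export is actually implemented.

```python
# DwC-A taxon term -> suffix of the refuted:[source|target]... column.
TAXON_TERMS = {
    "taxonID": "TaxonId",
    "scientificName": "TaxonName",
    "taxonRank": "TaxonRank",
    "genus": "TaxonGenusName",
    "family": "TaxonFamilyName",
    "order": "TaxonOrderName",
    "class": "TaxonClassName",
    "phylum": "TaxonPhylumName",
    "kingdom": "TaxonKingdomName",
}


def refuted_taxon_columns(role: str, taxon: dict) -> dict:
    """Map one taxon row to refuted:source... or refuted:target... columns."""
    return {
        f"refuted:{role}{suffix}": taxon.get(term, "")
        for term, suffix in TAXON_TERMS.items()
    }


def refuted_columns(association, reference, source_occurrence,
                    source_taxon, target_occurrence, target_taxon) -> dict:
    """Assemble all refuted: columns for a single association record."""
    row = {
        "refuted:associationID": association.get("associationID", ""),
        "refuted:interactionTypeId": association.get("associationType", ""),
        "refuted:referenceCitation": reference.get("full_reference", ""),
        "refuted:referenceDoi": reference.get("referenceDoi", ""),
        "refuted:referenceUrl": reference.get("referenceUrl", ""),
        "refuted:sourceCitation": association.get("source", ""),
        "refuted:sourceOccurrenceId": source_occurrence.get("occurrenceID", ""),
        "refuted:targetOccurrenceId": target_occurrence.get("occurrenceID", ""),
    }
    row.update(refuted_taxon_columns("source", source_taxon))
    row.update(refuted_taxon_columns("target", target_taxon))
    return row


# Illustrative input only; missing terms simply become empty strings.
row = refuted_columns(
    association={"associationType": "http://purl.obolibrary.org/obo/RO_0002623"},
    reference={},
    source_occurrence={},
    source_taxon={"scientificName": "Andrena imitatrix", "taxonRank": "species"},
    target_occurrence={},
    target_taxon={"scientificName": "Grossularia", "taxonRank": "genus"},
)
```

Because the source and target taxa share the same DwC-A terms, the role prefix in the column name is what keeps the two apart.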

jhpoelen commented 4 years ago

@KatjaSchulz thanks for pointing out that the catalogNumber is not being populated in the DwC-A exports. This is a separate issue (see https://github.com/globalbioticinteractions/globalbioticinteractions/issues/529).

I'd imagine it would be useful to add the catalogNumber or any other identifying information (e.g., institutionCode, collectionId) to the interactions.tsv and associated refutation fields. Please confirm and suggest any fields that would help link the records.

Also, @seltmann please let me know if @KatjaSchulz helped answer your question. If not, I'd like to suggest to have a live video chat to discuss.

KatjaSchulz commented 4 years ago

Here are the columns from the DwC occurrence extension that EOL uses but that are empty in the GloBI DwC-A:

institutionCode collectionCode catalogNumber sex lifeStage reproductiveCondition behavior establishmentMeans occurrenceRemarks individualCount preparations fieldNotes samplingProtocol samplingEffort identifiedBy dateIdentified eventDate modified locality decimalLatitude decimalLongitude verbatimLatitude verbatimLongitude verbatimElevation basisOfRecord physiologicalState bodyPart

The term uris for these are in the DwC-A meta.xml file. If you let me know which ones of these you are going to implement in the GloBI DwC-A, I will add those to the refute columns.

jhpoelen commented 4 years ago

@KatjaSchulz thanks for sharing. I hope to find time to help populate these fields. Meanwhile, let us move forward on including the proposed columns in the refuted record columns. Note that many of the interaction records will not have this information anyway because they are not derived from specimen records (e.g., literature records).

KatjaSchulz commented 4 years ago

Sure, we can always add additional columns later.

KatjaSchulz commented 4 years ago

@jhpoelen Eli has updated the EOL refutation resource to the new format. Please have a look and let us know if you see any problems: https://opendata.eol.org/dataset/globi/resource/92595520-35f3-48f2-95cf-ea67f7c455c3

jhpoelen commented 4 years ago

@KatjaSchulz thanks for sharing. I had a peek and it is looking pretty good! For now, I am using the argumentReasonId and argumentReasonName as referenceUrl and referenceCitation resp.

With this, the basic "elton review" command line tool (and associated reports) now allows for tracing which record you weren't a fan of and why.

For instance, I'd use the information you provided to help determine why GloBI reported that some insect (Coelioxys sp.) was visiting the flowers of a bird (Galerida sp.). Perhaps, if there's some need and room, we can even expose this information in some Web UI.

{
  "reviewId": "4521105f-ff55-45c5-896d-bc7303beb359",
  "reviewDate": "2020-09-03T19:26:59Z",
  "reviewerName": "GloBI automated reviewer (elton-0.3.5-SNAPSHOT)",
  "reviewCommentType": "info",
  "reviewComment": "biotic interaction found",
  "namespace": "local",
  "context": {
    "interactionTypeNameVerbatim": "has flowers visited by",
    "interactionTypeName": "flowersVisitedBy",
    "refuted:sourceTaxonId": "GBIF:1338333",
    "refuted:targetTaxonKingdomName": "Animalia",
    "refuted:sourceTaxonRank": "genus",
    "targetTaxonKingdomName": "Animalia",
    "refuted:sourceTaxonPhylumName": "Arthropoda",
    "refuted:sourceTaxonKingdomName": "Animalia",
    "delimiter": "\t",
    "refuted:targetTaxonRank": "genus",
    "refuted:targetTaxonOrderName": "Passeriformes",
    "sourceCitation": "Biotic interaction data that failed Encyclopedia of Life data validation",
    "targetTaxonRank": "genus",
    "sourceTaxonId": "GBIF:1338333",
    "identifier": "EOLrefute_globi:assoc:2031055-GBIF:1338333-VISITS_FLOWERS_OF-GBIF:2490666",
    "sourceArchiveURI": "https://github.com/globalbioticinteractions/refuted-biotic-interactions-by-eol/blob/master/interactions.tsv",
    "refuted:sourceOccurrenceId": "globi:occur:source:2031055-GBIF:1338333-VISITS_FLOWERS_OF",
    "studyTitle": "Katja Schulz. 2020. Collection of refuted species associations claims provided by Enclyclopedia of Life. Accessed at <https://editors.eol.org/eol_php_code/applications/content_server/resources/interactions.tsv> on 03 Sep 2020.Records of organisms other than plants having flower visitors are probably errors",
    "refuted:targetTaxonFamilyName": "Alaudidae",
    "referenceUrl": "https://editors.eol.org/eol_php_code/applications/content_server/resources/interactions.tsv",
    "refuted:referenceDoi": "10.2317/0407.08.1",
    "studySourceCitation": "Katja Schulz. 2020. Collection of refuted species associations claims provided by Enclyclopedia of Life. Accessed at <https://editors.eol.org/eol_php_code/applications/content_server/resources/interactions.tsv> on 03 Sep 2020.",
    "refuted:sourceTaxonClassName": "Insecta",
    "argumentReasonID": "EOL-GloBI-validation7",
    "refuted:associationID": "globi:assoc:2031055-GBIF:1338333-VISITS_FLOWERS_OF-GBIF:2490666",
    "refuted:interactionTypeId": "http://purl.obolibrary.org/obo/RO_0002623",
    "interactionTypeIdVerbatim": "http://purl.obolibrary.org/obo/RO_0002623",
    "refuted:sourceTaxonGenusName": "Coelioxys",
    "refuted:sourceTaxonFamilyName": "Megachilidae",
    "referenceCitation": "Records of organisms other than plants having flower visitors are probably errors",
    "argumentTypeName": "refute",
    "headerRowCount": "1",
    "refuted:targetOccurrenceId": "globi:occur:target:2031055-GBIF:1338333-VISITS_FLOWERS_OF-GBIF:2490666",
    "sourceTaxonRank": "genus",
    "targetTaxonId": "GBIF:2490666",
    "refuted:sourceTaxonOrderName": "Hymenoptera",
    "argumentReasonName": "Records of organisms other than plants having flower visitors are probably errors",
    "interactionTypeId": "http://purl.obolibrary.org/obo/RO_0002623",
    "sourceTaxonKingdomName": "Animalia",
    "refuted:targetTaxonClassName": "Aves",
    "refuted:sourceCitation": "Seltmann, Katja C. 2020. Biotic species interactions about bees (Anthophila) manually extracted from literature.. Accessed at <https://github.com/Extended-Bee-Network/bee-interaction-database/archive/7aadc58b8ee258e2faa4488e0b9ffb0340563684.zip> on 02 Sep 2020.",
    "sourceTaxonName": "Coelioxys",
    "refuted:sourceTaxonName": "Coelioxys",
    "dcterms:bibliographicCitation": "Katja Schulz. 2020. Collection of refuted species associations claims provided by Enclyclopedia of Life.",
    "url": "https://editors.eol.org/eol_php_code/applications/content_server/resources/interactions.tsv",
    "refuted:referenceCitation": "Frankie, G. W., Thorp, R. W., Schindler, M., Hernandez, J., Ertter, B., & Rizzardi, M. (2005). Ecological Patterns of Bees and Their Host Ornamental Flowers in Two Northern California Cities.Journal of the Kansas Entomological Society,78(3), 227246. https://doi.org/10.2317/0407.08.1",
    "refuted:targetTaxonGenusName": "Galerida",
    "refuted:targetTaxonId": "GBIF:2490666",
    "refuted:referenceUrl": "https://doi.org/10.2317/0407.08.1",
    "refuted:targetTaxonPhylumName": "Chordata",
    "targetTaxonName": "Galerida",
    "argumentTypeId": "https://en.wiktionary.org/wiki/refute",
    "refuted:targetTaxonName": "Galerida"
  }
}
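
A minimal Python sketch of how the refuted: fields could be pulled out of such a review record. It assumes the record is available as a JSON string (how the reports are stored or streamed is not covered here), and the helper name refuted_fields is made up for illustration. The sample is an abbreviated copy of the record above.

```python
import json

PREFIX = "refuted:"


def refuted_fields(review: dict) -> dict:
    """Return only the context entries that describe the refuted record."""
    context = review.get("context", {})
    return {
        key[len(PREFIX):]: value
        for key, value in context.items()
        if key.startswith(PREFIX)
    }


# Abbreviated copy of the review record shown above.
sample = json.loads("""
{
  "reviewCommentType": "info",
  "context": {
    "argumentTypeName": "refute",
    "argumentReasonName": "Records of organisms other than plants having flower visitors are probably errors",
    "refuted:sourceTaxonName": "Coelioxys",
    "refuted:interactionTypeId": "http://purl.obolibrary.org/obo/RO_0002623",
    "refuted:targetTaxonName": "Galerida",
    "refuted:referenceDoi": "10.2317/0407.08.1"
  }
}
""")

refuted = refuted_fields(sample)
summary = "{} -> {} (doi:{}): {}".format(
    refuted["sourceTaxonName"],
    refuted["targetTaxonName"],
    refuted["referenceDoi"],
    sample["context"]["argumentReasonName"],
)
print(summary)
```

This keeps the reviewer's argument (argumentReasonName) separate from the verbatim record being refuted, which is the point of the refuted: prefix.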

jhpoelen commented 4 years ago

@KatjaSchulz one minor comment - I noticed that argumentReasonID was used instead of argumentReasonId (notice the ID -> Id). I prefer argumentReasonId because it would be consistent with other variable names. That said, I should probably cleanup some DwC field names accordingly.
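
Until the column is renamed upstream, a consumer could tolerate both spellings with something like the Python sketch below. This is only an illustration: the helper names are made up, and blindly rewriting every trailing ID would also rename DwC-derived columns such as refuted:associationID, which may or may not be desirable.

```python
import csv
import io


def normalize_id_suffix(name: str) -> str:
    """Rewrite a trailing 'ID' as 'Id' (argumentReasonID -> argumentReasonId)."""
    return name[:-2] + "Id" if name.endswith("ID") else name


def read_rows(tsv_text: str):
    """Yield rows of a tab-separated table with normalized column names."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        yield {normalize_id_suffix(key): value for key, value in row.items()}


# Two columns taken from the refutation table, tab-separated.
table = "argumentReasonID\targumentTypeName\nEOL-GloBI-validation7\trefute\n"
rows = list(read_rows(table))
```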

Also, I am waiting for changes to be indexed by GloBI to review how it'll look on the GloBI website.

KatjaSchulz commented 4 years ago

Oops, that was probably my bad. I'm pretty sure it was already that way in the last version. I'm sure it's easy to change. I'll ask Eli.

jhpoelen commented 4 years ago

@KatjaSchulz Thanks for adding the verbatim refuted interaction records to EOL's refuted interaction dataset.

Your changes have propagated. I've attached an example of a refuted horse pathogen. I leave the discussion whether this interaction is actually valid up to the experts.

[Image: Screenshot from 2020-09-07 17-06-49]

Closing issue for now.