iobis / obis-issues

Repository for all OBIS related issues and feature requests

CSV download issues #181

Open Mesibov opened 3 years ago

Mesibov commented 3 years ago
  1. In a recent CSV download, scientificName in the source table became originalScientificName, and the processed scientificName was taken from WoRMS. This corrects spelling and replaces unaccepted names with accepted ones. However:

    • scientificNameAuthorship in the download was the same as in the source - the authority was for the originalScientificName, not the processed one, which in many cases was different.
    • aphiaID was the ID for the processed name, but the scientificNameID field in the download was the same as the one in source, not for the processed name.

Example:

scientificName (in source) = Polydora brachycephala
originalScientificName (in CSV) = Polydora brachycephala
scientificName (after OBIS processing) = Dipolydora caulleryi

scientificNameAuthorship (in source and in CSV) = Hartman, 1936
correct scientificNameAuthorship for processed scientificName (not in CSV) = (Mesnil, 1897)

scientificNameID (in source and in CSV) = urn:lsid:marinespecies.org:taxname:338507
aphiaID (after OBIS processing) = 131116
  2. It would be good to offer TSV instead of CSV, as GBIF does. In a recent download, quoting was inconsistent: most text items were quoted, but "true" and "false" were not; "month" and "day" field items were quoted, but "year" field items were not. Items already quoted in the source table were quoted twice, e.g. "Tasu Sound; ""Submarine Rock""" > """Tasu Sound; """"Submarine Rock""""""". (No quoting is needed in a TSV, of course.)

  3. In the same download, items in the bibliographicCitation field were truncated at 160 characters and the 6-character string " [...]" was appended. Why the truncation?
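
To illustrate the double quoting in the second point, here is a minimal Python sketch. It assumes the export applies standard CSV quoting (wrap the field in quotes, double any embedded quotes) to a value that already carries its CSV quoting from the source table. That is a guess at the cause, not a description of the actual OBIS export code. The helper names `csv_quote` and `tsv_line` are hypothetical.

```python
import csv
import io


def csv_quote(value: str) -> str:
    """Quote one field the way a standard CSV writer does (QUOTE_ALL)."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([value])
    return buf.getvalue().rstrip("\n")


def tsv_line(fields: list) -> str:
    """Join fields with tabs; no quoting needed if fields hold no tabs or newlines."""
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("field contains a tab or newline: %r" % field)
    return "\t".join(fields)


# The value as a person would read it in the source table.
raw = 'Tasu Sound; "Submarine Rock"'

# Quoted once: the correct CSV representation of the value.
once = csv_quote(raw)

# Quoted twice: what happens if the already-quoted text is quoted again.
twice = csv_quote(once)

# A CSV reader undoes only one layer, so the user sees stray quotes.
read_back = next(csv.reader([twice]))[0]

print(once)
print(twice)
print(read_back)
print(tsv_line(["Tasu Sound; \"Submarine Rock\"", "true", "2019"]))
```

Reading `twice` with a CSV parser gives back `once` rather than the original value, which is why the extra layer is visible to users. In the TSV line, the embedded quotes pass through unchanged.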

pieterprovoost commented 3 years ago

Thanks for reporting.

  1. Your analysis is correct. Of the provided fields we currently only replace scientificName with the accepted alternative from WoRMS. The intention was to make data analysis easier for users but I understand that this can be very confusing in combination with the other taxonomy fields. We'll look into a solution to make a clearer distinction between provided fields and annotations.
  2. The way the downloads are set up right now, quoting rules are determined on a batch-by-batch basis, hence the inconsistencies. TSV may offer a solution there.
  3. This was a quick solution because we had issues with datasets containing multiple MB of citations in every single record. I'll look into this.
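
Until this is resolved, users of a download could flag affected records themselves. The following Python sketch is only an illustration: it assumes each record is available as a dict keyed by the field names mentioned in this thread, that scientificNameID is a WoRMS LSID of the form shown in the example, and that truncated citations end with the " [...]" marker. The function name `check_row` is hypothetical.

```python
import re

# Marker appended to truncated citations, as reported in this issue.
TRUNCATION_MARKER = " [...]"

# WoRMS LSID form used in the example; the trailing digits are the AphiaID.
LSID_PATTERN = re.compile(r"urn:lsid:marinespecies\.org:taxname:(\d+)$")


def check_row(row: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []

    # scientificNameID is as provided; aphiaID refers to the processed name.
    match = LSID_PATTERN.match(str(row.get("scientificNameID", "")))
    if match and match.group(1) != str(row.get("aphiaID", "")):
        problems.append("scientificNameID does not match aphiaID")

    # If the name was replaced, the authorship may still belong to the original.
    original = row.get("originalScientificName")
    if original and original != row.get("scientificName"):
        problems.append("scientificName replaced; check scientificNameAuthorship")

    # Citations cut off by the export end with the marker.
    if str(row.get("bibliographicCitation", "")).endswith(TRUNCATION_MARKER):
        problems.append("bibliographicCitation appears truncated")

    return problems


# Record built from the example values given in this issue.
example = {
    "originalScientificName": "Polydora brachycephala",
    "scientificName": "Dipolydora caulleryi",
    "scientificNameAuthorship": "Hartman, 1936",
    "scientificNameID": "urn:lsid:marinespecies.org:taxname:338507",
    "aphiaID": 131116,
    "bibliographicCitation": "x" * 160 + TRUNCATION_MARKER,
}

for problem in check_row(example):
    print(problem)
```

The check only reports mismatches. Deciding which authorship or identifier is correct still requires a lookup against WoRMS.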