npolar / marine-db

https://doi.org/10.21334/marine-db
Convert IOPAN protist data from "total database" 2009-2013 into Darwin Core #44

Closed cnrdh closed 2 years ago

cnrdh commented 2 years ago

The "total database" contains:

~/npolar/marine-db$ cat data/deposit/iopan/protist-biodiversity/total_database_npi2009-2013.tsv | ./bin/csv-transform --ndjson | ndjson-map d.name| sort | uniq -c
   1468 "ALK09"
    961 "ALK10"
   3424 "ICE10"
   2579 "ICE12"
   1109 "MER09.01"
   1507 "MOSJ11"
   1638 "MOSJ12"
   2268 "MOSJ13"

Some expeditions are excluded, since there are alternate sources with more data. Comparing the two listings, these are ICE10, ICE12 and MOSJ11.

After removal, we are left with:


~/npolar/marine-db$ cat data/input/iopan/2009-2012-2013-protist-biodiversity-iopan.ndjson| ndjson-map [d.year,d.expedition] | sort | uniq -c
   1468 [2009,"Alkekonge-2009"]
   1109 [2009,"MERCLIM-2009"]
    961 [2010,"Alkekonge-2010"]
   1638 [2012,"MOSJ2012"]
   2268 [2013,"MOSJ2013"]
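The mapping from the short name codes to [year, expedition] can be read off the two listings above. A minimal Python sketch, assuming the prefix templates below (the ICE and MOSJ2011 forms are taken from later comments in this thread; `expedition_for`, `ROWS_BY_NAME` and `EXCLUDED` are hypothetical names, not the repository's code):

```python
import re

# Templates inferred from the listings in this thread (an assumption,
# not the repository's actual transform).
EXPEDITION_TEMPLATES = {
    "ALK": "Alkekonge-{year}",
    "MER": "MERCLIM-{year}",
    "MOSJ": "MOSJ{year}",
    "ICE": "ICE{year}",
}

# Row counts per name code, copied from the first listing.
ROWS_BY_NAME = {
    "ALK09": 1468, "ALK10": 961, "ICE10": 3424, "ICE12": 2579,
    "MER09.01": 1109, "MOSJ11": 1507, "MOSJ12": 1638, "MOSJ13": 2268,
}

# Name codes that are absent from the second listing.
EXCLUDED = {"ICE10", "ICE12", "MOSJ11"}


def expedition_for(name):
    """Return (year, expedition) for a name code such as "MER09.01"."""
    match = re.match(r"([A-Z]+)(\d{2})", name)
    if match is None or match.group(1) not in EXPEDITION_TEMPLATES:
        raise ValueError(f"unrecognised name code: {name!r}")
    prefix, two_digit_year = match.groups()
    year = 2000 + int(two_digit_year)
    return year, EXPEDITION_TEMPLATES[prefix].format(year=year)


kept = {name: rows for name, rows in ROWS_BY_NAME.items() if name not in EXCLUDED}
for name, rows in sorted(kept.items()):
    print(rows, expedition_for(name))
print("rows kept:", sum(kept.values()), "of", sum(ROWS_BY_NAME.values()))
```

The per-expedition counts of the kept rows line up with the second listing, which is a cheap sanity check that nothing else was dropped.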

cnrdh commented 2 years ago

Documentation of transform, with input/output example.

Input

~/npolar/marine-db$ cat data/deposit/iopan/protist-biodiversity/total_database_npi2009-2013.tsv| ./bin/csv-transform --ndjson | ndjson-filter 'd.name === "MOSJ13" && d.no==="1670" && d.data==="2013-07-29" && d.takson==="Thalassiosira pacifica"'
{
    "name": "MOSJ13",
    "no": "1670",
    "station ": "R10",
    "depth [m]": "0",
    "data": "2013-07-29",
    "V-taken [ml]": "10",
    "Vth filtered [L]": "32",
    "V bottle [ml]": "100",
    "Class/Phylum": "Bacillariophyceae ",
    "takson": "Thalassiosira pacifica",
    "takson_add": "",
    "Taxon_full": "Thalassiosira pacifica ",
    "AphiaID": "",
    "K": "450.02",
    "N": "6",
    "fields": "60",
    "magn": "10",
    "cells in chamb": "45.002",
    "cells in V bottle [ml]": "450.02",
    "cells in 1000 ml": "14.063125",
    "Gear": "Micro"
}

Output

~/npolar/marine-db$ cat data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson | ndjson-filter 'd.fieldNumber==="MOSJ13-1670" && d.scientificName==="Thalassiosira pacifica"'
{
    "maximumDepthInMeters": 0,
    "magnification": 10,
    "identifiedBy": "iopan.pl",
    "organismQuantityType": "cells/l",
    "scientificName": "Thalassiosira pacifica",
    "materialSampleID": "MOSJ13-1670@MOSJ2013",
    "year": 2013,
    "expedition": "MOSJ2013",
    "locationID": "R10",
    "fieldNumber": "MOSJ13-1670",
    "basisOfRecord": "Occurrence",
    "organismQuantity": 14.063125,
    "individualCount": 6,
    "sampleSizeValue": 0.42664770454646467,
    "sampleSizeUnit": "l",
    "occurrenceStatus": "present",
    "quantificationStatus": "verified",
    "fieldsInCount": 60,
    "maxFields": 450.02,
    "takenVolume": 10,
    "bottleVolume": 100,
    "initialVolume": 32,
    "cellsInChamber": 45.002,
    "gear": "Niskin bottle"
}
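The field mapping implied by this single input/output pair can be sketched in Python as below. This is a reconstruction from the example only, not the actual `csv-transform` code: `to_occurrence` and `to_number` are hypothetical names, the year and expedition are passed in rather than derived, and constant or derived terms such as `gear`, `identifiedBy` and `quantificationStatus` are left out. The `sampleSizeValue` rule (count divided by cells per litre) is an assumption that happens to reproduce the value shown.

```python
import json


def to_number(text):
    """Parse a numeric string; blank becomes None, whole numbers become int."""
    text = text.strip()
    if not text:
        return None
    value = float(text)
    return int(value) if value.is_integer() else value


def to_occurrence(row, year, expedition):
    """Map one "total database" row to Darwin Core-style terms."""
    # Some column names and values carry stray whitespace ("station ").
    row = {key.strip(): value.strip() for key, value in row.items()}
    field_number = f'{row["name"]}-{row["no"]}'
    count = to_number(row["N"])
    quantity = to_number(row["cells in 1000 ml"])
    # Assumption: sample size is the volume implied by count / density.
    sample_size = count / quantity if count is not None and quantity else None
    return {
        "fieldNumber": field_number,
        "materialSampleID": f"{field_number}@{expedition}",
        "year": year,
        "expedition": expedition,
        "locationID": row["station"],
        "scientificName": row["takson"],
        "maximumDepthInMeters": to_number(row["depth [m]"]),
        "individualCount": count,
        "organismQuantity": quantity,
        "organismQuantityType": "cells/l",
        "sampleSizeValue": sample_size,
        "sampleSizeUnit": "l",
        "magnification": to_number(row["magn"]),
        "fieldsInCount": to_number(row["fields"]),
        "maxFields": to_number(row["K"]),
        "takenVolume": to_number(row["V-taken [ml]"]),
        "bottleVolume": to_number(row["V bottle [ml]"]),
        "initialVolume": to_number(row["Vth filtered [L]"]),
        "cellsInChamber": to_number(row["cells in chamb"]),
    }


# The input row shown above, reduced to the columns the sketch reads.
example_row = {
    "name": "MOSJ13", "no": "1670", "station ": "R10", "depth [m]": "0",
    "V-taken [ml]": "10", "Vth filtered [L]": "32", "V bottle [ml]": "100",
    "takson": "Thalassiosira pacifica", "K": "450.02", "N": "6",
    "fields": "60", "magn": "10", "cells in chamb": "45.002",
    "cells in 1000 ml": "14.063125",
}
record = to_occurrence(example_row, 2013, "MOSJ2013")
print(json.dumps(record, indent=4))
```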

cnrdh commented 2 years ago

The input Gear is a mixed bag:

     95 "h. net"
    796 "h.net"
    818 "Micro"
   1373 "micro"
    548 "Micropl"
   1695 "Niskin"
   9627 "niskin"
      2 null

After:

     891 "Handnet"
  14063 "Niskin bottle"
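The before/after tallies are consistent with a simple rule: anything spelled like "h.net" becomes "Handnet", and everything else (including the two nulls) becomes "Niskin bottle". A minimal Python sketch of that rule, inferred from the counts alone; `normalise_gear` is a hypothetical name and the real transform may be stricter:

```python
import re
from collections import Counter

# Raw Gear tallies copied from the "mixed bag" listing above.
GEAR_COUNTS = {
    "h. net": 95, "h.net": 796, "Micro": 818, "micro": 1373,
    "Micropl": 548, "Niskin": 1695, "niskin": 9627, None: 2,
}


def normalise_gear(raw):
    """Collapse the raw Gear spellings into the two output values."""
    if raw is not None and re.fullmatch(r"h\.\s*net", raw.strip(), re.IGNORECASE):
        return "Handnet"
    # Assumption: all other values, missing ones included, are bottle samples.
    return "Niskin bottle"


after = Counter()
for raw, rows in GEAR_COUNTS.items():
    after[normalise_gear(raw)] += rows
print(dict(after))
```

Summing the raw tallies through this rule gives the same two totals as the "After" listing.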

cnrdh commented 2 years ago

Passes GBIF validation, but there are quite a few taxon issues:

Taxon match higherrank: 4617
Taxon match none: 1203
Taxon match fuzzy: 35

cnrdh commented 2 years ago

No errors in quantification

~/npolar/marine-db$ cat data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson | ndjson-map [d.gear,d.quantificationStatus] | sort | uniq -c
    715 ["Handnet","incalculable"]
     20 ["Handnet","verified"]
     22 ["Niskin bottle","calculated"]
     13 ["Niskin bottle","incalculable"]
  12677 ["Niskin bottle","verified"]
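The arithmetic behind the check can be reconstructed from the MOSJ13-1670 example earlier in this thread: N x K / fields gives cells in the chamber, scaling by bottle volume over taken volume gives cells in the bottle, and dividing by the filtered volume gives cells per litre. The Python sketch below encodes that chain. The meaning attached to the status values is an assumption (the thread does not define them), "error" is a hypothetical label for a mismatch, and magnification does not appear to enter the calculation in that example.

```python
import math


def cells_per_litre(n, k, fields, v_taken_ml, v_bottle_ml, v_filtered_l):
    """Recompute cells per litre from the raw counting columns."""
    cells_in_chamber = n * k / fields
    cells_in_bottle = cells_in_chamber * v_bottle_ml / v_taken_ml
    return cells_in_bottle / v_filtered_l


def quantification_status(supplied, n, k, fields, v_taken_ml, v_bottle_ml,
                          v_filtered_l, rel_tol=1e-6):
    """Classify a row; the status semantics here are assumed, not documented."""
    values = (n, k, fields, v_taken_ml, v_bottle_ml, v_filtered_l)
    divisors = (fields, v_taken_ml, v_filtered_l)
    if any(v is None for v in values) or any(d == 0 for d in divisors):
        return "incalculable"  # an input is missing or would divide by zero
    computed = cells_per_litre(*values)
    if supplied is None:
        return "calculated"  # no supplied value to compare against
    if math.isclose(supplied, computed, rel_tol=rel_tol):
        return "verified"
    return "error"  # hypothetical label for a disagreement


# Values from the MOSJ13-1670 / Thalassiosira pacifica example.
print(cells_per_litre(6, 450.02, 60, 10, 100, 32))
print(quantification_status(14.063125, 6, 450.02, 60, 10, 100, 32))
```

Under this reading, hand-net rows being mostly "incalculable" would follow from a missing filtered volume, but that is a guess.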

cnrdh commented 2 years ago

These occurrences must be merged with sampling event metadata. As it stands now, there are a large number of non-matches from 2010 and 2012. Needs further work to clarify.


$ ndjson-join --left d.fieldNumber <(cat $total_database_npi | ./bin/dwc-occurrence-csv-transform  | ndjson-filter 'd.expedition !== "MOSJ2011"') $events | ndjson-filter 'd[1]===null' | ndjson-map d[0] | ndjson-map '[d.expedition]' | sort | uniq -c
    309 ["ICE2010"]
    180 ["ICE2012"]

cnrdh commented 2 years ago

Phew, not as bad: the 489 lines above come from just 20 missing samples:


~/npolar/marine-db$ ndjson-join --left d.fieldNumber data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson $events | ndjson-filter 'd[1]===null' | ndjson-map d[0] | ndjson-map '[d.expedition,d.fieldNumber]' | sort | uniq -c
     34 ["ICE2010","ICE10-152"]
     32 ["ICE2010","ICE10-155"]
     16 ["ICE2010","ICE10-156"]
     13 ["ICE2010","ICE10-157"]
     13 ["ICE2010","ICE10-158"]
     18 ["ICE2010","ICE10-253"]
     27 ["ICE2010","ICE10-379"]
     36 ["ICE2010","ICE10-380"]
     38 ["ICE2010","ICE10-381"]
     46 ["ICE2010","ICE10-382"]
     21 ["ICE2010","ICE10-383"]
     15 ["ICE2010","ICE10-384"]
     30 ["ICE2012","Agneta"]
     30 ["ICE2012","Divehole"]
     14 ["ICE2012","ICE12-760"]
     15 ["ICE2012","ICE12-822"]
     25 ["ICE2012","ICE12-Core2.1.1"]
     27 ["ICE2012","ICE12-Core2.1.2"]
     20 ["ICE2012","Pond"]
     19 ["ICE2012","Ridge"]
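The `ndjson-join --left` plus `d[1]===null` filter amounts to a left anti-join on fieldNumber. A small Python sketch of the same idea, with made-up sample rows for illustration ("ICE10-100" is hypothetical; `unmatched_samples` is not a function from the repository):

```python
from collections import Counter


def unmatched_samples(occurrences, events, key="fieldNumber"):
    """Count occurrence lines per (expedition, sample) lacking an event."""
    known = {event[key] for event in events}
    return Counter(
        (occ["expedition"], occ[key])
        for occ in occurrences
        if occ[key] not in known
    )


# Illustrative data only: three lines for one unmatched sample,
# two lines for a sample that does have event metadata.
occurrences = (
    [{"expedition": "ICE2010", "fieldNumber": "ICE10-152"}] * 3
    + [{"expedition": "ICE2010", "fieldNumber": "ICE10-100"}] * 2
)
events = [{"fieldNumber": "ICE10-100"}]

missing = unmatched_samples(occurrences, events)
print(len(missing), "missing samples,", sum(missing.values()), "occurrence lines")
```

Counting distinct keys rather than lines is what turns 489 unmatched lines into 20 missing samples.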

cnrdh commented 2 years ago

Coordinates (XY) could be recovered by matching on locationID, e.g. all ICE10-15x samples are from "R4", and all other R4 events from ICE10 are at [22.1166, 80.605]:

cat $events | grep ICE10- | grep R4 | ndjson-reduce 'p.x = [...new Set([...p.x, d.decimalLongitude])], p.y = [...new Set([...p.y, d.decimalLatitude])], p' '{x:[],y:[]}'
{"x":[22.1166],"y":[80.605]}

$ ndjson-join --left d.fieldNumber data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson $events | ndjson-filter 'd[1]===null' | ndjson-map d[0] | ndjson-map [d.fieldNumber,d.locationID] | sort  | uniq -c
     30 ["Agneta","Underice"]
     30 ["Divehole","5mdepth"]
     34 ["ICE10-152","R4"]
     32 ["ICE10-155","R4"]
     16 ["ICE10-156","R4"]
     13 ["ICE10-157","R4"]
     13 ["ICE10-158","R4"]
     18 ["ICE10-253","R6b"]
     27 ["ICE10-379","ICE10-16"]
     36 ["ICE10-380","ICE10-16"]
     38 ["ICE10-381","ICE10-16"]
     46 ["ICE10-382","ICE10-16"]
     21 ["ICE10-383","ICE10-16"]
     15 ["ICE10-384","ICE10-16"]
     14 ["ICE12-760","Floe1"]
     15 ["ICE12-822","Floe1"]
     25 ["ICE12-Core2.1.1","1mbelowice"]
     27 ["ICE12-Core2.1.2","undericehole"]
     20 ["Pond","1mbelowice"]
     19 ["Ridge","Floe1"]
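One way to act on this is to build a lookup from (expedition, locationID) to coordinates out of the events that do exist, and fill in only where a location maps to exactly one position. The Python sketch below assumes events carry `expedition` and `locationID` fields alongside `decimalLongitude` and `decimalLatitude`; the event rows other than the R4 position, and both function names, are hypothetical.

```python
from collections import defaultdict


def coordinates_by_location(events):
    """Map (expedition, locationID) to a position when it is unambiguous."""
    seen = defaultdict(set)
    for event in events:
        key = (event["expedition"], event["locationID"])
        seen[key].add((event["decimalLongitude"], event["decimalLatitude"]))
    # Locations with more than one distinct position are left out on purpose.
    return {key: next(iter(coords)) for key, coords in seen.items() if len(coords) == 1}


def recover_coordinates(occurrence, lookup):
    """Return a copy of the occurrence with coordinates filled in if known."""
    key = (occurrence["expedition"], occurrence["locationID"])
    if key not in lookup:
        return occurrence
    longitude, latitude = lookup[key]
    return {**occurrence, "decimalLongitude": longitude, "decimalLatitude": latitude}


events = [
    # R4 position as reported by the ndjson-reduce above.
    {"expedition": "ICE2010", "locationID": "R4",
     "decimalLongitude": 22.1166, "decimalLatitude": 80.605},
    {"expedition": "ICE2010", "locationID": "R4",
     "decimalLongitude": 22.1166, "decimalLatitude": 80.605},
    # Hypothetical drifting station with two positions: not usable.
    {"expedition": "ICE2012", "locationID": "Floe1",
     "decimalLongitude": 10.0, "decimalLatitude": 82.0},
    {"expedition": "ICE2012", "locationID": "Floe1",
     "decimalLongitude": 10.5, "decimalLatitude": 82.1},
]
lookup = coordinates_by_location(events)
fixed = recover_coordinates(
    {"expedition": "ICE2010", "fieldNumber": "ICE10-152", "locationID": "R4"}, lookup
)
print(fixed)
```

Refusing to fill ambiguous locations matters for the ice stations, where a locationID such as "Floe1" may not correspond to a single fixed position.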

cnrdh commented 2 years ago

Found an alternative source of ICE10 data. Here all station R4 / bottle 15x samples appear to be from "2010-08-19" (some Excel date mess).

Source: Anette Wold/Marinbiology Database/Phytoplankton taxonomy/2010/Original IOPAS/konghau_database_completeICE10.xls

cnrdh commented 2 years ago

Alternate source of ICE2012 (2670 lines of data vs 2579 in "total database" :/)

Anette Wold/Marinbiology Database/Phytoplankton taxonomy/2012/Original/ice2012wcolumndatabase.xlsx

cnrdh commented 2 years ago

Consider swapping in ICE2012 from ice2012wcolumndatabase, but it needs cleaning first; errors in JSON schema validation:


      4 [{"keyword":"pattern","dataPath":".scientificName","schemaPath":"#/properties/scientificName/pattern","params":{"pattern":"^[A-Z][a-z\\s-]"},"message":"should match pattern \"^[A-Z][a-z\\s-]\""}]
     43 [{"keyword":"type","dataPath":".maximumDepthInMeters","schemaPath":"#/properties/maximumDepthInMeters/type","params":{"type":"number,null"},"message":"should be number,null"}]
      1 [{"keyword":"type","dataPath":".organismQuantity","schemaPath":"#/properties/organismQuantity/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"type","dataPath":".individualCount","schemaPath":"#/properties/individualCount/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"type","dataPath":".maximumDepthInMeters","schemaPath":"#/properties/maximumDepthInMeters/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"pattern","dataPath":".scientificName","schemaPath":"#/properties/scientificName/pattern","params":{"pattern":"^[A-Z][a-z\\s-]"},"message":"should match pattern \"^[A-Z][a-z\\s-]\""},{"keyword":"type","dataPath":".fieldsInCount","schemaPath":"#/properties/fieldsInCount/type","params":{"type":"integer"},"message":"should be integer"}]
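The five constraints named in these messages can be approximated without a schema library, which is handy for spotting offending rows before cleaning. The Python sketch below covers only the keywords visible in the errors above, not the full schema; `failing_paths` is a hypothetical helper and the sample values are invented, since the thread does not show the bad rows themselves.

```python
import re

# Pattern copied from the error message ("\\s" in JSON is "\s" in the regex).
NAME_PATTERN = re.compile(r"^[A-Z][a-z\s-]")
NULLABLE_NUMBERS = ("maximumDepthInMeters", "organismQuantity", "individualCount")


def is_number(value):
    """True for int/float but not bool, which Python treats as an int."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def failing_paths(record):
    """Return the dataPath of each constraint the record violates."""
    failures = []
    name = record.get("scientificName")
    # As in JSON Schema, "pattern" is only applied to string values.
    if isinstance(name, str) and NAME_PATTERN.search(name) is None:
        failures.append(".scientificName")
    for term in NULLABLE_NUMBERS:
        value = record.get(term)
        if value is not None and not is_number(value):
            failures.append("." + term)
    if "fieldsInCount" in record:
        fields = record["fieldsInCount"]
        if not isinstance(fields, int) or isinstance(fields, bool):
            failures.append(".fieldsInCount")
    return failures


# Invented examples of the kind of value that would trigger each message.
print(failing_paths({"scientificName": "thalassiosira", "maximumDepthInMeters": "5-10"}))
print(failing_paths({"scientificName": "Thalassiosira pacifica", "fieldsInCount": 60}))
```

A depth given as a range or a lower-case taxon name are plausible causes, but the actual offending values would have to be inspected in the spreadsheet.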

cnrdh commented 2 years ago

Checking 2009 against the alternate source ALK2009_Mercl2009.xls, which contains 1920 Niskin + 662 Micro = 2582. The total database has 1944 + 633 = 2577 when counting 2009 in expedition [and 2617 with year equal to 2009, oh my, but the extra 40 are "ICE10-116" occurrences and thus excluded already].


$ cat $total_database_npi | ./bin/csv-transform --ndjson | ndjson-filter '/09/.test(d.name)' | ndjson-map d.Gear | sort | uniq -c
    633 "micro"
   1944 "niskin"
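The reconciliation can be written out as a few lines of Python, using only the figures quoted in this comment (the variable names are illustrative):

```python
# Counts quoted above; keys unify the "niskin"/"micro" spellings.
alternate = {"Niskin": 1920, "Micro": 662}  # ALK2009_Mercl2009.xls
total_db = {"Niskin": 1944, "Micro": 633}   # "total database", names containing 09

rows_with_year_2009 = 2617
ice10_116_rows = 40  # dated 2009 but belonging to ICE10, already excluded

per_gear_difference = {gear: total_db[gear] - alternate[gear] for gear in alternate}
print("alternate total:", sum(alternate.values()))
print("total database:", sum(total_db.values()))
print("year 2009 minus ICE10-116:", rows_with_year_2009 - ice10_116_rows)
print("per-gear difference:", per_gear_difference)
```

The totals differ by only 5 rows, but the per-gear split differs by more than that in both directions, which suggests some rows are labelled with a different gear in the two sources.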