Closed cnrdh closed 2 years ago
Documentation of transform, with input/output example.
Input
~/npolar/marine-db$ cat data/deposit/iopan/protist-biodiversity/total_database_npi2009-2013.tsv| ./bin/csv-transform --ndjson | ndjson-filter 'd.name === "MOSJ13" && d.no==="1670" && d.data==="2013-07-29" && d.takson==="Thalassiosira pacifica"'
{
"name": "MOSJ13",
"no": "1670",
"station ": "R10",
"depth [m]": "0",
"data": "2013-07-29",
"V-taken [ml]": "10",
"Vth filtered [L]": "32",
"V bottle [ml]": "100",
"Class/Phylum": "Bacillariophyceae ",
"takson": "Thalassiosira pacifica",
"takson_add": "",
"Taxon_full": "Thalassiosira pacifica ",
"AphiaID": "",
"K": "450.02",
"N": "6",
"fields": "60",
"magn": "10",
"cells in chamb": "45.002",
"cells in V bottle [ml]": "450.02",
"cells in 1000 ml": "14.063125",
"Gear": "Micro"
}
Output
~/npolar/marine-db$ cat data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson | ndjson-filter 'd.fieldNumber==="MOSJ13-1670" && d.scientificName==="Thalassiosira pacifica"'
{
"maximumDepthInMeters": 0,
"magnification": 10,
"identifiedBy": "iopan.pl",
"organismQuantityType": "cells/l",
"scientificName": "Thalassiosira pacifica",
"materialSampleID": "MOSJ13-1670@MOSJ2013",
"year": 2013,
"expedition": "MOSJ2013",
"locationID": "R10",
"fieldNumber": "MOSJ13-1670",
"basisOfRecord": "Occurrence",
"organismQuantity": 14.063125,
"individualCount": 6,
"sampleSizeValue": 0.42664770454646467,
"sampleSizeUnit": "l",
"occurrenceStatus": "present",
"quantificationStatus": "verified",
"fieldsInCount": 60,
"maxFields": 450.02,
"takenVolume": 10,
"bottleVolume": 100,
"initialVolume": 32,
"cellsInChamber": 45.002,
"gear": "Niskin bottle"
}
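The numbers in this example are consistent with a simple chain of calculations. A minimal Python sketch of that chain follows; the formulas are inferred from the input and output records above, not taken from the transform's source, and the function name `derive_quantities` is hypothetical.

```python
def derive_quantities(row):
    """Recompute the derived values of one input row (formulas inferred, not confirmed)."""
    n = float(row["N"])                      # cells counted -> individualCount
    k = float(row["K"])                      # appears as maxFields in the output
    fields = float(row["fields"])            # appears as fieldsInCount in the output
    taken_ml = float(row["V-taken [ml]"])
    bottle_ml = float(row["V bottle [ml]"])
    filtered_l = float(row["Vth filtered [L]"])

    cells_in_chamber = n * k / fields
    cells_in_bottle = cells_in_chamber * bottle_ml / taken_ml
    cells_per_litre = cells_in_bottle / filtered_l
    return {
        "fieldNumber": f'{row["name"]}-{row["no"]}',
        "cellsInChamber": cells_in_chamber,
        "organismQuantity": cells_per_litre,
        # Volume (in litres) that would hold exactly the counted cells
        "sampleSizeValue": n / cells_per_litre,
    }

# The relevant columns of the input record shown above
row = {"name": "MOSJ13", "no": "1670", "N": "6", "K": "450.02", "fields": "60",
       "V-taken [ml]": "10", "V bottle [ml]": "100", "Vth filtered [L]": "32"}
derived = derive_quantities(row)
print(derived)
```

If the inference holds, `sampleSizeValue` is simply `individualCount / organismQuantity`, which would explain the long decimal in the output record.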
The input Gear is a mixed bag:
95 "h. net"
796 "h.net"
818 "Micro"
1373 "micro"
548 "Micropl"
1695 "Niskin"
9627 "niskin"
2 null
After:
891 "Handnet"
14063 "Niskin bottle"
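The before/after counts can be cross-checked with a small Python sketch. The mapping below is an assumption reconstructed from the totals (hand-net spellings become "Handnet", everything else including the two nulls becomes "Niskin bottle"); `normalize_gear` is a hypothetical name, not the function used in the transform.

```python
from collections import Counter

# Raw Gear counts as listed above (None stands for the two null values)
RAW_GEAR_COUNTS = {
    "h. net": 95, "h.net": 796, "Micro": 818, "micro": 1373,
    "Micropl": 548, "Niskin": 1695, "niskin": 9627, None: 2,
}

def normalize_gear(raw):
    """Assumed rule: 'h. net'/'h.net' -> 'Handnet', anything else -> 'Niskin bottle'."""
    key = (raw or "").lower().replace(" ", "").replace(".", "")
    return "Handnet" if key == "hnet" else "Niskin bottle"

def normalized_counts(raw_counts):
    """Sum the raw counts per normalized gear name."""
    totals = Counter()
    for raw, n in raw_counts.items():
        totals[normalize_gear(raw)] += n
    return dict(totals)

print(normalized_counts(RAW_GEAR_COUNTS))
```

The totals only add up to 14063 if the two null values also end up as "Niskin bottle"; that is inferred from the counts alone.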
Passes GBIF validation, but there are quite a few taxon issues:
Taxon match higherrank: 4617
Taxon match none: 1203
Taxon match fuzzy: 35
No errors in quantification
~/npolar/marine-db$ cat data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson | ndjson-map '[d.gear,d.quantificationStatus]' | sort | uniq -c
715 ["Handnet","incalculable"]
20 ["Handnet","verified"]
22 ["Niskin bottle","calculated"]
13 ["Niskin bottle","incalculable"]
12677 ["Niskin bottle","verified"]
These occurrences must be merged with sampling event metadata. As it stands, a large number of occurrences from 2010 and 2012 have no matching event. This needs further work to clarify.
$ ndjson-join --left d.fieldNumber <(cat $total_database_npi | ./bin/dwc-occurrence-csv-transform | ndjson-filter 'd.expedition !== "MOSJ2011"') $events | ndjson-filter 'd[1]===null' | ndjson-map 'd[0]' | ndjson-map '[d.expedition]' | sort | uniq -c
309 ["ICE2010"]
180 ["ICE2012"]
Phew, not as bad as it looks: the 489 lines above come from just 20 missing samples:
~/npolar/marine-db$ ndjson-join --left d.fieldNumber data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson $events | ndjson-filter 'd[1]===null' | ndjson-map 'd[0]' | ndjson-map '[d.expedition,d.fieldNumber]' | sort | uniq -c
34 ["ICE2010","ICE10-152"]
32 ["ICE2010","ICE10-155"]
16 ["ICE2010","ICE10-156"]
13 ["ICE2010","ICE10-157"]
13 ["ICE2010","ICE10-158"]
18 ["ICE2010","ICE10-253"]
27 ["ICE2010","ICE10-379"]
36 ["ICE2010","ICE10-380"]
38 ["ICE2010","ICE10-381"]
46 ["ICE2010","ICE10-382"]
21 ["ICE2010","ICE10-383"]
15 ["ICE2010","ICE10-384"]
30 ["ICE2012","Agneta"]
30 ["ICE2012","Divehole"]
14 ["ICE2012","ICE12-760"]
15 ["ICE2012","ICE12-822"]
25 ["ICE2012","ICE12-Core2.1.1"]
27 ["ICE2012","ICE12-Core2.1.2"]
20 ["ICE2012","Pond"]
19 ["ICE2012","Ridge"]
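The left join on fieldNumber followed by the `d[1]===null` filter amounts to an anti-join: keep the occurrences that have no matching event. A minimal Python sketch of the same idea, using made-up toy records rather than the real files (`unmatched_samples` is a hypothetical helper):

```python
from collections import Counter

def unmatched_samples(occurrences, events):
    """Count occurrences per (expedition, fieldNumber) lacking an event with that fieldNumber."""
    known = {e["fieldNumber"] for e in events}
    return Counter(
        (o["expedition"], o["fieldNumber"])
        for o in occurrences
        if o["fieldNumber"] not in known
    )

# Toy data for illustration only; "ICE10-001" is an invented fieldNumber
occurrences = [
    {"expedition": "ICE2010", "fieldNumber": "ICE10-152"},
    {"expedition": "ICE2010", "fieldNumber": "ICE10-152"},
    {"expedition": "ICE2010", "fieldNumber": "ICE10-001"},
]
events = [{"fieldNumber": "ICE10-001"}]

missing = unmatched_samples(occurrences, events)
print(len(missing), "missing samples,", sum(missing.values()), "occurrence lines")
```

The number of keys is the number of missing samples and the sum of the values is the number of affected occurrence lines, which is the 20 vs. 489 distinction above.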
Coordinates (XY) could be recovered by matching on locationID, e.g. all ICE10-15x samples are from "R4", and all other R4 events from ICE10 are at [22.1166, 80.605]:
cat $events | grep ICE10- | grep R4 | ndjson-reduce 'p.x = [...new Set([...p.x, d.decimalLongitude])], p.y = [...new Set([...p.y, d.decimalLatitude])], p' '{x:[],y:[]}'
{"x":[22.1166],"y":[80.605]}
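A hedged Python sketch of how such a recovery could work: collect the distinct positions per locationID from the known events, and fill in a position only when a station has exactly one. The helper names and the toy event fieldNumbers are hypothetical; only the R4 coordinates come from the output above.

```python
from collections import defaultdict

def coordinates_by_location(events, prefix):
    """Map locationID -> set of (lon, lat) over events whose fieldNumber starts with prefix."""
    coords = defaultdict(set)
    for e in events:
        if e["fieldNumber"].startswith(prefix):
            coords[e["locationID"]].add((e["decimalLongitude"], e["decimalLatitude"]))
    return coords

def recover_xy(occurrence, coords):
    """Return (lon, lat) only if the station has exactly one known position, else None."""
    candidates = coords.get(occurrence["locationID"], set())
    return next(iter(candidates)) if len(candidates) == 1 else None

# Toy events; the fieldNumbers here are invented for illustration
events = [
    {"fieldNumber": "ICE10-901", "locationID": "R4",
     "decimalLongitude": 22.1166, "decimalLatitude": 80.605},
    {"fieldNumber": "ICE10-902", "locationID": "R4",
     "decimalLongitude": 22.1166, "decimalLatitude": 80.605},
]
coords = coordinates_by_location(events, "ICE10-")
xy_r4 = recover_xy({"fieldNumber": "ICE10-152", "locationID": "R4"}, coords)
xy_r6b = recover_xy({"fieldNumber": "ICE10-253", "locationID": "R6b"}, coords)
print(xy_r4, xy_r6b)
```

Refusing to guess when a station has zero or several positions keeps drifting ice stations from silently getting a wrong coordinate.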
$ ndjson-join --left d.fieldNumber data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson $events | ndjson-filter 'd[1]===null' | ndjson-map 'd[0]' | ndjson-map '[d.fieldNumber,d.locationID]' | sort | uniq -c
30 ["Agneta","Underice"]
30 ["Divehole","5mdepth"]
34 ["ICE10-152","R4"]
32 ["ICE10-155","R4"]
16 ["ICE10-156","R4"]
13 ["ICE10-157","R4"]
13 ["ICE10-158","R4"]
18 ["ICE10-253","R6b"]
27 ["ICE10-379","ICE10-16"]
36 ["ICE10-380","ICE10-16"]
38 ["ICE10-381","ICE10-16"]
46 ["ICE10-382","ICE10-16"]
21 ["ICE10-383","ICE10-16"]
15 ["ICE10-384","ICE10-16"]
14 ["ICE12-760","Floe1"]
15 ["ICE12-822","Floe1"]
25 ["ICE12-Core2.1.1","1mbelowice"]
27 ["ICE12-Core2.1.2","undericehole"]
20 ["Pond","1mbelowice"]
19 ["Ridge","Floe1"]
Found an alternative source of ICE10 data. Here all station R4 / bottle 15x samples appear to be dated "2010-08-19" (some Excel date mess).
Source:
Anette Wold/Marinbiology Database/Phytoplankton taxonomy/2010/Original IOPAS/konghau_database_completeICE10.xls
Alternate source of ICE2012 (2670 lines of data vs 2579 in "total database" :/)
Anette Wold/Marinbiology Database/Phytoplankton taxonomy/2012/Original/ice2012wcolumndatabase.xlsx
Consider swapping in ICE2012 from ice2012wcolumndatabase, but it needs cleaning; errors in JSON schema validation:
4 [{"keyword":"pattern","dataPath":".scientificName","schemaPath":"#/properties/scientificName/pattern","params":{"pattern":"^[A-Z][a-z\\s-]"},"message":"should match pattern \"^[A-Z][a-z\\s-]\""}]
43 [{"keyword":"type","dataPath":".maximumDepthInMeters","schemaPath":"#/properties/maximumDepthInMeters/type","params":{"type":"number,null"},"message":"should be number,null"}]
1 [{"keyword":"type","dataPath":".organismQuantity","schemaPath":"#/properties/organismQuantity/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"type","dataPath":".individualCount","schemaPath":"#/properties/individualCount/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"type","dataPath":".maximumDepthInMeters","schemaPath":"#/properties/maximumDepthInMeters/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"pattern","dataPath":".scientificName","schemaPath":"#/properties/scientificName/pattern","params":{"pattern":"^[A-Z][a-z\\s-]"},"message":"should match pattern \"^[A-Z][a-z\\s-]\""},{"keyword":"type","dataPath":".fieldsInCount","schemaPath":"#/properties/fieldsInCount/type","params":{"type":"integer"},"message":"should be integer"}]
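To find the offending rows before cleaning, the same checks can be approximated without a schema validator. This is a rough Python sketch that covers only the scientificName pattern and the number-or-null fields quoted in the errors above; it is not the validator that produced those messages, and `violations` is a hypothetical helper.

```python
import re
from collections import Counter

NAME_PATTERN = re.compile(r"^[A-Z][a-z\s-]")  # pattern quoted in the errors above
NUMBER_OR_NULL = ("maximumDepthInMeters", "organismQuantity", "individualCount")

def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def violations(record):
    """Return a sorted tuple of the field names that fail the checks."""
    failed = []
    name = record.get("scientificName")
    if isinstance(name, str) and not NAME_PATTERN.search(name):
        failed.append("scientificName")
    for field in NUMBER_OR_NULL:
        value = record.get(field)
        if value is not None and not is_number(value):
            failed.append(field)
    return tuple(sorted(failed))

# Toy records for illustration only
records = [
    {"scientificName": "Thalassiosira pacifica", "maximumDepthInMeters": 0},
    {"scientificName": "thalassiosira sp.", "maximumDepthInMeters": 5},
    {"scientificName": "Thalassiosira pacifica", "maximumDepthInMeters": "0-5"},
]
tally = Counter(violations(r) for r in records if violations(r))
print(tally)
```

Grouping by the tuple of failing fields gives a tally of the same shape as the `uniq -c` output above.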
Checking 2009 against the alternate source ALK2009_Mercl2009.xls, which contains 1920 Niskin + 662 Micro = 2582.
Total database has 1944 + 633 = 2577 when counting 2009 by expedition (and 2617 when counting by year equal to 2009, but the extra 40 are "ICE10-116" occurrences and thus already excluded):
$ cat $total_database_npi | ./bin/csv-transform --ndjson | ndjson-filter '/09/.test(d.name)' | ndjson-map d.Gear | sort | uniq -c
633 "micro"
1944 "niskin"
The "total database" contains:
The following are excluded, since there are alternate sources with more data.
After removal, we are left with: