Closed Dylan-Pugh closed 2 years ago
@albenson-usgs, you can find the final Darwin Core files at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON/data/processed
A description of the process is at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON
Thanks Dylan! A few things to change / add before we can load this in the IPT.
[ ] The event file has the units row that comes from accessing via ERDDAP. This needs to be removed as I can't have an eventDate = "UTC" or decimalLatitude = "degrees_north". You can get a version from ERDDAP without this row using the csvp file type I believe.
[ ] For geodeticDatum, GBIF wants either "WGS84" or "EPSG:4326" but not both. This is my fault and a legacy of how I was taught by my predecessor, and I only recently learned that GBIF flags "EPSG:4326 WGS84". Sorry about that, but since we are updating anyway it's worth editing.
[ ] occurrenceID needs to be unique for each row in the occurrence file, so in your data there need to be 1440 unique occurrenceIDs. Right now it's just a single occurrenceID for all rows.
[ ] There is a column Sample_Split in the measurement or fact file and that will currently be dropped if loaded into the IPT as it stands now. I'm not sure what this represents. It needs to be worked in as another measurement/fact.
[ ] There is no data in measurementType. I realize the measurementTypeID will provide the measurement type, but it would be good to have a human-readable version as well, especially because it helps me in reviewing the data.
[ ] occurrenceID is missing for all rows in the measurement or fact file, but I imagine some of these are measurements of the occurrence, like the ones with units grams, grams/m2, or microns.
[ ] No need to include columns with no data in them like measurementAccuracy
[ ] The data in measurementRemarks seems like some of it might be better in the event or occurrence file. For instance "Comments: 74 total pteropods in light box" seems like it would be better in occurrenceRemarks and "Comments: GOMCES station bt0401" seems like it would be better for eventRemarks. No need to include a note that you couldn't find something in NERC. You can ask for a sampling device to be added here.
[ ] Where can I find the metadata for this? Also, can you provide a dataset shortname that should be used in the IPT? This will be part of the URL that is created for the dataset, and it will be the label applied to the DwC-A when someone downloads it from the IPT.
[ ] The measurementMethod should be unique to each measurementType.
Some of the columns that are not following Darwin Core will be dropped (e.g. acceptedname, Station_Depth, etc) but as long as that is what you were expecting would happen that's fine to leave them in.
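As a rough illustration of the first two checklist items (the ERDDAP units row and the geodeticDatum value), here is a minimal sketch using only the Python standard library. The sample CSV, the function name `clean_event_rows`, and all of the values in it are made up for illustration; this is not the conversion script from the repository.

```python
import csv
import io

# Made-up ERDDAP-style CSV: the row right after the header holds units, not data.
raw = """eventID,eventDate,decimalLatitude,geodeticDatum
,UTC,degrees_north,
event1,2010-01-01T00:00:00Z,42.0,EPSG:4326 WGS84
"""

def clean_event_rows(text):
    """Drop the units row and normalize geodeticDatum to a single value."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["eventDate"] == "UTC":
            continue  # units row: "UTC" is a unit label, not a date
        if row["geodeticDatum"] == "EPSG:4326 WGS84":
            row["geodeticDatum"] = "WGS84"  # GBIF wants one value, not both
        rows.append(row)
    return rows

events = clean_event_rows(raw)
```

Requesting the csvp file type from ERDDAP, as suggested above, would avoid the units row in the first place; the filter here is only a fallback for files that already contain it.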
Thanks @albenson-usgs! That's very helpful - I'll start working through these today.
I've made a number of updates here, and opened a new PR #103 with the updated script & output files.
A quick question about the measurementRemarks field:
The data in measurementRemarks seems like some of it might be better in the event or occurrence file. For instance "Comments: 74 total pteropods in light box" seems like it would be better in occurrenceRemarks and "Comments: GOMCES station bt0401" seems like it would be better for eventRemarks.
Currently the data in this field comes from the COMMENTS field in the dataset. These comments can be anything, so I don't see a way to programmatically parse them into either occurrenceRemarks or eventRemarks based on the content of a given comment. I'm thinking the three options would be:
occurrenceRemarks field
eventRemarks field
Do you have a sense of which of those is preferable?
We can use "WBTS_CFIN_2004_2017" as the short name for this dataset.
I've also attached the cruise report for the dataset here, let me know if this format is workable, or if you need something else!
Great! Let's put it all in occurrenceRemarks.
For the metadata:
samplingProtocol = "Mesh net cast; Mitchell et al. 2002 https://publications.gc.ca/collections/collection_2007/dfo-mpo/Fs97-18-223-2002E.pdf"?
Thanks Abby! I reached out to Jeff Runge for some clarity on the contact info, license, and preferred citation. I also moved all the comments into the occurrenceRemarks field, and updated the samplingProtocol value.
While reviewing I noticed that there were some duplicate occurrenceIDs due to the way the script was generating them. I've corrected the issue so each occurrence now has a unique ID.
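The thread doesn't show how the script generates IDs, so the following is only one possible approach, not the fix that was actually applied: derive each occurrenceID deterministically from the eventID plus the count column it came from, then check for duplicates before writing the file. The namespace URL and both function names are hypothetical.

```python
import uuid
from collections import Counter

# Hypothetical namespace; any stable URL-like string would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/wbts")

def make_occurrence_id(event_id, column):
    """Derive a stable UUID from the event and the source count column."""
    return str(uuid.uuid5(NAMESPACE, f"{event_id}:{column}"))

def find_duplicates(ids):
    """Return the set of IDs that appear more than once."""
    return {i for i, n in Counter(ids).items() if n > 1}

ids = [make_occurrence_id("event1", col) for col in ["N", "CI", "F", "M"]]
```

Deterministic IDs have the side benefit that re-running the script produces the same occurrenceIDs, whereas random UUIDs would change on every run.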
Changes can be seen in this PR #104.
I'll update here once I hear back from Jeff!
After reviewing the newest files:
measurementType needs to be unique for each measurement type. I see repeats of sampling protocol, weight, sampling equipment but with different measurementTypeIDs. I created a file in Notepad++ to show what I think these should be:
emof_measurementTypes_and_IDs.txt
Thanks Abby, sorry for my misunderstanding about the measurementType field! Looking at your file I've updated the mappings, let me know if this looks correct to you:
Origin Term | measurementTypeID | measurementType |
---|---|---|
Net_Type | http://vocab.nerc.ac.uk/collection/L05/current/22/ | net type |
Mesh_Size | http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/ | mesh size |
Plankton_Net_Area | http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/ | plankton net area |
Volume_Filtered | http://vocab.nerc.ac.uk/collection/P25/current/VOL/ | volume filtered |
Dilution_Factor | | dilution factor |
Sample_Split | | sample split |
TOTAL_DILFACTOR_CFIN | | total dilution factor CFIN |
NET_DEPTH | http://vocab.nerc.ac.uk/collection/P01/current/DXPHPRST/ | net depth |
Sample_Dry_Weight | http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/ | sample dry weight |
DW_G_M_2 | http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/ | biomass per area |
I checked the source data, and the eventID "GC120604WBWB-72" has no occurrences, so that's correct.
However, there's something strange going on with the occurrenceID: 2e4c7daa-abe4-4ba8-92fa-20b48371dbd6 as you mentioned. It seems to be out of order, so I'm going to investigate further.
Thanks for the clarification on the emof file - that concept makes sense (little by little!), and I'm just trying to think of how to achieve that separation programmatically. I'll open a new PR once I implement the changes and check back here.
Circling back here - I'm wondering how best to handle the event/occurrence split for the MoF file.
I think the thing that's tripping me up is that in the original dataset each row corresponds to a sampling event, and several (between 0 and 8) occurrences. During processing I'm expanding each original row into 8 new rows, each representing an occurrence.
Because of this, all of the data contained in the MoF file is identical for each expanded row - the only thing that changes is data related to an actual organism (sex, lifeStage, individualCount, occurrenceStatus). So it seems to me that everything in the MoF should actually be tied to the eventID, because in this case none of the MoF fields are unique to an occurrenceID.
Does that make sense at all? I'd also be available for a quick call tomorrow to discuss!
Here's a diagram of how each input row becomes the 8 output rows:
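In code terms, the same expansion might look roughly like the sketch below. The column names are the life stage and sex columns mentioned in this thread (N, CI, CII, CIII, CIV, CV, F, M); the row layout, the `expand_row` helper, and the `eventID_column` occurrenceID pattern are assumptions for illustration only.

```python
# Count columns in the source data: six life stages and two sexes.
STAGE_COLUMNS = ["N", "CI", "CII", "CIII", "CIV", "CV"]
SEX_COLUMNS = ["F", "M"]

def expand_row(row):
    """Turn one sampling-event row into one occurrence row per count column."""
    occurrences = []
    for col in STAGE_COLUMNS + SEX_COLUMNS:
        occurrences.append({
            "eventID": row["eventID"],  # shared by all 8 rows
            "occurrenceID": f'{row["eventID"]}_{col}',  # hypothetical ID pattern
            "scientificName": "Calanus finmarchicus",
            "lifeStage": col if col in STAGE_COLUMNS else "",
            "sex": col if col in SEX_COLUMNS else "",
            "value": row[col],
        })
    return occurrences

# Made-up input row for illustration.
source_row = {"eventID": "event1", "N": 0, "CI": 5, "CII": 3, "CIII": 0,
              "CIV": 2, "CV": 1, "F": 4, "M": 0}
expanded = expand_row(source_row)
```

Since every expanded row carries the same eventID and nothing else from the sampling event differs between them, event-level measurements only need to be written once per eventID rather than once per occurrence.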
Ok I think I understand this better now and I think you are right. In this case all the eMoFs are event level eMoFs and data in the columns for the different life stages (N, CI, CII, CIII, CIV, CV) and sexes (F, M) is best in individualCount if it's truly a count of the number of individuals. If it's some other type of quantity then we should use organismQuantity and organismQuantityType. My only concern is about having "absences" for different life stages and sexes. Maybe it would be better to only have absences for when Calanus finmarchicus is never found (e.g. that one event GC120604WBWB-72)?
Great - I think switching to organismQuantity and organismQuantityType makes sense. Looking back at the cruise report, the counts for a given stage are defined as:
Number of stage CI per m2
So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?
As far as recording absences - there are some records which are blank (or NaN) and others that actually record "0" in the Calanus columns. I'm going to reach out to the PI for clarification, because my understanding is that those would actually be treated differently:
So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?
Yes that makes sense to me.
For the absences- yes good to get clarification from the PI for sure about that difference. However, I'm still wondering if it makes sense to have absences for males vs. females or different life stages. Note that the definition of occurrenceStatus is "A statement about the presence or absence of a Taxon at a Location." Given this definition I'm just not sure if it makes sense to have:
eventID | occurrenceID | scientificName | sex | lifeStage | occurrenceStatus |
---|---|---|---|---|---|
event1 | event1_occ1 | Calanus finmarchicus | M | | present |
event1 | event1_occ2 | Calanus finmarchicus | F | | absent |
event1 | event1_occ3 | Calanus finmarchicus | | nauplii | absent |
event1 | event1_occ3 | Calanus finmarchicus | | copepodite | present |
event1 | event1_occ3 | Calanus finmarchicus | | adult | absent |
In the example I have above, Calanus finmarchicus (taxon) is present at the location (event1); it's just that not all sexes or life stages are present. I would not include the sex / life stage "absences", so it would look like this:
eventID | occurrenceID | scientificName | sex | lifeStage | occurrenceStatus |
---|---|---|---|---|---|
event1 | event1_occ1 | Calanus finmarchicus | M | | present |
event1 | event1_occ3 | Calanus finmarchicus | | copepodite | present |
The time when I would include a row for absent is when absolutely no Calanus finmarchicus are found. But I'm curious what others think about this so I will put this question over into the Slack for more discussion.
That makes a lot of sense, and I think you're right about the absences. The fact that the definition specifically mentions Taxon definitely helps clarify that in my mind.
For the moment I've gone ahead and updated the script to ignore "missing" sex and life stage records, and the output now looks like your example above. I'll keep an eye on the discussion in Slack and can amend this if need be.
I also verified that blank records mean that an organism was not counted. From the cruise report:
Table cell showing NaN indicates not counted.
so those records are now ignored.
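A minimal sketch of that filtering, assuming the counts arrive as floats with NaN for "not counted". The `to_occurrences` helper and the ID pattern are hypothetical, and treating a recorded 0 the same as a missing sex or life stage is my reading of the discussion above rather than something the cruise report states.

```python
import math

STAGE_COLUMNS = ["N", "CI", "CII", "CIII", "CIV", "CV"]
SEX_COLUMNS = ["F", "M"]

def to_occurrences(row):
    """Emit one 'present' occurrence per counted, non-zero stage or sex column."""
    out = []
    for col in STAGE_COLUMNS + SEX_COLUMNS:
        value = row.get(col)
        if value is None or math.isnan(value):
            continue  # NaN in the source table means "not counted"
        if value == 0:
            continue  # assumption: no stage- or sex-level absence rows
        out.append({
            "eventID": row["eventID"],
            "occurrenceID": f'{row["eventID"]}_{col}',  # hypothetical ID pattern
            "scientificName": "Calanus finmarchicus",
            "lifeStage": col if col in STAGE_COLUMNS else "",
            "sex": col if col in SEX_COLUMNS else "",
            "organismQuantity": value,
            "organismQuantityType": "individuals per m2",
            "occurrenceStatus": "present",
        })
    return out

nan = float("nan")
# Made-up values for illustration.
sample = {"eventID": "event1", "N": nan, "CI": 12.5, "CII": 0.0, "CIII": nan,
          "CIV": nan, "CV": 3.0, "F": nan, "M": 1.0}
records = to_occurrences(sample)
```

With this approach an event whose columns are all NaN or 0 yields no occurrence rows at all, so the number of events without occurrences is worth re-checking after any change to the filter.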
I've opened a new PR with those changes and some additional corrections: #105
The PI also responded to my earlier questions about metadata:
He was unsure about the license question, so I'm following up with a few other people on that. Is this in reference to the source data's existing license, or the license which will be applied to the DwC files?
Does this page help with the licenses question? It's the license that will be applied to the DwC files. Note that they must select one of three licenses or the data cannot be published to OBIS and GBIF: CC-0, CC-BY, CC-BY-NC.
Hi @albenson-usgs - just wanted to circle back here! I heard back from the PI, and we'd like to use CC-BY for the license.
Does that mean we're a go for publishing? Should I load what's here into the OBIS-USA IPT and publish to OBIS and GBIF?
Sure thing - that sounds good to me!
Dylan I'm just doing a quick recheck before I publish. Previously there was only one event with no occurrences but now there are 88 events with no occurrences- is that accurate?
Also the negative sign is missing from 13 of the longitude values.
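Both of these checks are easy to script before each publish. The sketch below is one way to do it, with hypothetical helper names and made-up sample rows; the longitude check assumes every station lies in the western hemisphere, so any positive decimalLongitude is suspect.

```python
def events_without_occurrences(event_ids, occurrence_event_ids):
    """Return eventIDs that never appear in the occurrence file."""
    seen = set(occurrence_event_ids)
    return [e for e in event_ids if e not in seen]

def positive_longitudes(events):
    """Return eventIDs whose longitude is positive (assumed to be a missing sign)."""
    return [e["eventID"] for e in events if float(e["decimalLongitude"]) > 0]

# Made-up rows for illustration.
sample_events = [
    {"eventID": "event1", "decimalLongitude": "-69.86"},
    {"eventID": "event2", "decimalLongitude": "69.86"},
    {"eventID": "event3", "decimalLongitude": "-69.90"},
]
sample_occurrence_event_ids = ["event1", "event1", "event3"]

empty = events_without_occurrences(
    [e["eventID"] for e in sample_events], sample_occurrence_event_ids
)
flagged = positive_longitudes(sample_events)
```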
Thanks Abby - I'm correcting the longitude values in the source data now, and will also verify the missing occurrences.
Hi @albenson-usgs - sorry for the super long delay in getting back to you. I've corrected the issue with the missing negative signs, and confirmed that there are 88 events with no occurrences - so the DwC files should be correct.
I've opened a PR with the changes here: https://github.com/ioos/bio_data_guide/pull/108
Hopefully we'll be all set once it's merged in!
I went ahead and merged it in.
@Dylan-Pugh is this a one off and will never be updated or will this be updated with more observations in the future? I'm trying to decide if I should include the dates in the title of the resource (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON 2004-2017) or not (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON)
This should be a one off! I don't think any new data will be added in the future.
Thanks Abby & Matt!
Thank you! Teamwork!
Here's the dataset in OBIS https://obis.org/dataset/5ef55cd8-05a1-4569-8e17-ceb224e40f59 :-)
I'm creating this issue to track the OBIS submission process for the WBTS Calanus dataset. I've opened a PR which contains the conversion script I used, as well as the three output files: #101.
Tagging @albenson-usgs here for help/guidance on using the IPT!
Please let me know if you have any questions, or see any issues.