Tracking OBIS Submission for WBTS Calanus Data

Dylan-Pugh commented 2 years ago

I'm creating this issue to track the OBIS submission process for the WBTS Calanus dataset. I've opened a PR which contains the conversion script I used, as well as the three output files: #101.

Tagging @albenson-usgs here for help/guidance on using the IPT!

Please let me know if you have any questions, or see any issues.

MathewBiddle commented 2 years ago

@albenson-usgs, you can find the final Darwin Core files at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON/data/processed

A description of the process is at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON

albenson-usgs commented 2 years ago

Thanks Dylan! A few things to change / add before we can load this in the IPT.

[ ] The event file has the units row that comes from accessing via ERDDAP. This needs to be removed as I can't have an eventDate = "UTC" or decimalLatitude = "degrees_north". You can get a version from ERDDAP without this row using the csvp file type I believe.
[ ] For geodeticDatum GBIF wants either "WGS84" or "EPSG:4326" but not both. This is my fault and a legacy of how I was taught by my predecessor and I only recently learned that GBIF flags "EPSG:4326 WGS84". Sorry about that but since we are updating anyway it's worth editing.
[ ] occurrenceID needs to be unique for each row in the occurrence file so in your data there needs to be 1440 unique occurrenceIDs. Right now it's just a single occurrenceID for all rows.
[ ] There is a column Sample_Split in the measurement or fact file and that will currently be dropped if loaded into the IPT as it stands now. I'm not sure what this represents. It needs to be worked in as another measurement/fact.
[ ] There is no data in measurementType. I realize the measurementTypeID will provide the measurement type but it would be good to have a human readable version as well. Especially because it helps me in reviewing the data.
[ ] occurrenceID is missing for all rows in the measurement or fact file but I imagine some of these are measurements of the occurrence like the ones with units grams, grams/m2, or microns.
[ ] No need to include columns with no data in them like measurementAccuracy
[ ] The data in measurementRemarks seems like some of it might be better in the event or occurrence file. For instance "Comments: 74 total pteropods in light box" seems like it would be better in occurrenceRemarks and "Comments: GOMCES station bt0401" seems like it would be better for eventRemarks. No need to include a note that you couldn't find something in NERC. You can ask for a sampling device to be added here.
[ ] Where can I find the metadata for this? Also can you provide a dataset shortname that should be used in the IPT? This will be part of the URL that is created for the dataset and it will be the label applied to the DwC-A when someone downloads it from the IPT.
[ ] The measurementMethod should be unique to each measurementType
Some of the columns that are not following Darwin Core will be dropped (e.g. acceptedname, Station_Depth, etc) but as long as that is what you were expecting would happen that's fine to leave them in.
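
A minimal sketch of the first two items (units row and geodeticDatum), assuming the event file is read with Python's standard `csv` module. The helper name `clean_event_rows` and the sample values are hypothetical; the conversion script itself is not shown in this thread.

```python
import csv
import io

# Hypothetical sample of an ERDDAP .csv export: the second line is the units row.
RAW = """eventID,eventDate,decimalLatitude,geodeticDatum
,UTC,degrees_north,
evt-1,2005-06-01T12:00:00Z,42.86,EPSG:4326 WGS84
"""

def clean_event_rows(text):
    """Drop the ERDDAP units row and keep a single geodeticDatum value."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Assumption: the units row is the first data row and has eventDate == "UTC".
    if rows and rows[0].get("eventDate") == "UTC":
        rows = rows[1:]
    for row in rows:
        # Keep only one of the two accepted spellings.
        if row.get("geodeticDatum") == "EPSG:4326 WGS84":
            row["geodeticDatum"] = "WGS84"
    return rows

cleaned = clean_event_rows(RAW)
```

Requesting the csvp file type from ERDDAP, as suggested above, would make the units-row step unnecessary.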

Dylan-Pugh commented 2 years ago

Thanks @albenson-usgs! That's very helpful - I'll start working through these today.

[x] Drop units row from dataset
[x] Update geodeticDatum
[x] Generate unique occurrenceID
[x] Integrate Sample_Split
[x] Add data to measurementType column
[x] Drop columns with no data
[x] The measurementMethod should be unique to each measurementType
[x] Move measurementRemarks data to event or occurrence file
[x] occurrenceID is missing for all rows in the measurement or fact file
[x] Provide metadata & dataset short name

Dylan-Pugh commented 2 years ago

I've made a number of updates here, and opened a new PR #103 with the updated script & output files.

A quick question about the measurementRemarks field:

The data in measurementRemarks seems like some of it might be better in the event or occurrence file. For instance "Comments: 74 total pteropods in light box" seems like it would be better in occurrenceRemarks and "Comments: GOMCES station bt0401" seems like it would be better for eventRemarks.

Currently the data in this field comes from the COMMENTS field in the dataset. These comments can be anything, so I don't see a way to programmatically parse them into either occurrenceRemarks or eventRemarks based on the content of a given comment. I'm thinking the three options would be:

Put all comments into the occurrenceRemarks field
Put all comments into the eventRemarks field
Omit this data

Do you have a sense of which of those is preferable?

Metadata

We can use "WBTS_CFIN_2004_2017" as the short name for this dataset.

I've also attached the cruise report for the dataset here, let me know if this format is workable, or if you need something else!

GoM_WBTS_CruiseReport_WBTS_DMAC version_8SEP21.docx

albenson-usgs commented 2 years ago

Great! Let's put it all in occurrenceRemarks.

For the metadata:

Should I have Jeff Runge as the only Resource Contact and put all Co-PIs and Mesozooplankton collection and enumeration as Resource Creators? Should I put you as the Metadata Creator? I will add you as an associated party as the processor so folks know who aligned the data to Darwin Core.
Should we add the sampling protocol into the event file? I know we have "Mesh net cast" in there right now and I think it would be good to keep that but a specific protocol is referenced "Atlantic Zone Monitoring Program (AZMP) established by Fisheries and Oceans Canada (Mitchell et al. 2002)" https://publications.gc.ca/collections/collection_2007/dfo-mpo/Fs97-18-223-2002E.pdf which might be helpful to include. Wish there was a DOI. I tried looking for it in OBPS but doesn't look like it's in there yet. Maybe amend samplingProtocol = "Mesh net cast; Mitchell et al. 2002 https://publications.gc.ca/collections/collection_2007/dfo-mpo/Fs97-18-223-2002E.pdf"?
What is the license for the data? CC-0?
Is there a preferred citation? It's ok if not, I can autogenerate one.

Dylan-Pugh commented 2 years ago

Thanks Abby! I reached out to Jeff Runge for some clarity on the contact info, license, and preferred citation. I also moved all the comments into the occurrenceRemarks field, and updated the samplingProtocol value.

While reviewing I noticed that there were some duplicate occurrenceIDs due to the way the script was generating them. I've corrected the issue so each occurrence now has a unique ID.
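
One way to guard against this kind of duplication is sketched below. The thread does not say how the script builds its IDs, so this swaps in a different technique: deterministic `uuid.uuid5` values derived from the event and the sex or life-stage column. The namespace URL and both function names are hypothetical.

```python
import uuid
from collections import Counter

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/wbts-calanus")

def make_occurrence_id(event_id, category):
    """Derive a repeatable ID from the event and the sex or life-stage column."""
    return str(uuid.uuid5(NAMESPACE, f"{event_id}|{category}"))

def find_duplicates(ids):
    """Return the IDs that appear more than once."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)

# Made-up event ID and categories, for illustration only.
ids = [make_occurrence_id("evt-1", c) for c in ("CI", "CII", "F", "M")]
duplicates = find_duplicates(ids)
```

Running `find_duplicates` over the full occurrence file before opening a PR would catch a repeat of the problem described above.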

Changes can be seen in this PR #104.

I'll update here once I hear back from Jeff!

albenson-usgs commented 2 years ago

After reviewing the newest files-

This eventID "GC120604WBWB-72" has no associated occurrences. Just want to make sure that's correct.
There seems to be one occurrence that doesn't have any measurements "2e4c7daa-abe4-4ba8-92fa-20b48371dbd6"
The nice thing about the emof is that you can have measurements that link only to the events, so you don't have to repeat information for each occurrence. For instance, for sampling equipment (http://vocab.nerc.ac.uk/collection/L05/current/22/, ".75DRing") you don't need to repeat that information for each occurrence but only for the events, so it would be in the emof file 178 times (number of events) and you would only have eventID and no occurrenceID. Happy to hop on a call if it would work better over the phone to discuss.
measurementType needs to be unique for each measurement type. I see repeats of sampling protocol, weight, sampling equipment but with different measurementTypeIDs. I created a file in Notepad++ to show what I think these should be: emof_measurementTypes_and_IDs.txt
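
A rough sketch of how the last point could be checked automatically, assuming the emof rows are available as dictionaries. The function name and the placeholder IDs are hypothetical and are not real vocabulary terms.

```python
from collections import defaultdict

def inconsistent_types(rows):
    """Map each measurementType that carries more than one measurementTypeID to its IDs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["measurementType"]].add(row["measurementTypeID"])
    return {t: sorted(ids) for t, ids in seen.items() if len(ids) > 1}

# Hypothetical rows; the IDs are placeholders, not real vocabulary terms.
rows = [
    {"measurementType": "weight", "measurementTypeID": "id-A"},
    {"measurementType": "weight", "measurementTypeID": "id-B"},
    {"measurementType": "mesh size", "measurementTypeID": "id-C"},
]
problems = inconsistent_types(rows)
```

An empty result would mean every measurementType is paired with exactly one measurementTypeID.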

Dylan-Pugh commented 2 years ago

Thanks Abby, sorry for my misunderstanding about the measurementType field! Looking at your file I've updated the mappings, let me know if this looks correct to you:

Origin Term	measurementTypeID	measurementType
Net_Type	http://vocab.nerc.ac.uk/collection/L05/current/22/	net type
Mesh_Size	http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/	mesh size
Plankton_Net_Area	http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/	plankton net area
Volume_Filtered	http://vocab.nerc.ac.uk/collection/P25/current/VOL/	volume filtered
Dilution_Factor		dilution factor
Sample_Split		sample split
TOTAL_DILFACTOR_CFIN		total dilution factor CFIN
NET_DEPTH	http://vocab.nerc.ac.uk/collection/P01/current/DXPHPRST/	net depth
Sample_Dry_Weight	http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/	sample dry weight
DW_G_M_2	http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/	biomass per area

I checked the source data, and the eventID "GC120604WBWB-72" has no occurrences, so that's correct.

However, there's something strange going on with the occurrenceID: 2e4c7daa-abe4-4ba8-92fa-20b48371dbd6 as you mentioned. It seems to be out of order, so I'm going to investigate further.

Thanks for the clarification on the emof file - that concept makes sense (little by little!), and I'm just trying to think of how to achieve that separation programmatically. I'll open a new PR once I implement the changes and check back here.

Dylan-Pugh commented 2 years ago

Circling back here - I'm wondering how best to handle the event/occurrence split for the MoF file.

I think the thing that's tripping me up is that in the original dataset each row corresponds to a sampling event, and several (between 0 and 8) occurrences. During processing I'm expanding each original row into 8 new rows, each representing an occurrence.

Because of this, all of the data contained in the MoF file is identical for each expanded row - the only thing that changes is data related to an actual organism (sex, lifestage, individualCount, occurrenceStatus). So it seems to me that everything in the MoF should actually be tied to the eventID, because in this case none of the MoF fields are unique to an occurrenceID.
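
The event-level idea can be sketched roughly as follows, assuming each source row is a dictionary. Only two entries from the mapping posted earlier in this thread are used; the function name and the sample values are hypothetical.

```python
# Subset of the origin-term mapping posted earlier in this thread.
MAPPING = {
    "Mesh_Size": ("mesh size", "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/"),
    "Volume_Filtered": ("volume filtered", "http://vocab.nerc.ac.uk/collection/P25/current/VOL/"),
}

def event_level_emof(source_row):
    """Build one eMoF row per mapped column, linked to the event only."""
    out = []
    for column, (mtype, mtype_id) in MAPPING.items():
        value = source_row.get(column, "")
        if value == "":
            continue  # nothing recorded for this measurement
        out.append({
            "eventID": source_row["eventID"],
            "occurrenceID": "",  # left empty: the measurement describes the event
            "measurementType": mtype,
            "measurementTypeID": mtype_id,
            "measurementValue": value,
        })
    return out

# Hypothetical source row with made-up values.
emof = event_level_emof({"eventID": "evt-1", "Mesh_Size": "200", "Volume_Filtered": ""})
```

Because the rows are built once per source row rather than once per expanded occurrence, each measurement appears a single time per event.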

Does that make sense at all? I'd also be available for a quick call tomorrow to discuss!

Here's a diagram of how each input row becomes the 8 output rows:

[Image: Screen Shot 2022-04-04 at 11 03 05]

albenson-usgs commented 2 years ago

Ok I think I understand this better now and I think you are right. In this case all the eMoFs are event level eMoFs and data in the columns for the different life stages (N, CI, CII, CIII, CIV, CV) and sexes (F, M) is best in individualCount if it's truly a count of the number of individuals- if it's some other type of quantity then we should use organismQuantity and organismQuantityType. My only concern is about having "absences" for different life stages and sexes. Maybe it would be better to only have absences for when Calanus finmarchicus is never found (e.g. that one event GC120604WBWB-72)?

Dylan-Pugh commented 2 years ago

Great - I think switching to organismQuantity and organismQuantityType makes sense. Looking back at the cruise report, the counts for a given stage are defined as:

Number of stage CI per m2

So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?

As far as recording absences - there are some records which are blank (or NaN) and others that actually record "0" in the Calanus columns. I'm going to reach out to the PI for clarification, because my understanding is that those would actually be treated differently:

0 -> looked for the organism, but it wasn't there
blank -> some kind of error/anomaly in the reporting?

albenson-usgs commented 2 years ago

So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?

Yes that makes sense to me.

For the absences- yes good to get clarification from the PI for sure about that difference. However, I'm still wondering if it makes sense to have absences for males vs. females or different life stages. Note that the definition of `occurrenceStatus` is "A statement about the presence or absence of a Taxon at a Location." Given this definition I'm just not sure if it makes sense to have:

eventID	occurrenceID	scientificName	sex	lifeStage	occurrenceStatus
event1	event1_occ1	Calanus finmarchicus	M		present
event1	event1_occ2	Calanus finmarchicus	F		absent
event1	event1_occ3	Calanus finmarchicus		nauplii	absent
event1	event1_occ3	Calanus finmarchicus		copepodite	present
event1	event1_occ3	Calanus finmarchicus		adult	absent

In the example I have above Calanus finmarchicus (taxon) is present at the location (event1) it's just that not all sexes or life stages are present. I would not include the sex / life stages "absences" so it would look like this:

eventID	occurrenceID	scientificName	sex	lifeStage	occurrenceStatus
event1	event1_occ1	Calanus finmarchicus	M		present
event1	event1_occ3	Calanus finmarchicus		copepodite	present

The time when I would include a row for absent is when absolutely no Calanus finmarchicus are found. But I'm curious what others think about this so I will put this question over into the Slack for more discussion.

Dylan-Pugh commented 2 years ago

That makes a lot of sense, and I think you're right about the absences. The fact that the definition specifically mentions Taxon definitely helps clarify that in my mind.

For the moment I've gone ahead and updated the script to ignore "missing" sex and life stage records, and the output now looks like your example above. I'll keep an eye on the discussion in Slack and can amend this if need be.

I also verified that blank records mean that an organism was not counted. From the cruise report:

Table cell showing NaN indicates not counted.

so those records are now ignored.
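
A rough sketch of that expansion, not the actual script. It assumes the source columns are named after the life stages (N, CI, CII, CIII, CIV, CV) and sexes (F, M) mentioned above, and it also assumes a recorded 0 is skipped in the same way as a blank or NaN cell. The function names and sample values are hypothetical.

```python
import math

STAGE_COLUMNS = ["N", "CI", "CII", "CIII", "CIV", "CV"]  # life stages
SEX_COLUMNS = ["F", "M"]  # sexes

def parse_count(raw):
    """Return a float, or None when the cell is blank or NaN (not counted)."""
    if raw is None or str(raw).strip() == "":
        return None
    value = float(raw)
    return None if math.isnan(value) else value

def occurrences_for(row):
    """Expand one wide source row into occurrence records for present categories."""
    records = []
    for column in STAGE_COLUMNS + SEX_COLUMNS:
        quantity = parse_count(row.get(column))
        if quantity is None or quantity == 0:
            continue  # assumption: not counted, or counted as zero, writes no row
        records.append({
            "eventID": row["eventID"],
            "scientificName": "Calanus finmarchicus",
            "lifeStage": column if column in STAGE_COLUMNS else "",
            "sex": column if column in SEX_COLUMNS else "",
            "organismQuantity": quantity,
            "organismQuantityType": "individuals per m2",
            "occurrenceStatus": "present",
        })
    return records

# Hypothetical row with made-up values.
sample = {"eventID": "evt-1", "N": "NaN", "CI": "0", "CII": "12.5", "F": "", "M": "3"}
records = occurrences_for(sample)
```

This sketch never writes an "absent" row; whether an event with no Calanus finmarchicus at all should get one is the question still open in Slack.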

I've opened a new PR with those changes and some additional corrections: #105

Metadata

The PI also responded to my earlier questions about metadata:

Resource Contacts: Jeffrey Runge, Lee Karp Boss
Metadata Creator: Dylan Pugh
Citation: no preference, feel free to generate

He was unsure about the license question, so I'm following up with a few other people on that. Is this in reference to the source data's existing license, or the license which will be applied to the DwC files?

albenson-usgs commented 2 years ago

Does this page help with the licenses question? It's the license that will be applied to the DwC files. Note that they must select one of three licenses or the data cannot be published to OBIS and GBIF: CC-0, CC-BY, CC-BY-NC.

Dylan-Pugh commented 2 years ago

Hi @albenson-usgs - just wanted to circle back here! I heard back from the PI, and we'd like to use CC-BY for the license.

albenson-usgs commented 2 years ago

Does that mean we're a go for publishing? Should I load what's here into the OBIS-USA IPT and publish to OBIS and GBIF?

Dylan-Pugh commented 2 years ago

Sure thing - that sounds good to me!

albenson-usgs commented 2 years ago

Dylan I'm just doing a quick recheck before I publish. Previously there was only one event with no occurrences but now there are 88 events with no occurrences- is that accurate?

Also the negative sign is missing from 13 of the longitude values.
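
Both checks could be scripted along these lines. This is only a sketch: it assumes every station lies in the western hemisphere, so any positive decimalLongitude points to a dropped sign, and the function names and coordinates are made up.

```python
def events_without_occurrences(event_ids, occurrence_rows):
    """Return eventIDs that no occurrence row refers to."""
    used = {row["eventID"] for row in occurrence_rows}
    return [e for e in event_ids if e not in used]

def positive_longitudes(event_rows):
    """Return eventIDs whose decimalLongitude is greater than zero."""
    return [r["eventID"] for r in event_rows if float(r["decimalLongitude"]) > 0]

# Hypothetical rows with made-up coordinates.
events = [
    {"eventID": "evt-1", "decimalLongitude": "-69.86"},
    {"eventID": "evt-2", "decimalLongitude": "69.86"},
]
occurrences = [{"eventID": "evt-1"}]
empty_events = events_without_occurrences([e["eventID"] for e in events], occurrences)
suspect = positive_longitudes(events)
```

The first list only reports events with no occurrences; deciding whether that is expected still needs a look at the source data.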

Dylan-Pugh commented 2 years ago

Thanks Abby - I'm correcting the longitude values in the source data now, and will also verify the missing occurrences.

Dylan-Pugh commented 2 years ago

Hi @albenson-usgs - sorry for the super long delay in getting back to you. I've corrected the issue with the missing negative signs, and confirmed that there are 88 events with no occurrences - so the DwC files should be correct.

I've opened a PR with the changes here: https://github.com/ioos/bio_data_guide/pull/108

Hopefully we'll be all set once it's merged in!

MathewBiddle commented 2 years ago

I went ahead and merged it in.

albenson-usgs commented 2 years ago

@Dylan-Pugh is this a one off and will never be updated or will this be updated with more observations in the future? I'm trying to decide if I should include the dates in the title of the resource (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON 2004-2017) or not (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON)

Dylan-Pugh commented 2 years ago

This should be a one off! I don't think any new data will be added in the future.

albenson-usgs commented 2 years ago

Published! https://www1.usgs.gov/obis-usa/ipt/resource?r=gom_wbts_mesozooplankton

Dylan-Pugh commented 2 years ago

Thanks Abby & Matt!

albenson-usgs commented 2 years ago

Thank you! Teamwork!

albenson-usgs commented 2 years ago

Here's the dataset in OBIS https://obis.org/dataset/5ef55cd8-05a1-4569-8e17-ceb224e40f59 :-)

MathewBiddle commented 2 years ago

And GBIF - https://www.gbif.org/dataset/29651377-23c8-4f45-b439-693a1a23cee1!

ioos / bio_data_guide

Tracking OBIS Submission for WBTS Calanus Data #102
