ioos / bio_data_guide

Standardizing Marine Biological Data Working Group - An open community to facilitate the mobilization of biological data to OBIS.
https://ioos.github.io/bio_data_guide/
MIT License

Tracking OBIS Submission for WBTS Calanus Data #102

Closed Dylan-Pugh closed 2 years ago

Dylan-Pugh commented 2 years ago

I'm creating this issue to track the OBIS submission process for the WBTS Calanus dataset. I've opened a PR which contains the conversion script I used, as well as the three output files: #101.

Tagging @albenson-usgs here for help/guidance on using the IPT!

Please let me know if you have any questions, or see any issues.

MathewBiddle commented 2 years ago

@albenson-usgs, you can find the final Darwin Core files at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON/data/processed

A description of the process is at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON

albenson-usgs commented 2 years ago

Thanks Dylan! A few things to change / add before we can load this in the IPT.

Dylan-Pugh commented 2 years ago

Thanks @albenson-usgs! That's very helpful - I'll start working through these today.

Dylan-Pugh commented 2 years ago

I've made a number of updates here, and opened a new PR #103 with the updated script & output files.

A quick question about the measurementRemarks field:

> The data in measurementRemarks seems like some of it might be better in the event or occurrence file. For instance "Comments: 74 total pteropods in light box" seems like it would be better in occurrenceRemarks and "Comments: GOMCES station bt0401" seems like it would be better for eventRemarks.

Currently the data in this field comes from the COMMENTS field in the dataset. These comments can be anything, so I don't see a way to programmatically parse them into either occurrenceRemarks or eventRemarks based on the content of a given comment. I'm thinking the three options would be:

  1. Put all comments into the occurrenceRemarks field
  2. Put all comments into the eventRemarks field
  3. Omit this data

Do you have a sense of which of those is preferable?

Metadata

We can use "WBTS_CFIN_2004_2017" as the short name for this dataset.

I've also attached the cruise report for the dataset here, let me know if this format is workable, or if you need something else!

albenson-usgs commented 2 years ago

Great! Let's put it all in occurrenceRemarks.

For the metadata:

  1. Should I have Jeff Runge as the only Resource Contact and put all Co-PIs and Mesozooplankton collection and enumeration as Resource Creators? Should I put you as the Metadata Creator? I will add you as an associated party as the processor so folks know who aligned the data to Darwin Core.
  2. Should we add the sampling protocol into the event file? I know we have "Mesh net cast" in there right now and I think it would be good to keep that but a specific protocol is referenced "Atlantic Zone Monitoring Program (AZMP) established by Fisheries and Oceans Canada (Mitchell et al. 2002)" https://publications.gc.ca/collections/collection_2007/dfo-mpo/Fs97-18-223-2002E.pdf which might be helpful to include. Wish there was a DOI. I tried looking for it in OBPS but doesn't look like it's in there yet. Maybe amend samplingProtocol = "Mesh net cast; Mitchell et al. 2002 https://publications.gc.ca/collections/collection_2007/dfo-mpo/Fs97-18-223-2002E.pdf"?
  3. What is the license for the data? CC-0?
  4. Is there a preferred citation? It's ok if not, I can autogenerate one.

Dylan-Pugh commented 2 years ago

Thanks Abby! I reached out to Jeff Runge for some clarity on the contact info, license, and preferred citation. I also moved all the comments into the occurrenceRemarks field, and updated the samplingProtocol value.

While reviewing I noticed that there were some duplicate occurrenceIDs due to the way the script was generating them. I've corrected the issue so each occurrence now has a unique ID.
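One way to guard against this kind of duplication is to derive each ID from the fields that distinguish one occurrence from another, then check for repeats before writing the file. The sketch below is not the script from the PR: the `make_occurrence_id` helper, the choice of eventID plus life stage and sex as the key, and the sample rows are all assumptions for illustration.

```python
import uuid
from collections import Counter


def make_occurrence_id(event_id: str, life_stage: str, sex: str) -> str:
    """Hypothetical helper: build a deterministic UUID from the fields
    that (by assumption) identify a single occurrence."""
    key = f"{event_id}|{life_stage}|{sex}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def find_duplicates(ids):
    """Return the IDs that appear more than once, sorted."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


# Made-up occurrence rows for a single event.
occurrences = [
    {"eventID": "event1", "lifeStage": "CI", "sex": ""},
    {"eventID": "event1", "lifeStage": "adult", "sex": "F"},
    {"eventID": "event1", "lifeStage": "adult", "sex": "M"},
]
for row in occurrences:
    row["occurrenceID"] = make_occurrence_id(
        row["eventID"], row["lifeStage"], row["sex"]
    )

# Fail loudly if any occurrenceID is repeated.
duplicates = find_duplicates(row["occurrenceID"] for row in occurrences)
assert not duplicates, f"duplicate occurrenceIDs: {duplicates}"
```

Because the IDs are derived from the row content rather than generated randomly, rerunning the conversion yields the same occurrenceIDs, which may or may not be what a given dataset needs.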

Changes can be seen in this PR #104.

I'll update here once I hear back from Jeff!

albenson-usgs commented 2 years ago

After reviewing the newest files-

  1. This eventID "GC120604WBWB-72" has no associated occurrences. Just want to make sure that's correct.
  2. There seems to be one occurrence that doesn't have any measurements "2e4c7daa-abe4-4ba8-92fa-20b48371dbd6"
  3. The nice thing about the emof is that measurements can link only to events, so you don't have to repeat information for each occurrence. For instance, sampling equipment (http://vocab.nerc.ac.uk/collection/L05/current/22/, value .75DRing) doesn't need to be repeated for each occurrence, only for each event: it would be in the emof file 178 times (the number of events), with an eventID and no occurrenceID. Happy to hop on a call if it would work better to discuss over the phone.
  4. measurementType needs to be unique for each measurement type. I see repeats of sampling protocol, weight, sampling equipment but with different measurementTypeIDs. I created a file in Notepad++ to show what I think these should be: emof_measurementTypes_and_IDs.txt

Dylan-Pugh commented 2 years ago

Thanks Abby, sorry for my misunderstanding about the measurementType field! Looking at your file I've updated the mappings, let me know if this looks correct to you:

| Origin Term | measurementTypeID | measurementType |
| --- | --- | --- |
| Net_Type | http://vocab.nerc.ac.uk/collection/L05/current/22/ | net type |
| Mesh_Size | http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/ | mesh size |
| Plankton_Net_Area | http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/ | plankton net area |
| Volume_Filtered | http://vocab.nerc.ac.uk/collection/P25/current/VOL/ | volume filtered |
| Dilution_Factor | | dilution factor |
| Sample_Split | | sample split |
| TOTAL_DILFACTOR_CFIN | | total dilution factor CFIN |
| NET_DEPTH | http://vocab.nerc.ac.uk/collection/P01/current/DXPHPRST/ | net depth |
| Sample_Dry_Weight | http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/ | sample dry weight |
| DW_G_M_2 | http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/ | biomass per area |

I checked the source data, and the eventID "GC120604WBWB-72" has no occurrences, so that's correct.

However, there's something strange going on with the occurrenceID: 2e4c7daa-abe4-4ba8-92fa-20b48371dbd6 as you mentioned. It seems to be out of order, so I'm going to investigate further.

Thanks for the clarification on the emof file - that concept makes sense (little by little!), and I'm just trying to think of how to achieve that separation programmatically. I'll open a new PR once I implement the changes and check back here.

Dylan-Pugh commented 2 years ago

Circling back here - I'm wondering how best to handle the event/occurrence split for the MoF file.

I think the thing that's tripping me up is that in the original dataset each row corresponds to a sampling event, and several (between 0 and 8) occurrences. During processing I'm expanding each original row into 8 new rows, each representing an occurrence.

Because of this, all of the data contained in the MoF file is identical for each expanded row - the only thing that changes is data related to an actual organism (sex, lifestage, individualCount, occurrenceStatus). So it seems to me that everything in the MoF should actually be tied to the eventID, because in this case none of the MoF fields are unique to an occurrenceID.
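To make that idea concrete, here is a rough sketch of building the MoF rows straight from the original one-row-per-event data, so each measurement is written once per event with the occurrenceID left empty. It is only an illustration, not the conversion script: the `build_event_emof` function and the sample values are made up, and only a few entries from the mapping table earlier in this thread are included.

```python
# Subset of the mapping discussed in this thread:
# source column -> (measurementType, measurementTypeID)
MEASUREMENTS = {
    "Net_Type": ("net type", "http://vocab.nerc.ac.uk/collection/L05/current/22/"),
    "Mesh_Size": ("mesh size", "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/"),
    "Volume_Filtered": ("volume filtered", "http://vocab.nerc.ac.uk/collection/P25/current/VOL/"),
    "Dilution_Factor": ("dilution factor", ""),
}


def build_event_emof(source_rows):
    """Hypothetical helper: emit one eMoF row per (event, measurement).

    Each source row is assumed to describe one sampling event and to
    carry an 'eventID' key. Blank measurements are skipped.
    """
    emof = []
    for row in source_rows:
        for column, (mtype, mtype_id) in MEASUREMENTS.items():
            value = row.get(column)
            if value is None or value == "":
                continue
            emof.append(
                {
                    "eventID": row["eventID"],
                    "occurrenceID": "",  # event-level: no occurrence link
                    "measurementType": mtype,
                    "measurementTypeID": mtype_id,
                    "measurementValue": str(value),
                }
            )
    return emof


# Made-up source row, for illustration only.
sample = [
    {
        "eventID": "event1",
        "Net_Type": ".75DRing",
        "Mesh_Size": 200,
        "Volume_Filtered": "",
        "Dilution_Factor": 4,
    }
]
emof_rows = build_event_emof(sample)
```

Working from the unexpanded rows avoids having to deduplicate the measurements after the 8-row expansion.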

Does that make sense at all? I'd also be available for a quick call tomorrow to discuss!

Here's a diagram of how each input row becomes the 8 output rows:

[Image: Screen Shot 2022-04-04 at 11 03 05]

albenson-usgs commented 2 years ago

OK, I think I understand this better now, and I think you are right. In this case all the eMoFs are event-level eMoFs, and the data in the columns for the different life stages (N, CI, CII, CIII, CIV, CV) and sexes (F, M) is best in individualCount if it's truly a count of the number of individuals. If it's some other type of quantity, then we should use organismQuantity and organismQuantityType. My only concern is about having "absences" for different life stages and sexes. Maybe it would be better to only have absences for when Calanus finmarchicus is never found (e.g. that one event GC120604WBWB-72)?

Dylan-Pugh commented 2 years ago

Great - I think switching to organismQuantity and organismQuantityType makes sense. Looking back at the cruise report, the counts for a given stage are defined as:

Number of stage CI per m2

So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?
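For concreteness, a single occurrence record under this proposal might be assembled roughly as below. This is a sketch only: the `occurrence_from_count` helper, the life stage and sex values, and the example number are made up, and the quantity type string is simply the wording suggested above.

```python
QUANTITY_TYPE = "individuals per m2"  # wording proposed in this comment


def occurrence_from_count(event_id, life_stage, sex, count_per_m2):
    """Hypothetical helper: one occurrence row for one stage/sex column."""
    return {
        "eventID": event_id,
        "scientificName": "Calanus finmarchicus",
        "lifeStage": life_stage,
        "sex": sex,
        # The value is an areal density, so it goes in organismQuantity
        # rather than individualCount.
        "organismQuantity": count_per_m2,
        "organismQuantityType": QUANTITY_TYPE,
    }


# Made-up value for stage CI at a made-up event.
record = occurrence_from_count("event1", "CI", "", 1250.0)
```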

As far as recording absences - there are some records which are blank (or NaN) and others that actually record "0" in the Calanus columns. I'm going to reach out to the PI for clarification, because my understanding is that those would actually be treated differently:

albenson-usgs commented 2 years ago

> So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?

Yes that makes sense to me.

For the absences: yes, good to get clarification from the PI for sure about that difference. However, I'm still wondering if it makes sense to have absences for males vs. females or different life stages. Note that the definition of occurrenceStatus is "A statement about the presence or absence of a Taxon at a Location." Given this definition I'm just not sure if it makes sense to have:

| eventID | occurrenceID | scientificName | sex | lifeStage | occurrenceStatus |
| --- | --- | --- | --- | --- | --- |
| event1 | event1_occ1 | Calanus finmarchicus | M | | present |
| event1 | event1_occ2 | Calanus finmarchicus | F | | absent |
| event1 | event1_occ3 | Calanus finmarchicus | | nauplii | absent |
| event1 | event1_occ3 | Calanus finmarchicus | | copepodite | present |
| event1 | event1_occ3 | Calanus finmarchicus | | adult | absent |

In the example I have above, Calanus finmarchicus (taxon) is present at the location (event1); it's just that not all sexes or life stages are present. I would not include the sex / life stage "absences", so it would look like this:

| eventID | occurrenceID | scientificName | sex | lifeStage | occurrenceStatus |
| --- | --- | --- | --- | --- | --- |
| event1 | event1_occ1 | Calanus finmarchicus | M | | present |
| event1 | event1_occ3 | Calanus finmarchicus | | copepodite | present |

The time when I would include a row for absent is when absolutely no Calanus finmarchicus are found. But I'm curious what others think about this so I will put this question over into the Slack for more discussion.
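As a rough sketch of that rule (a proposal still open for discussion, not settled practice): emit "present" rows only for the stages and sexes that were actually found, and emit a single taxon-level "absent" row only when everything counted for the event is zero. The `occurrences_for_event` function and the input shape are assumptions for illustration.

```python
TAXON = "Calanus finmarchicus"


def occurrences_for_event(event_id, counts):
    """Hypothetical helper.

    counts maps (lifeStage, sex) -> quantity for the categories that
    were actually counted at this event.
    """
    present = [
        {
            "eventID": event_id,
            "scientificName": TAXON,
            "lifeStage": stage,
            "sex": sex,
            "occurrenceStatus": "present",
        }
        for (stage, sex), quantity in counts.items()
        if quantity > 0
    ]
    if present:
        # Taxon was found: stage/sex-level zeros are simply left out.
        return present
    if counts:
        # Everything counted was zero: one absence row for the taxon.
        return [
            {
                "eventID": event_id,
                "scientificName": TAXON,
                "lifeStage": "",
                "sex": "",
                "occurrenceStatus": "absent",
            }
        ]
    # Nothing was counted at all: no statement either way.
    return []


mixed = occurrences_for_event("event1", {("adult", "M"): 3, ("adult", "F"): 0})
empty = occurrences_for_event("event2", {("adult", "M"): 0, ("adult", "F"): 0})
```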

Dylan-Pugh commented 2 years ago

That makes a lot of sense, and I think you're right about the absences. The fact that the definition specifically mentions Taxon definitely helps clarify that in my mind.

For the moment I've gone ahead and updated the script to ignore "missing" sex and life stage records, and the output now looks like your example above. I'll keep an eye on the discussion in Slack and can amend this if need be.

I also verified that blank records mean that an organism was not counted. From the cruise report:

Table cell showing NaN indicates not counted.

so those records are now ignored.
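The distinction between "not counted" and a recorded zero could be handled with something like the sketch below. The `parse_count` helper is hypothetical and is not taken from the script in the PR; it only illustrates treating blank or NaN cells as "not counted" while keeping an explicit 0 as a real value.

```python
import math


def parse_count(cell):
    """Hypothetical helper: return None when the cell means 'not counted'
    (None, blank, or NaN); otherwise return the value as a float."""
    if cell is None:
        return None
    text = str(cell).strip()
    if text == "" or text.lower() == "nan":
        return None
    value = float(text)
    return None if math.isnan(value) else value


# Made-up cells for one source row, keyed by made-up column names.
row = {"CI": "NaN", "CII": "0", "CIII": "12.5", "CIV": ""}
counted = {
    column: parse_count(cell)
    for column, cell in row.items()
    if parse_count(cell) is not None
}
# 'counted' keeps CII (an explicit zero) and CIII; CI and CIV are dropped.
```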

I've opened a new PR with those changes and some additional corrections: #105

Metadata

The PI also responded to my earlier questions about metadata:

  1. Resource Contacts: Jeffrey Runge, Lee Karp Boss
  2. Metadata Creator: Dylan Pugh
  3. Citation: no preference, feel free to generate

He was unsure about the license question, so I'm following up with a few other people on that. Is this in reference to the source data's existing license, or the license which will be applied to the DwC files?

albenson-usgs commented 2 years ago

Does this page help with the licenses question? It's the license that will be applied to the DwC files. Note that they must select one of three licenses or the data cannot be published to OBIS and GBIF: CC-0, CC-BY, CC-BY-NC.

Dylan-Pugh commented 2 years ago

Hi @albenson-usgs - just wanted to circle back here! I heard back from the PI, and we'd like to use CC-BY for the license.

albenson-usgs commented 2 years ago

Does that mean we're a go for publishing? Should I load what's here into the OBIS-USA IPT and publish to OBIS and GBIF?

Dylan-Pugh commented 2 years ago

Sure thing - that sounds good to me!

albenson-usgs commented 2 years ago

Dylan I'm just doing a quick recheck before I publish. Previously there was only one event with no occurrences but now there are 88 events with no occurrences- is that accurate?

Also the negative sign is missing from 13 of the longitude values.

[Image: NegSignMissingLongitude]

Dylan-Pugh commented 2 years ago

Thanks Abby - I'm correcting the longitude values in the source data now, and will also verify the missing occurrences.

Dylan-Pugh commented 2 years ago

Hi @albenson-usgs - sorry for the super long delay in getting back to you. I've corrected the issue with the missing negative signs, and confirmed that there are 88 events with no occurrences - so the DwC files should be correct.
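Checks like these can be scripted so they run before each handoff. The sketch below is illustrative only: the function names and sample rows are made up, and the longitude check assumes every station in the dataset lies in the western hemisphere, so any positive decimalLongitude is suspect.

```python
def events_without_occurrences(events, occurrences):
    """Return eventIDs that no occurrence row refers to."""
    used = {occ["eventID"] for occ in occurrences}
    return [ev["eventID"] for ev in events if ev["eventID"] not in used]


def suspicious_longitudes(events):
    """Return eventIDs whose longitude is positive.

    Assumption: all stations are west of the prime meridian, so a
    positive value most likely means a dropped negative sign.
    """
    return [ev["eventID"] for ev in events if float(ev["decimalLongitude"]) > 0]


# Made-up rows for illustration.
events = [
    {"eventID": "event1", "decimalLongitude": "-69.86"},
    {"eventID": "event2", "decimalLongitude": "69.86"},
    {"eventID": "event3", "decimalLongitude": "-69.86"},
]
occurrences = [
    {"eventID": "event1", "occurrenceID": "event1_occ1"},
    {"eventID": "event2", "occurrenceID": "event2_occ1"},
]

no_occurrences = events_without_occurrences(events, occurrences)
bad_longitudes = suspicious_longitudes(events)
```

An event with no occurrences is not necessarily an error, as in this dataset, so the first check is better treated as a count to confirm against the source data than as a hard failure.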

I've opened a PR with the changes here: https://github.com/ioos/bio_data_guide/pull/108

Hopefully we'll be all set once it's merged in!

MathewBiddle commented 2 years ago

I went ahead and merged it in.

albenson-usgs commented 2 years ago

@Dylan-Pugh is this a one off and will never be updated or will this be updated with more observations in the future? I'm trying to decide if I should include the dates in the title of the resource (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON 2004-2017) or not (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON)

Dylan-Pugh commented 2 years ago

This should be a one off! I don't think any new data will be added in the future.

albenson-usgs commented 2 years ago

Published! https://www1.usgs.gov/obis-usa/ipt/resource?r=gom_wbts_mesozooplankton

Dylan-Pugh commented 2 years ago

Thanks Abby & Matt!

albenson-usgs commented 2 years ago

Thank you! Teamwork!

albenson-usgs commented 2 years ago

Here's the dataset in OBIS https://obis.org/dataset/5ef55cd8-05a1-4569-8e17-ceb224e40f59 :-)

MathewBiddle commented 2 years ago

And GBIF - https://www.gbif.org/dataset/29651377-23c8-4f45-b439-693a1a23cee1!