ioos / bio_data_guide

Standardizing Marine Biological Data Working Group - An open community to facilitate the mobilization of biological data to OBIS.
https://ioos.github.io/bio_data_guide/
MIT License

[dataset]: UW Pelagic Hypoxia Hood Canal project, 2012-2013 Zooplankton dataset #168

Open emiliom opened 1 year ago

emiliom commented 1 year ago

Dataset Name

UWPHHCZoop

Link to DwC Data Files

https://github.com/nanoos-pnw/obis-keisterhczoop/tree/main/aligned_csvs

Link to Scripts

https://github.com/nanoos-pnw/obis-keisterhczoop

Link to "raw" Data Files

https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/sourcedata/bcodmo_dataset_682074_data.csv

Describe your dataset and any specific requests.

I've gotten input from the group on this dataset off and on since ~ Nov. 2021. I'm finally (:crossed_fingers:) finishing up the data alignment and will hopefully be able to submit to OBIS this month, or May at the latest. I'm creating this dataset issue here for better tracking. This dataset alignment is being done under NANOOS.

The processing code, source data, and (somewhat outdated) preliminary aligned data files are in this repo: https://github.com/nanoos-pnw/obis-keisterhczoop. I will be pushing an update to the code and aligned data files by tomorrow or so.

@albenson-usgs actually reviewed that earlier version a year ago, soon after the 2022 data mobilization workshop. I'll paste that exchange here as a new comment.

emiliom commented 1 year ago

From @albenson-usgs , 2022-3-18 (originally via email), with my replies below:

[ABBY] As expected everything looks great to me. I only had two things to ask about.

  1. There are 335 (345 - 10 parent events) unique events in the event file but the occurrence and emof files only have 271 unique events. Are there 64 events with no associated occurrences? Should these be absences?

[EMILIO] The event table actually has a 3-level hierarchy: cruises, station visits, and samples (the typology is stored in eventRemarks). There are 10 cruise events and 64 station visit events. There wouldn't be any occurrences associated with these. At this time I guess I haven't created any emof entries corresponding to these events either. I suppose I could create some, but it's not entirely clear what the value of doing that would be, relative to the information already included in the event file.
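The counts in that reply can be checked mechanically: 345 events minus 10 cruises and 64 station visits leaves 271 samples, matching the 271 unique events in the occurrence and emof files. Below is a minimal sketch of such a check, assuming the event type is stored in eventRemarks as described above. The toy rows and the labels "cruise", "stationVisit" and "sample" are hypothetical, not the actual values used in the repo.

```python
from collections import Counter

# Toy event table (hypothetical IDs and eventRemarks labels).
events = [
    {"eventID": "C1", "parentEventID": "", "eventRemarks": "cruise"},
    {"eventID": "C1_S1", "parentEventID": "C1", "eventRemarks": "stationVisit"},
    {"eventID": "C1_S1_N1", "parentEventID": "C1_S1", "eventRemarks": "sample"},
    {"eventID": "C1_S1_N2", "parentEventID": "C1_S1", "eventRemarks": "sample"},
]
# Toy occurrence table: only sample-level events are referenced.
occurrences = [
    {"occurrenceID": "o1", "eventID": "C1_S1_N1"},
    {"occurrenceID": "o2", "eventID": "C1_S1_N1"},
    {"occurrenceID": "o3", "eventID": "C1_S1_N2"},
]


def events_without_occurrences(events, occurrences):
    """Count, per event type, the events that no occurrence row points to."""
    used = {row["eventID"] for row in occurrences}
    return Counter(e["eventRemarks"] for e in events if e["eventID"] not in used)


missing = events_without_occurrences(events, occurrences)
print(dict(missing))
```

If only parent-level types show up in the result, as in this toy case, the unmatched events are containers (cruises, station visits) rather than candidate absences.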

[ABBY] 2. You might consider removing the columns with no data in them in the emof file (measurementValueID, measurementAccuracy, etc) unless they are placeholders to fill that data in coming up?

[EMILIO] Thanks! I don't think I'll use measurementAccuracy, so I'll remove it. I think I previously had a measurement type that was using measurementValueID. If I end up not using it, I'll make sure to remove it.
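For reference, dropping the all-empty columns can be done generically rather than by naming each one. This is a rough sketch using only the standard library, not the code in the repo. The helper `drop_empty_columns` and the toy emof rows are made up for illustration.

```python
import csv
import io


def drop_empty_columns(rows, fieldnames):
    """Return (kept_fieldnames, rows) without columns that are blank in every row."""
    kept = [f for f in fieldnames if any((r.get(f) or "").strip() for r in rows)]
    return kept, [{f: r[f] for f in kept} for r in rows]


# Toy emof content; the values are invented.
raw = (
    "eventID,measurementType,measurementValue,measurementValueID,measurementAccuracy\n"
    "E1,sampling net mesh size,335,,\n"
    "E2,sampling net mesh size,200,,\n"
)
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
kept, cleaned = drop_empty_columns(rows, reader.fieldnames)

# Write the reduced table back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
writer.writeheader()
writer.writerows(cleaned)
print(out.getvalue())
```

A column such as measurementValueID is kept automatically as soon as any row populates it, so the same step works whether or not that term ends up being used.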

albenson-usgs commented 1 year ago

Can you let me know when this is ready to review? It's unclear to me at this point 😊

emiliom commented 1 year ago

I will! It's not ready yet. I'm waiting for an internal review. But, I did intend to update the repo with the latest files, so I'll do that soon.

emiliom commented 1 year ago

@albenson-usgs just an update. I've updated the repo with corrections based on input from the data originator, plus adding dwciri: columns for sex and lifestage in the occurrence table. The code is also cleaner. The data originator team is now reviewing the aligned data and my code, to verify that it looks ok.

So, almost there! But hold off on your review until I hear back from the data originator.

FYI, we're lining up a new zooplankton dataset to align to DwC. It's very similar to this one (also in Puget Sound, and from the same data originator), but much larger.

emiliom commented 1 year ago

@albenson-usgs I've uploaded the final 3 DwC aligned csv's to the OBIS-USA IPT! I mapped each file to DwC but haven't created the metadata yet. Hopefully by the end of this week.

The name of the new resource is uwph_hoodcanalzoop.

One note: the newly accepted Event term eventType was not mapped.

albenson-usgs commented 1 year ago

Excellent news! I'm adding @sformel-usgs for awareness. I'm on a detail to USGCRP to work on the US National Nature Assessment and Steve is the node manager for OBIS-USA. I think GBIF hasn't yet added the new terms to the IPT. It looks like they are working on it and hopefully they will have it resolved soon https://github.com/gbif/pipelines/issues/926

emiliom commented 1 year ago

Thanks, @albenson-usgs and @sformel-usgs . Good to know eventType is in the pipeline but officially not supported by the IPT yet.

sformel-usgs commented 1 year ago

@emiliom glad to hear your progress! Do you want me to review the files in the IPT, or wait until you've completed the metadata?

emiliom commented 1 year ago

Thanks. I don't plan to make further changes to the data files. Feel free to review them now or wait until I've completed the metadata. Either way is fine with me, though I'd imagine that waiting to have the metadata in place (ie, the full context) may make things clearer for you when reviewing the files.

emiliom commented 1 year ago

When entering contacts in the metadata, it looks like one can choose from two sections (or both):

I can't tell the difference between those, except that "Associated Parties" provides the option to select a role from a pre-defined list. I think this is important. But "Resource Contacts" and "Resource Creators" are mandatory. I can't tell who should be added under those two Resource sections if I'm already going to populate a list of people under Associated Parties, with proper roles.

BTW, I'm using a published dataset as an example, to help guide me (this is my first IPT submission!). I don't see where the Resource entries are indicated there. Also, some people there have 2 or 3 roles, but in Associated Parties only one role can be assigned.

Can you provide some guidance? Thanks!

sformel-usgs commented 1 year ago

@emiliom this can be a gray area because these fields are designed to be flexible to accommodate the varying roles and structures in scientific research. It might be easiest to have a quick video chat and do some Q&A. Let me know if you're interested!

Just to zoom out for a bit, don't forget that these fields are capturing information to publish it as EML, and the IPT follows the GBIF Metadata Profile, essentially a custom flavor of EML. So, you can always dig into it by learning more about EML.

Two resources that will be helpful are the GBIF IPT User Manual and the OBIS Manual. You are correct that in EML, each credit is attributed individually. This means that you repeat the same information for a person, for each role they deserve credit for.
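To make the "one entry per role" point concrete, here is a small sketch that builds one EML-style `associatedParty` element per role for the same person. It only illustrates the repetition. The person, organization and role strings are hypothetical, the helper `associated_parties` is made up, and the IPT generates the actual EML from its own forms and predefined role list.

```python
import xml.etree.ElementTree as ET


def associated_parties(person, roles):
    """Build one <associatedParty> element per role for the same person."""
    elements = []
    for role in roles:
        party = ET.Element("associatedParty")
        name = ET.SubElement(party, "individualName")
        ET.SubElement(name, "givenName").text = person["givenName"]
        ET.SubElement(name, "surName").text = person["surName"]
        ET.SubElement(party, "organizationName").text = person["organizationName"]
        ET.SubElement(party, "role").text = role
        elements.append(party)
    return elements


# Hypothetical person with two roles (role strings are illustrative).
person = {"givenName": "Jane", "surName": "Doe", "organizationName": "Example University"}
parties = associated_parties(person, ["processor", "author"])
for p in parties:
    print(ET.tostring(p, encoding="unicode"))
```

The two printed elements differ only in their `role` child, which is the repetition described above.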


emiliom commented 1 year ago

Thanks for the follow up and the specific links to more documentation. Yeah, a quick video chat would be great. I'll follow up on the Standardizing Marine Bio Data Slack.

emiliom commented 11 months ago

FYI: People listed under "Basic Metadata > Metadata Providers" are also added to the automatic citation. I found that to be a good role to list me under, in addition to the "Processor" role under "Associated Parties".

emiliom commented 11 months ago

@sformel-usgs the dataset resource at the OBIS-USA IPT is now "ready". But I'm going to wait a bit for input on whether others should review it before hitting "Publish". Hopefully we'll be able to publish it next week!

Speaking of which, to publish the dataset, is clicking on the Publication > Publish link the only thing I need to do? Or will I also need to do something else, like manually clicking on the Visibility > Change link?

sformel-usgs commented 11 months ago

I see you got an answer in the slack, but I'll put it here for posterity:

Set the visibility to "Public" first, and then publish. Doing it the other way around isn't a showstopper, but you'll have to publish twice: if you first publish with the visibility set to private, that Darwin Core Archive won't be publicly accessible.

I'm a bit busy these first few weeks of the year, but I'll give it a review soon, in case I find anything we missed in our discussions.

sformel-usgs commented 10 months ago

@emiliom Thanks for your patience! In my review of your data, I found a few things I want to confirm before you hit publish, and a few things to discuss that don't need to hold up publishing. Also tagging @krichards-usgs, new OBIS-USA staff, for awareness, since she is learning about dataset review and publishing.

Before Publishing

  1. OBIS/GBIF expects coordinates in WGS84 (EPSG 4326), but right now yours are listed as 4386, an Icelandic system? If that is correct, it would be best to describe the current coordinates in verbatimCoordinates and provide WGS84 coordinates in the decimalLatitude/decimalLongitude fields. Sorry I didn't catch this sooner.

  2. The way eventDate is currently formatted, it will be interpreted as UTC, not local time. I just want to make sure this is correct since it is a typical thing that can sneak under the radar.

  3. Auto-citation is turned off, and I want to make sure you know that it will be overwritten with the automated version on GBIF, but not OBIS. If this isn't ok, I can work with you to restructure the metadata so it matches your desired citation.

Any point in time

  1. In description maybe say 'BCO-DMO' instead of 'previously published elsewhere', just so readers don't have to go looking for 'elsewhere'.

  2. Right now NANOOS and IOOS are credited in the 'additional metadata'. This is totally fine, but I'd like to check in with @MathewBiddle, in case we can help out the IOOS/RA credit tracking process by describing it in another way.

MathewBiddle commented 10 months ago

I have a notebook for helping with datum translation https://github.com/MathewBiddle/sandbox/blob/main/notebooks/datum_translation.ipynb

emiliom commented 10 months ago

Quick comment: the coordinates are in WGS84 (EPSG 4326). I'm glad you caught the mistake! Sigh.
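A typo like 4386 for 4326 is easy to catch with a quick pre-upload check. The sketch below assumes the datum is recorded in a geodeticDatum column in the form "EPSG:4326". The accepted spellings, the helper `check_row` and the toy rows are assumptions for illustration, not OBIS or IPT validation rules.

```python
# Assumption: spellings treated as WGS84 for this check.
ACCEPTED_DATUMS = {"EPSG:4326", "WGS84"}


def check_row(row):
    """Return a list of problems found in one event row (empty if none)."""
    problems = []
    datum = row.get("geodeticDatum", "").strip()
    if datum not in ACCEPTED_DATUMS:
        problems.append(f"unexpected geodeticDatum: {datum!r}")
    lat = float(row["decimalLatitude"])
    lon = float(row["decimalLongitude"])
    if not -90 <= lat <= 90:
        problems.append(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        problems.append(f"longitude out of range: {lon}")
    return problems


# Toy rows; coordinates and IDs are invented.
rows = [
    {"eventID": "E1", "decimalLatitude": "47.60", "decimalLongitude": "-122.95",
     "geodeticDatum": "EPSG:4326"},
    {"eventID": "E2", "decimalLatitude": "47.61", "decimalLongitude": "-122.96",
     "geodeticDatum": "EPSG:4386"},
]
report = {r["eventID"]: check_row(r) for r in rows}
for event_id, problems in report.items():
    print(event_id, problems)
```

This only flags a mislabeled datum. It cannot tell whether the coordinates themselves were recorded in a different system.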

I'll follow up later.

emiliom commented 10 months ago

> The way eventDate is currently formatted, it will be interpreted as UTC, not local time. I just want to make sure this is correct since it is a typical thing that can sneak under the radar.

Darn. I see that I incorrectly used "-0700" rather than "-07:00" to describe the UTC offset in proper iso8601 format. I assume that's the source of the problem, but can you confirm before I upload a corrected event file?

> Auto-citation is turned off, and I want to make sure you know that it will be overwritten with the automated version on GBIF, but not OBIS. If this isn't ok, I can work with you to restructure the metadata so it matches your desired citation.

Ah. The only reason I turned it off was to be able to control the author order. I didn't see any other easy way, short of re-entering contacts in the desired order. Is there an alternative?

> In description maybe say 'BCO-DMO' instead of 'previously published elsewhere', just so readers don't have to go looking for 'elsewhere'.

Will do.

Thanks for your review! I'm so glad you caught the basic mistakes with spatial and temporal coordinates. I'll provide an updated event table.

emiliom commented 10 months ago

An update on the UTC/timezone offset encoding. As it turns out, my strings are compliant iso8601 datetime strings! The UTC offset string is allowed to omit the : divider between the hour and minute parts, such that -0700 and -07:00 are both compliant. The form that uses a : is probably more common, though.

In my Python code, I was using the string datetime formatter strftime with the format string "%Y-%m-%dT%H:%M:%S%z". So, Python, not "me", was producing that valid UTC offset encoding. I just learned that strftime also provides the UTC offset formatting directive %:z to insert a colon divider, so I'll switch to that. (Update: darn, it was introduced very recently, in Python 3.12. I'd have to create a new conda environment, which could lead to new issues. Hmm)
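For Python versions before 3.12, there are ways to get the colon form without `%:z`. The sketch below shows two: `datetime.isoformat()`, which writes the offset as `-07:00` for timezone-aware values, and a small string fix-up applied to the `strftime` output. The timestamp and fixed offset are illustrative, and the helper `add_offset_colon` is hypothetical and assumes whole-minute offsets.

```python
from datetime import datetime, timedelta, timezone

# Fixed -07:00 offset and timestamp are illustrative only.
tz = timezone(timedelta(hours=-7))
dt = datetime(2012, 6, 11, 9, 30, 0, tzinfo=tz)

fmt = "%Y-%m-%dT%H:%M:%S%z"
basic = dt.strftime(fmt)                      # offset without a colon
extended = dt.isoformat(timespec="seconds")   # offset with a colon


def add_offset_colon(stamp):
    """Insert ':' into a trailing +HHMM/-HHMM offset (whole-minute offsets only)."""
    return stamp[:-2] + ":" + stamp[-2:]


print(basic)
print(extended)
print(add_offset_colon(basic))

# Python's own parser accepts both spellings and yields the same instant.
same_instant = datetime.strptime(basic, fmt) == datetime.strptime(extended, fmt)
print(same_instant)
```

Either route avoids creating a new environment just for the `%:z` directive.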

But, if IPT/OBIS is not accepting as valid a UTC offset string that lacks a colon divider, and just drops that part of the iso8601 datetime string (ie, reads the string as UTC), I think that's an important issue that should ultimately be fixed on their end.

sformel-usgs commented 10 months ago

@emiliom thanks for following up on it so quickly. Unfortunately, you were the victim of my tired brain! Your formatting is correct; I should have recognized that as UTC. I was in a bunch of meetings while you did a deep dive on ISO 8601, so I didn't stop you sooner. My sincere apologies!

However, I was able to fix the citation, so that's gotta be worth something :-) I used the copy function to make sure everyone was listed as an Associated Party and then reordered the Resource Creators using the copy function. Then I removed the Associated Parties I had added. Only took 5 min. I double checked it against your custom citation but give it a look all the same.

@MathewBiddle Do you have any strong feelings about how we give credit to NANOOS and IOOS?

emiliom commented 10 months ago

No worries about the iso8601 rabbit hole.

Thanks for "fixing" the citation to preserve the desired author order!

So, now it's done. Let me know if you plan to do a final quick look before I hit "Publish".

sformel-usgs commented 10 months ago

Perfect, just gave it a quick once-over. Good to go, publish when ready!

emiliom commented 10 months ago

DONE!! Thank you so much, @sformel-usgs and @albenson-usgs!

https://ipt-obis.gbif.us/resource?r=uwph_hoodcanalzoop

I can find it on OBIS at https://obis.org/dataset/5463caa4-b929-477f-ae6c-7007b6d91baa, though it looks like some of the information and plots are still getting populated. No complaint here -- it's been just a few minutes since I hit publish!

It'll still be great to hear from @MathewBiddle about the preferred way to give NANOOS & IOOS credit. What I wrote is here.

MathewBiddle commented 10 months ago

@sformel-usgs we should discuss a common approach for citing RAs and IOOS in EML metadata records. Right now it's a hodgepodge of implementations. I think the SCCOOS CALHABMAP dataset might be a good example to look at https://obis.org/dataset/c9aaa0e9-8f6c-4553-a014-a857baba0680.

I think @emiliom's approach by adding that info as additional metadata is sound, but I'm wondering if we could be more deliberate. I don't see any of the contacts associated with NANOOS, so that won't work. Maybe including that additional information in the abstract will ensure it's more visible?

To try to be consistent between different formats and services, I started looking through the IOOS Metadata Profile v1.2. From what I can tell, there isn't a specific place to identify IOOS in the IOOS recommended metadata profile. I do see the use of contributor_name to identify the regional association in the netCDF construct. So, hear me out, I think we should look at how that might map back out to a specific element in EML.

MathewBiddle commented 10 months ago

Also, I think @albenson-usgs had a way to link to GH repositories for the processing scripts (here) and the source dataset (10.1575/1912/bco-dmo.682074.1) in the EML. It would be good to provide a little more connectivity if we can, instead of buried in the sampling methods.

But, maybe that was just through sciencebase? For example, see Related External Resources at https://www.sciencebase.gov/catalog/item/628698b2d34e3bef0c9a8b02

emiliom commented 10 months ago

Thanks, @MathewBiddle . It sounds like the topic of best practices for citing and acknowledging RAs and IOOS (and individuals associated with them, as appropriate) should be turned into a new issue in this repo. It's of wider interest, and it could lead to recommendations stated in either the Bio Data Guide, the IOOS Metadata Profile, or both.

> I think @emiliom's approach by adding that info as additional metadata is sound, but I'm wondering if we could be more deliberate. I don't see any of the contacts associated with NANOOS, so that won't work. Maybe including that additional information in the abstract will ensure it's more visible?

My work on this dataset was in a NANOOS capacity. But in my organization affiliation, I listed myself under the University of Washington. That dual role / affiliation is common with people in RAs. But maybe for this context it would have been better to list my affiliation as NANOOS?

The abstract doesn't say anything about funding support, for either the original data collection and analysis or the data alignment to Darwin Core. That could be changed. But there is a difference between this case and the SCCOOS CALHABMAP example you point out. Based on my reading of the abstract alone, it sounds like SCCOOS supported the collection and analysis of the data as well as the submission to OBIS. In the dataset here, the collection and analysis happened independently of NANOOS. NANOOS' role was solely in aligning the data to Darwin Core (plus shepherding some revisions) and submitting to OBIS. Should a best practice recommendation for the abstract be to include very brief acknowledgments of all relevant funding support?

MathewBiddle commented 10 months ago

> Should a best practice recommendation for the abstract be to include very brief acknowledgments of all relevant funding support?

That scares me 😨 as it could get very long. I know NOAA Central Library has been looking at identifiers for awards. I'll see if there has been any progress on that.

For US MBON datasets, "Mathew Biddle will be listed as the distributor with US MBON as the institution." That allows US MBON to have a dashboard on OBIS for all associated datasets https://obis.org/institute/23070. @mwengren do we want one for US IOOS? If so, we can follow the process documented here https://ioos.github.io/mbon-docs/metadata-eml.html#appendix-how-to-create-an-oceanexpert-institution

emiliom commented 9 months ago

This is a tangent, but your mention of US MBON and the MBON docs link you provided pointed me to it: should we (NANOOS / me) take additional steps to get the dataset from OBIS into the MBON Portal? Based on what I see at https://ioos.github.io/mbon-docs/mbon-data-flow.html#loading-into-mbon-portal, it does look that way. Naively, I thought that once the data was on OBIS, we'd be "done".

MathewBiddle commented 9 months ago

Hi @emiliom , thank you for your patience. Yes, we can add the data to the MBON data portal. And yes, once it's mobilized to OBIS we can pull it relatively easily, along with the appropriate metadata. For creating a layer in the MBON Data Portal, we prefer to have a conversation on what it is you would like to visualize in the case of these data. So, when you have the time, please browse to https://mbon.ioos.us/ and use the submit feedback button to request these data be added to the portal. Then, the team will reach out and discuss what we can do from there.

Feel free to ask any clarifying questions!