ioos / bio_data_guide

Standardizing Marine Biological Data Working Group - An open community to facilitate the mobilization of biological data to OBIS.
https://ioos.github.io/bio_data_guide/
MIT License
[dataset]: MarineGEO Global Seagrass Habitat Monitoring Data #186

Open mlonneman opened 1 year ago

mlonneman commented 1 year ago

Dataset Name

marinegeo-seagrass-monitoring

Link to DwC Data Files

https://www.dropbox.com/sh/h9bapgplj1bmfk4/AADnz_IF8hOo89fi6o3sMeDBa?dl=0

Link to Scripts

No response

Link to "raw" Data Files

No response

Describe your dataset and any specific requests.

I've loaded everything into the US IPT, but have included a Dropbox link for anyone who wants to view the data.

Besides any issues I've missed, I'd like to get feedback on how we've structured the event table. The seagrass surveys are taken along a transect. At certain meter points along the transect, cover and density measurements are taken and epifauna samples are collected. Because these aren't the same sample (cover and density use different quadrat sizes, and the epifauna is collected outside the quadrat area), we've broken each sample into a separate event. Here's a graphic of how I've structured it:

image
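As a rough illustration of that structure, the sketch below builds event-table rows for one transect (the parent event) and a separate child event per sample type at each meter point. The eventID scheme, the `make_events` helper, and the sample-type suffixes are hypothetical and are not the identifiers used in the actual dataset.

```python
# Hypothetical sketch: one transect parent event with a separate child
# event per sample type at each meter point. All IDs are made up.

def make_events(transect_id, samples):
    """Return event-table rows for a transect and its child sampling events.

    `samples` is a list of (meter_point, sample_type) tuples.
    """
    # The transect itself is the parent event and has no parentEventID.
    rows = [{"eventID": transect_id, "parentEventID": ""}]
    for meter, sample_type in samples:
        rows.append({
            # Cover, density, and epifauna get distinct eventIDs even when
            # they are taken at the same meter point.
            "eventID": f"{transect_id}:{meter}m:{sample_type}",
            "parentEventID": transect_id,
        })
    return rows


events = make_events(
    "site-A:transect-1",
    [(4, "cover"), (8, "cover"), (8, "density"), (8, "epifauna")],
)
for row in events:
    print(row)
```

Keeping the three sample types as sibling events under the transect means each one can carry its own quadrat size or sampling area without implying they were the same physical sample.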

Finally, I'm including the abstract for a bit more context of the data:

This dataset contains seagrass time-series monitoring data collected by the Smithsonian Institution’s MarineGEO (Marine Global Earth Observatory) and its network partners. The dataset includes measurements on seagrass density, seagrass cover, and epifaunal invertebrates. The data is collected using MarineGEO’s standard survey design for sampling seagrass habitats. MarineGEO seagrass monitoring at each site is conducted along three 50-m fixed transects intended to be permanent and sampled regularly (usually annually). Transects are located parallel to shore and along the shallow (inshore), middle (interior), and deep (offshore) parts of the seagrass bed, although variation can occur due to local site conditions. The seagrass cover measurements are taken every 4 meters along the transect (n = 12 per transect), the seagrass density measurements are taken alongside the cover measurements at every other replicate (n = 6 per transect), and epifaunal community samples are collected every 8 meters (n = 3 per transect). Seagrass cover is measured by placing a quadrat (recommended size is 50 x 50 cm) alongside the transect line and estimating the cover of each seagrass species, other sessile organisms, and bare substratum. Seagrass density is measured by counting the number of seagrass shoots within a quadrat (recommended size is 25 x 25 cm). Epifauna samples are collected by randomly selecting an area ~ 1 meter to any side of the transect and quickly removing seagrass shoots with associated epifauna into a mesh bag; the animals are later identified in a laboratory by the data collectors. All data has undergone QA/QC by MarineGEO Central at the Smithsonian after submission by MarineGEO partners. Note that the dataset may include submerged aquatic vegetation (not technically seagrass) in brackish to tidal freshwater sites collected with the MarineGEO seagrass protocol.

Thank you all for the feedback!

albenson-usgs commented 1 year ago

Thanks Michael! The data look really good. Only a few minor changes and questions. The occurrence and emof files are good to go. The only comments I had are on the event file.

  1. institutionID needs to be an identifier (right now your institutionCode is exactly the same as your institutionID). I recommend people use a ROR for their institutionID, but I couldn't find MarineGEO in there. It might be worth considering adding it. Until you have a unique identifier of some sort, I would recommend dropping institutionID.
  2. It would be better to put the license URL in license, so use "https://creativecommons.org/licenses/by/4.0/" instead of "Attribution" so people know exactly what you mean.
  3. Really excited to see the DOI to the sampling protocol! 🙌
  4. There are some missing parentEventIDs, e.g. 2022-01-26:MarineGEO:PAN-BDT_seagrass_2022:STRI Seagrass:2 is referenced as a parent but does not exist as an eventID. To find them all you can use obistools::check_eventids(marinegeo_seagrass_monitoring_event) in R.
  5. There are 1185 eventIDs but only 715 in the occurrence file. That means there are 470 events with no occurrences. Just checking that's accurate.
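For anyone not working in R, the two consistency checks in items 4 and 5 can be approximated in plain Python. This is only a sketch of the same kind of check, not a port of obistools; the function names and the toy rows are hypothetical.

```python
# Hypothetical stand-ins for rows of the event and occurrence tables.
events = [
    {"eventID": "T1", "parentEventID": ""},
    {"eventID": "T1:cover", "parentEventID": "T1"},
    {"eventID": "T2:cover", "parentEventID": "T2"},  # "T2" is never defined
    {"eventID": "T1:density", "parentEventID": "T1"},
]
occurrences = [{"eventID": "T1:cover"}, {"eventID": "T2:cover"}]


def missing_parents(events):
    """Return parentEventIDs that never appear as an eventID."""
    event_ids = {e["eventID"] for e in events}
    return sorted({
        e["parentEventID"]
        for e in events
        if e["parentEventID"] and e["parentEventID"] not in event_ids
    })


def events_without_occurrences(events, occurrences):
    """Return eventIDs that no occurrence row refers to."""
    used = {o["eventID"] for o in occurrences}
    return sorted(e["eventID"] for e in events if e["eventID"] not in used)


print(missing_parents(events))
print(events_without_occurrences(events, occurrences))
```

Note that the second check also lists parent events such as the transect row, which have child events but no occurrences of their own, so a non-zero count is not necessarily an error.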

Now for your questions.

mlonneman commented 1 year ago

Thank you, Abby! I hope to make the necessary updates in the next couple days.

In the meantime, question about authors: Is there a limit/best practice for the number of associatedParty individuals associated with a dataset? I'd like to credit our partner data collectors as "content providers", but that could grow to be > 100 individuals in a few years. Will that become a problem?

albenson-usgs commented 1 year ago

Apologies for the delay Michael. I asked the GBIF and OBIS helpdesks about this and they haven't responded yet. NCEI said it's no problem for them.

mlonneman commented 1 year ago

OK, I've updated the files in the IPT! All missing parentEventIDs and empty eventIDs (eventIDs that shouldn't have existed in the event table because they had no occurrences/child events) should be resolved. I've also included coords and depth for all child events (non-transect events).

Thank you for feedback on the associated parties.

cperaltab commented 1 year ago

Hi! This is a very interesting discussion. After reviewing the data spreadsheets shared by Michael, I think that by putting the parentEventID into the occurrence extension table, the "inheritance" issue could be solved and you would not need to repeat the coordinates and depth throughout your event table. If I'm not misunderstanding, once you define your parentEvent coordinates, these will be linked to your occurrence record ONLY IF you put the parentEventID in this table together with the eventID. I'm attaching an example spreadsheet to see if I'm understanding this correctly and to see if it helps to understand this structure. I'm happy to continue the discussion and find out the best practice for this data formatting task. Looking forward to any feedback.

example.xlsx

albenson-usgs commented 1 year ago

Thank you Carolina! Unfortunately this does not solve the problem that in GBIF there will not be coordinates for the occurrences since they are at the quadrat level. Also my understanding is that if you are using event core you would not include parentEventID in the occurrence extension because it needs to be documented in the event table and you would not want to repeat it in the occurrence extension since eventID is what links the event core with the occurrence extension. Happy to discuss this at the next SMBD meeting if needed.
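To make the linkage concrete, here is a minimal sketch in which the occurrence row carries only eventID and the parent is found through the event table. The rows, coordinates, and the `resolve_coordinates` helper are hypothetical; the helper only shows what "inheriting" coordinates from a parent would involve for someone processing the tables themselves, and says nothing about what any aggregator actually does.

```python
# Hypothetical event core rows, keyed by eventID. Only the transect
# (parent) has coordinates in this toy example.
events_by_id = {
    "T1": {"parentEventID": "", "decimalLatitude": 9.0, "decimalLongitude": -82.0},
    "T1:cover": {"parentEventID": "T1", "decimalLatitude": None, "decimalLongitude": None},
}

# The occurrence extension links to the event core through eventID only.
occurrence = {"occurrenceID": "occ-1", "eventID": "T1:cover"}


def resolve_coordinates(event_id, events_by_id):
    """Walk up parentEventID links until an event with coordinates is found."""
    seen = set()  # guards against accidental cycles in the hierarchy
    while event_id and event_id not in seen:
        seen.add(event_id)
        event = events_by_id.get(event_id)
        if event is None:
            return None  # dangling parentEventID
        if event.get("decimalLatitude") is not None:
            return (event["decimalLatitude"], event["decimalLongitude"])
        event_id = event.get("parentEventID")
    return None


# The parent is recoverable without repeating parentEventID on the occurrence.
parent_id = events_by_id[occurrence["eventID"]]["parentEventID"]
print(parent_id, resolve_coordinates(occurrence["eventID"], events_by_id))
```

If the child event row itself carries coordinates, the lookup stops at the first step, which is the situation after coordinates and depth were added to all child events.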

cperaltab commented 1 year ago

Hmm... I'm confused now. In the past I have been putting the parentEventID in my occurrence table. Now I need to go back a step and review what we need the parentEventID for. I agree we should discuss this at the next SMBD meeting. Thank you Abby!

albenson-usgs commented 1 year ago

One follow-up on the topic of multiple associated parties. GBIF asks: Is there a reason for putting data contributors as content providers in the metadata rather than adding them in the dataset itself as collectors/identifiers etc.?

I admit this is a good question. Is there a reason not to document these folks in Darwin Core terms like recordedBy, identifiedBy, georeferencedBy, or measurementDeterminedBy?

They did go on to say: There is no technical limit to the amount of content providers you can include. However, if we reach thousands, there might be some performance issues.

ymgan commented 1 year ago

"Is there a reason not to document these folks in Darwin Core terms like recordedBy, identifiedBy, georeferencedBy, or measurementDeterminedBy?"

I second this. It would be great to include the ID terms, e.g. recordedByID and identifiedByID :), because people can write their names differently or share the same name. This information will also be harvested by Bionomia, and the people will also get credit when users download aggregated data from GBIF/OBIS.
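A small sketch of what that could look like on an occurrence row, assuming multiple people are joined with " | " as Darwin Core recommends for lists. The `add_recorders` helper, the names, and the ORCID URLs are placeholders invented for illustration.

```python
def add_recorders(occurrence, people):
    """Return a copy of the occurrence with recordedBy / recordedByID filled in.

    `people` is a list of dicts with "name" and "orcid" keys; names and
    identifiers are kept in the same order so they can be paired up.
    """
    return {
        **occurrence,
        "recordedBy": " | ".join(p["name"] for p in people),
        "recordedByID": " | ".join(p["orcid"] for p in people),
    }


# Placeholder people and identifiers, not real collectors.
collectors = [
    {"name": "A. Collector", "orcid": "https://orcid.org/0000-0000-0000-0001"},
    {"name": "B. Collector", "orcid": "https://orcid.org/0000-0000-0000-0002"},
]

row = add_recorders({"occurrenceID": "occ-1", "eventID": "T1:cover"}, collectors)
print(row["recordedBy"])
print(row["recordedByID"])
```

The same pattern would apply to identifiedBy and identifiedByID for the people who identified the epifauna in the lab.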

mlonneman commented 1 year ago

I mainly wanted to do this for acknowledgement purposes (associating data collectors with the publication on the IPT). I hadn't heard of Bionomia; it's great that it provides credit based on that metadata.