ioos / bio_data_guide

Standardizing Marine Biological Data Working Group - An open community to facilitate the mobilization of biological data to OBIS.
https://ioos.github.io/bio_data_guide/
MIT License

[dataset]: DFO BioChem Plankton Data #222

Open EOGrady21 opened 9 months ago

EOGrady21 commented 9 months ago

Contact details

emily.ogrady@dfo-mpo.gc.ca

Dataset Title

2023 AZMP Zooplankton Data

Describe your dataset and any specific challenges or blockers you have or anticipate.

Data are available in raw Excel format with the headers: mission, date, station, tow, gear ID, event ID, sample ID, depth, split, aliquot, taxa, stage, sex, count

The data are also available through an SQL database where there is additional metadata, but it would be preferred to load data from the raw spreadsheets.

The main hurdle to submission is formatting the data to meet OBIS requirements. Ideally, an automated pipeline would make formatting a less resource-intensive task.

Info about "raw" Data Files.

No response

EOGrady21 commented 7 months ago

After discussion with the workshop leaders, I decided to focus on a smaller 2022 dataset for my first publication: BBMP2022_plankton.csv

I have an initial variable map:

| BioChem | DWC |
| -- | -- |
| MISSION_NAME | |
| MISSION_DESCRIPTOR | eventID |
| PROTOCOL | |
| START_DATE_EVENT | eventDate |
| START_DATE_HEAD | |
| COLLECTOR_STATION_NAME | |
| COLLECTOR_EVENT_ID | eventID |
| COLLECTOR_COMMENT_EVENT | eventRemarks |
| START_DEPTH | minimumDepthInMeters |
| END_DEPTH | maximumDepthInMeters |
| MESH_SIZE | measurementType: mesh size |
| COLLECTOR_SAMPLE_ID | eventID |
| COLLECTOR_HEADERS | |
| COLLECTOR_COMMENT_HEADERS | |
| MIN_SIEVE | measurementType: minimum sieve |
| MAX_SIEVE | measurementType: maximum sieve |
| SPLIT_FRACTION | measurementType: split fraction |
| NATIONAL_TAXONOMIC_SEQ | |
| COLLECTOR_TAXONOMIC_ID | verbatimIdentification |
| TAXONOMIC_NAME | |
| MODIFIER | |
| STAGE | |
| MOLT_NUMBER | measurementType: molt number |
| SEX | sex |
| COUNTS | individualCount |
| WET_WEIGHT | measurementType: wet weight |
| DRY_WEIGHT | measurementType: dry weight |
| COLLECTOR_COMMENT_GEN | |
| SOURCE | |
| CREATED_DATE | |
| PROD_CREATED_DATE | |
| MIN_LAT | decimalLatitude |
| MAX_LAT | |
| MIN_LON | decimalLongitude |
| LEADER | |
| PLATFORM | |
| START_DATE | eventDate |
| END_DATE | eventDate |
| PHASE_OF_DAYLIGHT | |
| SOUNDING | measurementType: total bottom depth |
| VOLUME | measurementType: volume |
| LARGE_PLANKTON_REMOVED | |
| COLLECTION_METHOD_NAME | measurementType: collection method |
| PROCEDURE_NAME | measurementType: procedure |
| VOLUME_METHOD_NAME | measurementType: volume method |
| HEADER_START_LAT | decimalLatitude |
| HEADER_END_LAT | |
| HEADER_START_LON | decimalLongitude |
| HEADER_END_LON | |
| HEADER_END_TIME | eventTime |
| HEADER_START_TIME | eventTime |
| HEADER_END_DATE | eventDate |
| LIFE_HISTORY_NAME | lifeStage |
| BEST_NODC7 | |
| EVENT_START_TIME | eventTime |
| EVENT_END_TIME | eventTime |
| EVENT_MIN_LON | decimalLongitude |
| EVENT_MAX_LON | |
| EVENT_MIN_LAT | decimalLatitude |
| EVENT_MAX_LAT | |
| EVENT_END_DATE | eventDate |
| UTC_OFFSET | |
| GEAR_TYPE | measurementType: gear type |
| GEAR_MODEL | measurementType: gear model |
| GEAR_SIZE | measurementType: gear size |
| TSN_ITIS | |
| AUTHORITY | |
| TSN | |
| APHIAID | identificationID |
| PRESERVATION_NAME | measurementType: preservation |
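As a rough illustration of how the one-to-one part of this map could be applied, here is a minimal Python sketch (the planned pipeline may well look different, and an R version would follow the same idea). The dictionary holds only a subset of the pairs listed above; the function name `rename_columns` and the sample row are hypothetical and made up for the example.

```python
import csv
import io

# Subset of the BioChem -> Darwin Core map above (one-to-one terms only).
BIOCHEM_TO_DWC = {
    "COLLECTOR_EVENT_ID": "eventID",
    "START_DATE_EVENT": "eventDate",
    "COLLECTOR_COMMENT_EVENT": "eventRemarks",
    "START_DEPTH": "minimumDepthInMeters",
    "END_DEPTH": "maximumDepthInMeters",
    "COLLECTOR_TAXONOMIC_ID": "verbatimIdentification",
    "SEX": "sex",
    "COUNTS": "individualCount",
    "LIFE_HISTORY_NAME": "lifeStage",
}


def rename_columns(rows, mapping=BIOCHEM_TO_DWC):
    """Keep only the mapped columns and rename them to their Darwin Core term."""
    return [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in rows
    ]


# Made-up sample data, for illustration only (not taken from BBMP2022_plankton.csv).
SAMPLE_CSV = """COLLECTOR_EVENT_ID,START_DATE_EVENT,START_DEPTH,END_DEPTH,SEX,COUNTS,SOURCE
EVT-001,2022-01-05,0,70,F,12,example
"""

records = rename_columns(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(records[0])
```

Unmapped columns such as `SOURCE` are simply dropped here, which mirrors the blank cells in the map.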

I still need to confirm the measurementType names. I see that it is recommended "to use a controlled vocabulary", but I'm not sure which vocabulary would cover metadata this specific.

Note that a lot of metadata is not being translated, for simplicity. BioChem currently includes WoRMS, TSN and BioChem identifications. Some other metadata, such as location, has multiple points in BioChem (start and end points), which will be reduced to a single point in Darwin Core.
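One simple way to reduce start and end points to a single point is an arithmetic midpoint. The Python sketch below is only an illustration under stated assumptions: the tow does not cross the antimeridian, it is short enough that a plain average is acceptable, and the coordinates shown are made up. The function name `midpoint` is hypothetical.

```python
def midpoint(min_lat, max_lat, min_lon, max_lon):
    """Return the arithmetic midpoint of a bounding box as (lat, lon).

    Assumption: no antimeridian crossing, and the box is small enough
    that averaging degrees is a reasonable approximation.
    """
    return ((min_lat + max_lat) / 2, (min_lon + max_lon) / 2)


# Made-up coordinates, for illustration only.
lat, lon = midpoint(44.68, 44.70, -63.66, -63.62)
print(round(lat, 4), round(lon, 4))
```

If the spread between start and end matters, it could be recorded separately (for example as an uncertainty) rather than discarded.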

EOGrady21 commented 7 months ago

My goal is to develop a simple R package to process this dataset, given the volume of data I hope to eventually push through. This will make the process as reproducible and efficient as possible.

The steps of processing will be:

MathewBiddle commented 7 months ago

@EOGrady21 give us a shout if you need any help!

EOGrady21 commented 7 months ago

A more polished version of my column mapping:

| OBIS | BioChem | notes | tag |
| -- | -- | -- | -- |
| occurrenceID | | generate programmatically | occurrence |
| basisOfRecord | | FILL PROGRAMMATICALLY [materialSample] | occurrence |
| scientificName | | FILL FROM APHIAID W WORRMS | occurrence |
| scientificNameID | APHIAID | | occurrence |
| occurrenceStatus | | present or absent | occurrence |
| verbatimIdentification | COLLECTOR_TAXONOMIC_ID | | occurrence |
| sex | SEX | add dwciri:sex and standardize | occurrence |
| taxonRank | | FILL FROM APHIAID W WORRMS | occurrence |
| kingdom | | FILL FROM APHIAID W WORRMS | occurrence |
| phylum | | FILL FROM APHIAID W WORRMS | occurrence |
| class | | FILL FROM APHIAID W WORRMS | occurrence |
| order | | FILL FROM APHIAID W WORRMS | occurrence |
| family | | FILL FROM APHIAID W WORRMS | occurrence |
| genus | | FILL FROM APHIAID W WORRMS | occurrence |
| scientificNameAuthorship | | FILL FROM APHIAID W WORRMS | occurrence |
| lifeStage | LIFE_HISTORY_NAME | add dwciri:lifeStage and standardize | occurrence |
| eventID | MISSION_DESCRIPTOR | eventType: cruise | event |
| eventID | COLLECTOR_EVENT_ID | eventType: event | event |
| eventDate | | START_DATE - END_DATE, eventType: cruise | event |
| eventDate | START_DATE_EVENT | eventType: event | event |
| eventTime | EVENT_START_TIME | be sure to format with UTC_OFFSET | event |
| decimalLatitude | | FILL FROM MIN_LAT MAX_LAT USING OBISTOOLS::CALCULATE_CENTROID | event |
| decimalLongitude | | FILL FROM MIN_LON MAX_LON USING OBISTOOLS::CALCULATE_CENTROID | event |
| geodeticDatum | | WGS84 | event |
| minimumDepthInMeters | START_DEPTH | | event |
| maximumDepthInMeters | END_DEPTH | | event |
| samplingProtocol | | CONCATENATE GEAR_TYPE, GEAR MODEL, GEAR SIZE, PRESERVATION, collection_method_name, procedure | event |
| eventRemarks | COLLECTOR_COMMENT_EVENT | | event |
| individualCount | COUNTS | | emof |
| sampleSizeValue | VOLUME | needs sampleSizeUnit | emof |
| measurementValue | WET_WEIGHT | measurementType: Zooplankton wet weight biomass, measurementTypeID: SDN:P02::GP079 | emof |
| measurementType: dry weight | DRY_WEIGHT | measurementType: Zooplankton dry weight biomass per unit volume of the water column, measurementTypeID: SDN:P02::MSBD | emof |
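Several rows in this mapping are filled programmatically rather than copied. The Python sketch below illustrates three of them under loudly stated assumptions: the occurrenceID scheme (event ID, AphiaID and a running index joined by underscores) is purely hypothetical, UTC_OFFSET is assumed to be a number of hours, and dates and times are assumed to arrive as `YYYY-MM-DD` and `HH:MM`. The LSID prefix is the form WoRMS uses for taxon names; all sample values are made up.

```python
from datetime import datetime, timedelta, timezone


def make_occurrence_id(event_id, aphia_id, index):
    """Hypothetical ID scheme: unique as long as index is unique within the event."""
    return f"{event_id}_{aphia_id}_{index}"


def scientific_name_id(aphia_id):
    """Express an AphiaID as a WoRMS LSID for scientificNameID."""
    return f"urn:lsid:marinespecies.org:taxname:{aphia_id}"


def event_datetime_iso(date_str, time_str, utc_offset_hours):
    """Combine a local date and time into ISO 8601 with an explicit UTC offset.

    Assumptions: date is YYYY-MM-DD, time is HH:MM, and the offset is in hours.
    """
    local = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M")
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=tz).isoformat()


# Made-up values, for illustration only.
print(make_occurrence_id("EVT-001", 123456, 1))
print(scientific_name_id(123456))
print(event_datetime_iso("2022-01-05", "13:30", -4))
```

Whatever ID scheme is chosen, the main requirement is that it stays stable between runs so that republished records keep the same identifiers.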

I matched some of the measurements with P02 terms. Note the reduction in metadata: this came from discussion with SMEs about where most of the scientific value lies. The result is a more manageable map that still gives high-value information.
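To show how the two weight columns could become long-format eMoF records with the P02 terms from the mapping, here is a minimal Python sketch. The function name `to_emof` and the sample row are hypothetical. Units are deliberately left out because the mapping does not state them.

```python
# measurementType / measurementTypeID pairs copied from the mapping above.
MEASUREMENTS = {
    "WET_WEIGHT": ("Zooplankton wet weight biomass", "SDN:P02::GP079"),
    "DRY_WEIGHT": (
        "Zooplankton dry weight biomass per unit volume of the water column",
        "SDN:P02::MSBD",
    ),
}


def to_emof(row, occurrence_id):
    """Turn the weight columns of one wide BioChem row into eMoF records."""
    records = []
    for column, (mtype, mtype_id) in MEASUREMENTS.items():
        value = row.get(column)
        if value in ("", None):
            continue  # skip measurements that were not recorded
        records.append(
            {
                "occurrenceID": occurrence_id,
                "measurementType": mtype,
                "measurementTypeID": mtype_id,
                "measurementValue": value,
                # measurementUnit omitted: units are not given in the mapping.
            }
        )
    return records


# Made-up row, for illustration only.
emof = to_emof({"WET_WEIGHT": "0.42", "DRY_WEIGHT": ""}, "EVT-001_123456_1")
print(emof)
```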

Next step, start coding a pipeline! :)