isamplesorg / metadata

Collation of metadata examples and notes for the project
https://isamplesorg.github.io/metadata/

Finalize categories and keywords for SESAR records #22

Open dannymandel opened 2 years ago

dannymandel commented 2 years ago

Quoting @smrgeoinfo:

In the SESAR_Metadata_Field_counts Google doc I added two worksheets: one with the SESAR fields that are populated in more than half their records (25 fields, 6 of which are for internal data management and not relevant), and a ‘lessThan50Percent’ sheet for the fields populated in less than half (56 fields, 1 internal). These are all mapped into fields in the full iSamples draft metadata model (we’re not using that yet…). Unfortunately, the field names are not the same as what we get in the JSON-LD from the SESAR API, so there’s another mapping exercise (database field name to JSON-LD key); hopefully I’ll get to that today.

For mapping sample descriptions to the appropriate controlled vocabulary terms:

- hasMaterialCategory (materialType): base on the material key in the JSON-LD.
- hasSpecimenCategory (SpecimenType): I think we can get this from the sampleType key in the JSON-LD.
- hasContextCategory (SampledFeatureType): this one’s more complicated. Here’s a start at the logic:
  - Is it an Ocean Drilling Program sample (assume it’s core from the ocean floor), or is the material rock or a subtype of rock, or is the sampleType ‘Core’? Then hasContextCategory is RockBody.
  - Is the material type coral? (Ideally verify that it’s from a living coral species; check field name and geologic age.) Then hasContextCategory is MarineBiome.
  - Is the platform type a submersible (e.g. Alvin)? If the material type is rock or gas, hasContextCategory is ‘Marine water body bottom’.
  - primaryLocationType is good, but only populated in 268,454 records.

So what would help is:

- Unique values for {material, sampleType}. I’m not sure what the corresponding field for material is in the SESAR database (maybe classification_id and top_level_classification_id?); sampleType is sample_type_id in the database. I think we could filter out all the material = rock, and perhaps also sampleType = core, to simplify the unique values for sampleType.
- Unique values for {platform, material}.
- Unique values for primaryLocationType.
- Some way to identify meteorites and other extraterrestrial stuff, for which hasContextCategory is ‘Extraterrestrial Environment’.
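The hasContextCategory logic in the comment above can be sketched as a small rule function. This is only an illustration under several assumptions: the function name, the `is_odp_sample` flag, and the term sets are hypothetical, the string spellings ("rock", "coral", "gas", "submersible") are placeholders rather than confirmed SESAR values, and the rules are checked in the order they are listed in the comment because the thread does not settle their precedence.

```python
from typing import Optional

# Placeholder spellings (assumptions): real SESAR strings and vocabulary labels may differ.
ROCK_TERMS = frozenset({"rock"})                # extend with rock subtypes as needed
SUBMERSIBLE_TERMS = frozenset({"submersible"})  # hypothetical platformType values


def _norm(value: Optional[str]) -> str:
    """Lower-case and trim a value, treating None as an empty string."""
    return (value or "").strip().lower()


def context_category(material: Optional[str],
                     sample_type: Optional[str],
                     platform_type: Optional[str],
                     is_odp_sample: bool = False) -> Optional[str]:
    """Return a draft hasContextCategory label, or None if no rule matches."""
    mat, stype, plat = _norm(material), _norm(sample_type), _norm(platform_type)

    # Rule 1: ODP sample, rock (or rock subtype) material, or a core.
    if is_odp_sample or mat in ROCK_TERMS or stype == "core":
        return "RockBody"
    # Rule 2: coral material (the living-species check is not modelled here).
    if mat == "coral":
        return "MarineBiome"
    # Rule 3: submersible platform with rock or gas material.
    # With this ordering, rock is already caught by rule 1, so only gas reaches here.
    if plat in SUBMERSIBLE_TERMS and (mat in ROCK_TERMS or mat == "gas"):
        return "Marine water body bottom"
    return None
```

With the rules in this order, a rock sample collected by a submersible resolves to RockBody, so the precedence between the first and third rules would need to be decided before anything like this is used.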

dannymandel commented 2 years ago

The attached files show our view of the world of the following SESAR JSON-LD fields:

- sampleType
- primaryLocationType
- platformType
- material

- sesar_material_types_query.txt
- sesar_material_types.csv
- sesar_platform_types_query.txt
- sesar_platform_types.csv
- sesar_primary_location_types_query.txt
- sesar_primary_location_types.csv
- sesar_sample_types_query.txt
- sesar_sample_types.csv

dannymandel commented 2 years ago

The files were obtained by examining the JSON in our copy of SESAR's records on mars.cyverse.org.

dannymandel commented 2 years ago

copy (
  select resolved_content->'description'->'material' as material,
         count(*) as cnt
  from thing
  where authority_id = 'SESAR'
  group by material
  order by cnt desc
) to STDOUT (DELIMITER ',');

copy (
  select resolved_content->'description'->'supplementMetadata'->'platformType' as platformType,
         count(*) as cnt
  from thing
  where authority_id = 'SESAR'
  group by platformType
  order by cnt desc
) to STDOUT (DELIMITER ',');

copy (
  select resolved_content->'description'->'supplementMetadata'->'primaryLocationType' as primaryLocationType,
         count(*) as cnt
  from thing
  where authority_id = 'SESAR'
  group by primaryLocationType
  order by cnt desc
) to STDOUT (DELIMITER ',');

copy (
  select resolved_content->'description'->'sampleType' as sampleType,
         count(*) as cnt
  from thing
  where authority_id = 'SESAR'
  group by sampleType
  order by cnt desc
) to STDOUT (DELIMITER ',');

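For anyone without access to the database, the same kind of tally can be reproduced from the JSON records directly. The following is a minimal sketch, assuming each record is already parsed into a Python dict shaped like resolved_content in the queries above; the function names `get_path` and `tally` are hypothetical, and the values at the chosen key are assumed to be scalars (strings or null).

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Optional, Sequence, Tuple


def get_path(record: Mapping[str, Any], path: Sequence[str]) -> Optional[Any]:
    """Walk nested dict keys; return None if any key along the path is missing."""
    node: Any = record
    for key in path:
        if not isinstance(node, Mapping) or key not in node:
            return None
        node = node[key]
    return node


def tally(records: Iterable[Mapping[str, Any]],
          path: Sequence[str]) -> List[Tuple[Any, int]]:
    """Count the values found at `path`, most frequent first (like ORDER BY cnt DESC)."""
    return Counter(get_path(rec, path) for rec in records).most_common()
```

For example, `tally(records, ("description", "supplementMetadata", "platformType"))` would correspond to the platformType query, with records that lack the key counted under `None`.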
smrgeoinfo commented 2 years ago

Thanks @dannymandel. I created a spreadsheet with the compiled terms and their mapping to the draft controlled vocabulary. It's in the GitHub vocabulary directory (https://github.com/isamplesorg/metadata/blob/main/vocabulary/SESARVocabularyMapping.xlsx).
Many of the mappings are straightforward, but the sampled feature is tricky; we might need a list of unique combinations of location type and material type. Location type is free text, so there's a lot of different stuff in there.

This should allow us to focus on problematic stuff in tomorrow's meeting.

dannymandel commented 2 years ago

Thanks, this looks great. I can get started integrating it.

I'll see if I can get unique combinations of location type and material type, too.

dannymandel commented 2 years ago

OK, here's the distinct combo of location and material types:

sesar_location_type_material_type.csv

copy (
  -- "distinct" dropped: group by already yields one row per combination
  select resolved_content->'description'->'supplementMetadata'->'primaryLocationType' as primaryLocationType,
         resolved_content->'description'->'material' as material,
         count(*) as cnt
  from thing
  where authority_id = 'SESAR'
  group by primaryLocationType, material
  order by cnt desc
) to STDOUT (DELIMITER ',');

dannymandel commented 2 years ago

I wasn't sure if you wanted only the non-null ones, too, so here's that:

sesar_location_type_material_type_both_non_null.csv

copy (
  -- "distinct" dropped: group by already yields one row per combination
  select resolved_content->'description'->'supplementMetadata'->'primaryLocationType' as primaryLocationType,
         resolved_content->'description'->'material' as material,
         count(*) as cnt
  from thing
  where authority_id = 'SESAR'
    and resolved_content->'description'->'supplementMetadata'->'primaryLocationType' is not NULL
    and resolved_content->'description'->'material' is not NULL
  group by primaryLocationType, material
  order by cnt desc
) to STDOUT (DELIMITER ',');

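The location/material combinations, with and without the non-null filter, could also be counted outside the database. This is a rough sketch, assuming records are parsed dicts with the same description and supplementMetadata keys used in the queries; the function name `pair_counts` and its `require_both` flag are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Optional, Tuple

Pair = Tuple[Optional[Any], Optional[Any]]


def pair_counts(records: Iterable[Mapping[str, Any]],
                require_both: bool = False) -> List[Tuple[Pair, int]]:
    """Count (primaryLocationType, material) pairs, most frequent first.

    With require_both=True, pairs where either value is missing or null are skipped.
    """
    counts: Counter = Counter()
    for rec in records:
        desc = rec.get("description") or {}
        supp = desc.get("supplementMetadata") or {}
        pair = (supp.get("primaryLocationType"), desc.get("material"))
        if require_both and None in pair:
            continue
        counts[pair] += 1
    return counts.most_common()
```

One difference from the SQL is worth keeping in mind: here a missing key and an explicit JSON null both become `None` and are filtered together, whereas in PostgreSQL the `->` operator returns a JSON null value (not SQL NULL) for an explicit null, so `is not NULL` would keep those rows.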
smrgeoinfo commented 2 years ago

The (updated) mappings we discussed this morning (2021-06-09) are now in a Google spreadsheet: https://docs.google.com/spreadsheets/d/1QitBRkWH6YySZnNO-uR7D2rTaQ826WPT_xow9lPdJDM

dannymandel commented 2 years ago

Thanks @smrgeoinfo! I've been following along with it and have implemented SpecimenType and MaterialType. Currently looking at SampledFeature.