microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

GOLD API - Experiment #136

Closed ssarrafan closed 2 years ago

ssarrafan commented 3 years ago

Bill will experiment with the GOLD API: query it for the Spruce data, then compare the results against what is in MongoDB for validation.
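The comparison step could be sketched roughly as below. This is only an illustration: the record contents, the stored-side field names (`id`, `lat`), and the `field_map` are hypothetical placeholders, not the actual GOLD API payload or the Mongo document layout.

```python
def diff_records(api_record: dict, stored_record: dict, field_map: dict) -> dict:
    """Return {api_field: (api_value, stored_value)} for each mapped field that differs."""
    mismatches = {}
    for api_field, stored_field in field_map.items():
        api_value = api_record.get(api_field)
        stored_value = stored_record.get(stored_field)
        if api_value != stored_value:
            mismatches[api_field] = (api_value, stored_value)
    return mismatches


# Hypothetical data: one record as it might come back from the API,
# and the corresponding document as it might be stored in Mongo.
api_record = {"biosampleGoldId": "Gb0000001", "latitude": 40.5}
stored_record = {"id": "Gb0000001", "lat": 40.6}

# Hypothetical mapping from API field names to stored field names.
field_map = {"biosampleGoldId": "id", "latitude": "lat"}

print(diff_records(api_record, stored_record, field_map))
```

An empty result would mean the mapped fields agree; anything else lists the fields to investigate.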

ssarrafan commented 3 years ago

@dwinston Bill wanted you to be aware of this GitHub issue

cmungall commented 3 years ago

@wdduncan - do you intend to use the same mapping file: https://github.com/microbiomedata/nmdc-schema/blob/main/sssom/gold-to-mixs.sssom.tsv

It looks like the fields in the API JSON payload are the same as the fields in the GOLD database, with camelCasing applied

e.g. https://github.com/microbiomedata/nmdc-schema/blob/main/sssom/gold-to-mixs.sssom.tsv#L14

| subject | subject label | predicate | object | object label | type |
| --- | --- | --- | --- | --- | --- |
| gold.vocab:salinity_concentration | salinity_concentration | skos:exactMatch | mixs:salinity | salinity | SSSOMC:HumanCurated |

so you could reuse the same table and apply a deterministic camelCase function to go from salinity_concentration (GOLD SQL) -> salinityConcentration (GOLD API) -> salinity (MIxS)?
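A minimal sketch of such a deterministic camelCase function, assuming the API names are plain lowerCamelCase versions of the SQL column names (this has not been confirmed against GOLD). The function name `snake_to_camel` is made up for illustration.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case field name to lowerCamelCase."""
    head, *rest = name.split("_")
    # Upper-case the first letter of every part after the first one.
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


print(snake_to_camel("salinity_concentration"))  # salinityConcentration
print(snake_to_camel("depth"))                   # depth
```

Names without underscores pass through unchanged, so fields such as depth would need no special handling.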

(can someone tag Reddy to confirm, is he not in this org?)

cmungall commented 3 years ago

Let's timebox 1-3 hours for this experiment and check in before getting in too deep

wdduncan commented 3 years ago

@cmungall I will have to either create a new sssom file or add the camel case fields to the current one. I lean towards the latter, but can go with the former too.

cmungall commented 3 years ago

The latter won't work as sssom is for pairwise mappings only

if you do go with the former, should it not be automatically generated? Do you even need the mapping file, and instead do the mapping to camelcase on the fly?
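Doing the mapping on the fly might look something like this sketch. It assumes the column headers are exactly as shown in the row quoted earlier in this thread (the real file's headers may differ) and that lowerCamelCase of the subject label gives the API field name. `snake_to_camel` and `api_field_to_mixs` are hypothetical helper names.

```python
import csv
import io

# One row in the shape quoted in this thread; headers are an assumption.
SSSOM_TSV = (
    "subject\tsubject label\tpredicate\tobject\tobject label\ttype\n"
    "gold.vocab:salinity_concentration\tsalinity_concentration\tskos:exactMatch\t"
    "mixs:salinity\tsalinity\tSSSOMC:HumanCurated\n"
)


def snake_to_camel(name: str) -> str:
    """Convert a snake_case field name to lowerCamelCase."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def api_field_to_mixs(tsv_text: str) -> dict:
    """Build {camelCase API field: MIxS term} from the existing GOLD SQL mapping rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {snake_to_camel(row["subject label"]): row["object"] for row in reader}


lookup = api_field_to_mixs(SSSOM_TSV)
print(lookup)  # {'salinityConcentration': 'mixs:salinity'}
```

With this approach the curated file stays the single source of truth and no second mapping file has to be maintained.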

wdduncan commented 3 years ago

Since the camel case labels will be different can't I just add them to the sssom file? E.g:

| subject | subject label | predicate | object | object label | type |
| --- | --- | --- | --- | --- | --- |
| gold.vocab:salinityConcentration | salinityConcentration | skos:exactMatch | mixs:salinity | salinity | SSSOMC:HumanCurated |

In cases in which there is not any camel casing (e.g., depth, altitude) the mapping to the mixs term would just remain the same.

ssarrafan commented 2 years ago

@cmungall and @wdduncan should this issue be closed, moved to the backlog or moved to the October sprint?

wdduncan commented 2 years ago

@ssarrafan I do not know if the GOLD API is ready for me to start using in earnest yet. If it is, then this can go on the October sprint. What do you think @cmungall ?

ssarrafan commented 2 years ago

I'll go ahead and move this to the October sprint but please let me know if it should be in the backlog or assigned to someone else. @wdduncan @emileyfadrosh @dehays

cmungall commented 2 years ago

I spoke to @wdduncan earlier today

I would also like us to be working against a schema or formal data dictionary for the GOLD API. Ideally this would come from GOLD, e.g. as Swagger/OpenAPI, but we can use some of our LinkML tooling to infer the schema (the linkml-model-enrichment package @turbomam works on)

For example, here is a partial schema induced from the NEON samples

curl -u USERNAME:PASSWORD 'https://gold.jgi.doe.gov/rest/nmdc/biosamples?studyGoldId=Gs0144570' | jq . > example.json

jsondata2linkml example.json > example_schema.yaml

yields:

id: https://w3id.org/EnvoBroadScale
name: EnvoBroadScale
description: EnvoBroadScale
imports:
- linkml:types
prefixes:
  linkml: https://w3id.org/linkml/
  EnvoBroadScale: https://w3id.org/EnvoBroadScale
default_prefix: EnvoBroadScale
types:
  USA identifier:
    typeof: string
classes:
  EnvoBroadScale:
    slots:
    - id
    - label
    slot_usage: {}
  EnvoLocalScale:
    slots:
    - id
    - label
    slot_usage: {}
  EnvoMedium:
    slots:
    - id
    - label
    slot_usage: {}
  Contacts:
    slots:
    - name
    - email
    - jgiSsoId
    - roles
    slot_usage: {}
  Biosample:
    slots:
    - biosampleGoldId
    - biosampleName
    - sampleCollectionSite
    - geographicLocation
    - latitude
    - longitude
    - ecosystemPathId
    - ecosystem
    - ecosystemCategory
    - ecosystemType
    - ecosystemSubtype
    - specificEcosystem
    - description
    - hostDiseases
    - geoLocation
    - habitat
    - isoCountry
    - mixsPackage
    - envoBroadScale
    - envoLocalScale
    - envoMedium
    - addDate
    - contacts
    - modDate
    slot_usage: {}
slots:
  id:
    range: id_enum
    examples:
    - value: ENVO_00000446
  label:
    range: label_enum
    examples:
    - value: terrestrial biome
  name:
    range: name_enum
    examples:
    - value: Supratim Mukherjee
  email:
    range: email_enum
    examples:
    - value: supratimmukherjee@lbl.gov
  jgiSsoId:
    range: integer
    examples:
    - value: '5631'
  roles:
    range: roles_enum
    examples:
    - value:
      - submitter
  biosampleGoldId:
    range: string
    examples:
    - value: Gb0255604
  biosampleName:
    range: string
    examples:
    - value: Core terrestrial soil microbial communities from Central Plains Experimental
        Range, Central Plains, CO, USA - CPER_001-M-20140715-COMP-DNA1
  sampleCollectionSite:
    range: sampleCollectionSite_enum
    examples:
    - value: Soil
  geographicLocation:
    range: USA identifier
    examples:
    - value: 'USA: Central Plains Experimental Range, Central Plains, CO'
  latitude:
    range: float
    examples:
    - value: 40.81553
  longitude:
    range: float
    examples:
    - value: -104.7456
  ecosystemPathId:
    range: integer
    examples:
    - value: 4212
  ecosystem:
    range: ecosystem_enum
    examples:
    - value: Environmental
  ecosystemCategory:
    range: ecosystemCategory_enum
    examples:
    - value: Terrestrial
  ecosystemType:
    range: ecosystemType_enum
    examples:
    - value: Soil
  ecosystemSubtype:
    range: ecosystemSubtype_enum
    examples:
    - value: Unclassified
  specificEcosystem:
    range: specificEcosystem_enum
    examples:
    - value: Unclassified
  description:
    range: string
    examples:
    - value: Core terrestrial soil microbial communities from Central Plains Experimental
        Range, Central Plains, CO, USA
  hostDiseases:
    range: string
    examples:
    - value: []
  geoLocation:
    range: USA identifier
    examples:
    - value: 'USA: Central Plains Experimental Range, Central Plains, CO'
  habitat:
    range: habitat_enum
    examples:
    - value: Core terrestrial soil
  isoCountry:
    range: isoCountry_enum
    examples:
    - value: USA
  mixsPackage:
    range: mixsPackage_enum
    examples:
    - value: Standard
  envoBroadScale:
    range: string
  envoLocalScale:
    range: string
  envoMedium:
    range: string
  addDate:
    range: datetime
    examples:
    - value: '2020-01-27'
  contacts:
    range: Contacts
    examples:
    - value:
      - $ref:Contacts
      - $ref:Contacts
    multivalued: true
  modDate:
    range: datetime
    examples:
    - value: '2020-01-27'
enums:
  id_enum:
    permissible_values:
      ENVO_00000446:
        description: ENVO_00000446
      ENVO_01000177:
        description: ENVO_01000177
      ENVO_01000179:
        description: ENVO_01000179
      ENVO_01000180:
        description: ENVO_01000180
      ENVO_01000174:
        description: ENVO_01000174
      ENVO_01000198:
        description: ENVO_01000198
  label_enum:
    permissible_values:
      tundra biome:
        description: tundra biome
      forest biome:
        description: forest biome
      terrestrial biome:
        description: terrestrial biome
      grassland biome:
        description: grassland biome
      desert biome:
        description: desert biome
      mixed forest biome:
        description: mixed forest biome
  name_enum:
    permissible_values:
      Russell Neches:
        description: Russell Neches
      Janet Jansson:
        description: Janet Jansson
      Ruonan Wu:
        description: Ruonan Wu
      Emily Graham:
        description: Emily Graham
      Supratim Mukherjee:
        description: Supratim Mukherjee
  email_enum:
    permissible_values:
      ...
  roles_enum:
    permissible_values:
      submitter:
        description: submitter
      other:
        description: other
  sampleCollectionSite_enum:
    permissible_values:
      Soil:
        description: Soil
      Forest soil:
        description: Forest soil
  ecosystem_enum:
    permissible_values:
      Environmental:
        description: Environmental
  ecosystemCategory_enum:
    permissible_values:
      Terrestrial:
        description: Terrestrial
  ecosystemType_enum:
    permissible_values:
      Soil:
        description: Soil
  ecosystemSubtype_enum:
    permissible_values:
      Unclassified:
        description: Unclassified
      Temperate forest:
        description: Temperate forest
  specificEcosystem_enum:
    permissible_values:
      Farm:
        description: Farm
      Unclassified:
        description: Unclassified
      Desert:
        description: Desert
      Bulk soil:
        description: Bulk soil
  habitat_enum:
    permissible_values:
      Mixed forest soil:
        description: Mixed forest soil
      Core terrestrial soil:
        description: Core terrestrial soil
      Relocatable terrestrial soil:
        description: Relocatable terrestrial soil
  isoCountry_enum:
    permissible_values:
      USA:
        description: USA
  mixsPackage_enum:
    permissible_values:
      Standard:
        description: Standard

ssarrafan commented 2 years ago

Checked in with @emileyfadrosh and she suggested we close this issue. @wdduncan and @cmungall I will close it but let me know if you'd like it reopened for any reason.