Closed ssarrafan closed 2 years ago
@dwinston Bill wanted you to be aware of this GitHub issue
@wdduncan - do you intend to use the same mapping file: https://github.com/microbiomedata/nmdc-schema/blob/main/sssom/gold-to-mixs.sssom.tsv
It looks like the fields in the API JSON payload are the same as the fields in the gold database, with camelCasing applied
e.g. https://github.com/microbiomedata/nmdc-schema/blob/main/sssom/gold-to-mixs.sssom.tsv#L14
subject | subject label | predicate | object | object label | type |
---|---|---|---|---|---|
gold.vocab:salinity_concentration | salinity_concentration | skos:exactMatch | mixs:salinity | salinity | SSSOMC:HumanCurated |
so you could reuse the same table and apply a deterministic camelCase function to go from salinity_concentration (GOLD SQL) -> salinityConcentration (GOLD API) -> salinity (MIxS)?
(can someone tag Reddy to confirm, is he not in this org?)
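A minimal sketch of the deterministic camelCase step suggested above, assuming the GOLD API names are plain lowerCamelCase versions of the GOLD SQL column names (this is the part that still needs confirming). The function name `snake_to_camel` is hypothetical.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case name to lowerCamelCase.

    Names without underscores (e.g. "depth") are returned unchanged.
    """
    head, *rest = name.split("_")
    # Capitalize only the first character of each later part.
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


print(snake_to_camel("salinity_concentration"))  # salinityConcentration
print(snake_to_camel("depth"))  # depth
```

If the API turns out to deviate from this rule for some fields, those exceptions would need an explicit override table.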
Let's timebox 1-3 hours for this experiment and check in before getting in too deep
@cmungall I will have to either create a new sssom file or add the camel case fields to the current one. I lean towards the latter, but can go with the former too.
The latter won't work as sssom is for pairwise mappings only
if you do go with the former, should it not be automatically generated? Do you even need the mapping file, and instead do the mapping to camelcase on the fly?
Since the camel case labels will be different can't I just add them to the sssom file? E.g:
subject | subject label | predicate | object | object label | type |
---|---|---|---|---|---|
gold.vocab:salinityConcentration | salinityConcentration | skos:exactMatch | mixs:salinity | salinity | SSSOMC:HumanCurated |
In cases in which there is not any camel casing (e.g., depth, altitude) the mapping to the mixs term would just remain the same.
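One way to do the mapping on the fly, rather than maintaining a second file, is to derive an API-field-to-MIxS lookup from the existing SSSOM rows at load time. This is only a sketch: the column names `subject_id` and `object_id` are assumptions about the TSV header, `api_field_to_mixs` is a hypothetical helper, and the `depth` row in the sample is illustrative rather than copied from the file.

```python
import csv


def snake_to_camel(name):
    head, *rest = name.split("_")
    return head + "".join(p[:1].upper() + p[1:] for p in rest)


def api_field_to_mixs(tsv_text, subject_col="subject_id", object_col="object_id"):
    """Build {GOLD API field name: MIxS term} from SSSOM TSV text."""
    # Drop blank lines and the '#'-prefixed metadata lines some SSSOM files carry.
    lines = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("#")]
    lookup = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        # "gold.vocab:salinity_concentration" -> "salinity_concentration"
        gold_field = row[subject_col].split(":", 1)[-1]
        lookup[snake_to_camel(gold_field)] = row[object_col]
    return lookup


SAMPLE = (
    "subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\n"
    "gold.vocab:salinity_concentration\tsalinity_concentration\t"
    "skos:exactMatch\tmixs:salinity\tsalinity\n"
    "gold.vocab:depth\tdepth\tskos:exactMatch\tmixs:depth\tdepth\n"
)

lookup = api_field_to_mixs(SAMPLE)
print(lookup["salinityConcentration"])  # mixs:salinity
```

The sketch keeps every row regardless of predicate; filtering to `skos:exactMatch` would be a small addition if looser matches should be excluded.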
@cmungall and @wdduncan should this issue be closed, moved to the backlog or moved to the October sprint?
@ssarrafan I do not know if the GOLD API is ready for me to start using in earnest yet. If it is, then this can go on the October sprint. What do you think @cmungall ?
I'll go ahead and move this to the October sprint but please let me know if it should be in the backlog or assigned to someone else. @wdduncan @emileyfadrosh @dehays
I spoke to @wdduncan earlier today
I would also like us to be working against a schema or formal data dictionary from the gold api. Ideally this would come from gold e.g. in swagger/openAPI, but we can use some of our linkml tooling to infer the schema (the linkml-model-enrichment package @turbomam works on)
For example, here is a partial schema induced from the neon samples
curl -u USERNAME:PASSWORD "https://gold.jgi.doe.gov/rest/nmdc/biosamples?studyGoldId=Gs0144570" | jq . > example.json
jsondata2linkml example.json > example_schema.yaml
yields:
id: https://w3id.org/EnvoBroadScale
name: EnvoBroadScale
description: EnvoBroadScale
imports:
- linkml:types
prefixes:
  linkml: https://w3id.org/linkml/
  EnvoBroadScale: https://w3id.org/EnvoBroadScale
default_prefix: EnvoBroadScale
types:
  USA identifier:
    typeof: string
classes:
  EnvoBroadScale:
    slots:
    - id
    - label
    slot_usage: {}
  EnvoLocalScale:
    slots:
    - id
    - label
    slot_usage: {}
  EnvoMedium:
    slots:
    - id
    - label
    slot_usage: {}
  Contacts:
    slots:
    - name
    - email
    - jgiSsoId
    - roles
    slot_usage: {}
  Biosample:
    slots:
    - biosampleGoldId
    - biosampleName
    - sampleCollectionSite
    - geographicLocation
    - latitude
    - longitude
    - ecosystemPathId
    - ecosystem
    - ecosystemCategory
    - ecosystemType
    - ecosystemSubtype
    - specificEcosystem
    - description
    - hostDiseases
    - geoLocation
    - habitat
    - isoCountry
    - mixsPackage
    - envoBroadScale
    - envoLocalScale
    - envoMedium
    - addDate
    - contacts
    - modDate
    slot_usage: {}
slots:
  id:
    range: id_enum
    examples:
    - value: ENVO_00000446
  label:
    range: label_enum
    examples:
    - value: terrestrial biome
  name:
    range: name_enum
    examples:
    - value: Supratim Mukherjee
  email:
    range: email_enum
    examples:
    - value: supratimmukherjee@lbl.gov
  jgiSsoId:
    range: integer
    examples:
    - value: '5631'
  roles:
    range: roles_enum
    examples:
    - value:
      - submitter
  biosampleGoldId:
    range: string
    examples:
    - value: Gb0255604
  biosampleName:
    range: string
    examples:
    - value: Core terrestrial soil microbial communities from Central Plains Experimental
        Range, Central Plains, CO, USA - CPER_001-M-20140715-COMP-DNA1
  sampleCollectionSite:
    range: sampleCollectionSite_enum
    examples:
    - value: Soil
  geographicLocation:
    range: USA identifier
    examples:
    - value: 'USA: Central Plains Experimental Range, Central Plains, CO'
  latitude:
    range: float
    examples:
    - value: 40.81553
  longitude:
    range: float
    examples:
    - value: -104.7456
  ecosystemPathId:
    range: integer
    examples:
    - value: 4212
  ecosystem:
    range: ecosystem_enum
    examples:
    - value: Environmental
  ecosystemCategory:
    range: ecosystemCategory_enum
    examples:
    - value: Terrestrial
  ecosystemType:
    range: ecosystemType_enum
    examples:
    - value: Soil
  ecosystemSubtype:
    range: ecosystemSubtype_enum
    examples:
    - value: Unclassified
  specificEcosystem:
    range: specificEcosystem_enum
    examples:
    - value: Unclassified
  description:
    range: string
    examples:
    - value: Core terrestrial soil microbial communities from Central Plains Experimental
        Range, Central Plains, CO, USA
  hostDiseases:
    range: string
    examples:
    - value: []
  geoLocation:
    range: USA identifier
    examples:
    - value: 'USA: Central Plains Experimental Range, Central Plains, CO'
  habitat:
    range: habitat_enum
    examples:
    - value: Core terrestrial soil
  isoCountry:
    range: isoCountry_enum
    examples:
    - value: USA
  mixsPackage:
    range: mixsPackage_enum
    examples:
    - value: Standard
  envoBroadScale:
    range: string
  envoLocalScale:
    range: string
  envoMedium:
    range: string
  addDate:
    range: datetime
    examples:
    - value: '2020-01-27'
  contacts:
    range: Contacts
    examples:
    - value:
      - $ref:Contacts
      - $ref:Contacts
    multivalued: true
  modDate:
    range: datetime
    examples:
    - value: '2020-01-27'
enums:
  id_enum:
    permissible_values:
      ENVO_00000446:
        description: ENVO_00000446
      ENVO_01000177:
        description: ENVO_01000177
      ENVO_01000179:
        description: ENVO_01000179
      ENVO_01000180:
        description: ENVO_01000180
      ENVO_01000174:
        description: ENVO_01000174
      ENVO_01000198:
        description: ENVO_01000198
  label_enum:
    permissible_values:
      tundra biome:
        description: tundra biome
      forest biome:
        description: forest biome
      terrestrial biome:
        description: terrestrial biome
      grassland biome:
        description: grassland biome
      desert biome:
        description: desert biome
      mixed forest biome:
        description: mixed forest biome
  name_enum:
    permissible_values:
      Russell Neches:
        description: Russell Neches
      Janet Jansson:
        description: Janet Jansson
      Ruonan Wu:
        description: Ruonan Wu
      Emily Graham:
        description: Emily Graham
      Supratim Mukherjee:
        description: Supratim Mukherjee
  email_enum:
    permissible_values:
      ...
  roles_enum:
    permissible_values:
      submitter:
        description: submitter
      other:
        description: other
  sampleCollectionSite_enum:
    permissible_values:
      Soil:
        description: Soil
      Forest soil:
        description: Forest soil
  ecosystem_enum:
    permissible_values:
      Environmental:
        description: Environmental
  ecosystemCategory_enum:
    permissible_values:
      Terrestrial:
        description: Terrestrial
  ecosystemType_enum:
    permissible_values:
      Soil:
        description: Soil
  ecosystemSubtype_enum:
    permissible_values:
      Unclassified:
        description: Unclassified
      Temperate forest:
        description: Temperate forest
  specificEcosystem_enum:
    permissible_values:
      Farm:
        description: Farm
      Unclassified:
        description: Unclassified
      Desert:
        description: Desert
      Bulk soil:
        description: Bulk soil
  habitat_enum:
    permissible_values:
      Mixed forest soil:
        description: Mixed forest soil
      Core terrestrial soil:
        description: Core terrestrial soil
      Relocatable terrestrial soil:
        description: Relocatable terrestrial soil
  isoCountry_enum:
    permissible_values:
      USA:
        description: USA
  mixsPackage_enum:
    permissible_values:
      Standard:
        description: Standard
Checked in with @emileyfadrosh and she suggested we close this issue. @wdduncan and @cmungall I will close it but let me know if you'd like it reopened for any reason.
Bill will experiment with the GOLD API by pulling Spruce data through it and comparing that data to what's in the Mongo DB for validation.
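The comparison step could look roughly like the sketch below. It assumes both sides have already been loaded as lists of dicts and normalized to the same field names (the Mongo documents may well use different names, so a renaming pass would come first). `compare_records` is a hypothetical helper, and `Gb0000001` is a made-up placeholder ID.

```python
def compare_records(api_records, mongo_records, key="biosampleGoldId"):
    """Report records missing on either side and per-field differences."""
    api_by_id = {r[key]: r for r in api_records}
    mongo_by_id = {r[key]: r for r in mongo_records}
    report = {
        "only_in_api": sorted(api_by_id.keys() - mongo_by_id.keys()),
        "only_in_mongo": sorted(mongo_by_id.keys() - api_by_id.keys()),
        "field_mismatches": {},
    }
    for rec_id in sorted(api_by_id.keys() & mongo_by_id.keys()):
        a, m = api_by_id[rec_id], mongo_by_id[rec_id]
        # Compare every field present on either side; a missing field reads as None.
        diffs = {
            f: (a.get(f), m.get(f))
            for f in sorted(a.keys() | m.keys())
            if a.get(f) != m.get(f)
        }
        if diffs:
            report["field_mismatches"][rec_id] = diffs
    return report


api = [
    {"biosampleGoldId": "Gb0255604", "ecosystemType": "Soil"},
    {"biosampleGoldId": "Gb0000001", "ecosystemType": "Soil"},  # placeholder ID
]
mongo = [{"biosampleGoldId": "Gb0255604", "ecosystemType": "Sediment"}]

report = compare_records(api, mongo)
print(report)
```

An empty `field_mismatches` together with empty `only_in_*` lists would indicate the two sources agree for the records checked.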