microbiomedata / issues

public repo for issues related to NMDC work

Ensure that soil biosamples have two depth values (minimum and maximum) #17

Open turbomam opened 1 year ago

turbomam commented 1 year ago

Below you will see a partially-migrated Biosample, which @mslarae13 identified as a soil biosample, that lacks a has_minimum_numeric_value.

Is there one slot here whose value is always indicative of a soil biosample and never indicative of some other kind of biosample? I don't think it would be appropriate to blindly do a free-text search across the whole document and say that, if the string "soil" appears anywhere, then it is a soil biosample.

Let's consider {"ecosystem_type": "Soil"}. That query returns 548 biosamples out of the 753.

But this query shows that the remaining ecosystem_type values don't map to any other MIxS environmental packages.

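// Count biosamples per distinct ecosystem_type value (count() is deprecated; see the warning below).
db["biosample_set"].distinct("ecosystem_type").forEach(function (value) {
  print("ecosystem_type" + ", " + value + ": " + db["biosample_set"].count({ ecosystem_type: value }));
});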

DeprecationWarning: Collection.count() is deprecated. Use countDocuments or estimatedDocumentCount.
ecosystem_type, Chemical products: 2
ecosystem_type, Deep subsurface: 17
ecosystem_type, Freshwater: 37
ecosystem_type, Roots: 128
ecosystem_type, Sand microcosm: 17
ecosystem_type, Soil: 548
ecosystem_type, Volcanic: 4
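For anyone who wants to reproduce this kind of per-value tally without a database connection, here is a minimal sketch in plain JavaScript. It assumes the biosamples have already been exported to an array of objects. The `tallyBySlot` helper and the three `ex:` documents are hypothetical and for illustration only; they are not NMDC data or an NMDC API. Inside the mongo shell itself, the warning above names `countDocuments` as the replacement for `count`.

```javascript
// Tally documents by the value of a single slot, in plain JavaScript.
// Hypothetical helper, not part of any NMDC or MongoDB API.
function tallyBySlot(docs, slot) {
  const counts = new Map();
  for (const doc of docs) {
    const value = doc[slot];
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  return counts;
}

// Hypothetical miniature stand-in for biosample_set (not real NMDC data).
const sampleDocs = [
  { id: "ex:1", ecosystem_type: "Soil" },
  { id: "ex:2", ecosystem_type: "Soil" },
  { id: "ex:3", ecosystem_type: "Freshwater" },
];

const tally = tallyBySlot(sampleDocs, "ecosystem_type");

// Print one line per distinct value, sorted by value, in the same shape as the output above.
const sortedEntries = [...tally].sort(([a], [b]) => String(a).localeCompare(String(b)));
for (const [value, count] of sortedEntries) {
  console.log("ecosystem_type, " + value + ": " + count);
}
```

A single pass over the documents avoids issuing one count query per distinct value, which is the main design difference from the shell one-liner.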

The Biosample

{
  "_id": {
    "$oid": "634859eb604457c085089d23"
  },
  "id": "nmdc:a6c3b0d2-6f2e-4715-84d3-666719d1f54b",
  "name": "Bulk soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_115",
  "description": "Bulk soil microbial communities from the East River watershed near Crested Butte, Colorado, United States",
  "env_broad_scale": {
    "has_raw_value": "ENVO:00000108",
    "term": {
      "id": "ENVO:00000108"
    }
  },
  "env_local_scale": {
    "has_raw_value": "ENVO:00000292",
    "term": {
      "id": "ENVO:00000292"
    }
  },
  "env_medium": {
    "has_raw_value": "ENVO:00005802",
    "term": {
      "id": "ENVO:00005802"
    }
  },
  "type": "nmdc:Biosample",
  "collection_date": {
    "has_raw_value": "2017-03-07"
  },
  "depth": {
    "has_raw_value": "0.0",
    "has_numeric_value": 0,
    "has_unit": "meter",
    "has_maximum_numeric_value": 0.05
  },
  "geo_loc_name": {
    "has_raw_value": "USA: Colorado"
  },
  "lat_lon": {
    "has_raw_value": "38.917216053 -106.9559947",
    "latitude": 38.917216053,
    "longitude": -106.9559947
  },
  "ecosystem": "Environmental",
  "ecosystem_category": "Terrestrial",
  "ecosystem_type": "Soil",
  "ecosystem_subtype": "Meadow",
  "specific_ecosystem": "Bulk soil",
  "add_date": "2018-06-22",
  "community": "microbial communities",
  "habitat": "bulk soil",
  "location": "The East River watershed near Crested Butte, Colorado, USA",
  "mod_date": "2021-06-15",
  "ncbi_taxonomy_name": "soil metagenome",
  "sample_collection_site": "soil",
  "alternative_identifiers": [
    "gold:Gb0191643",
    "gold:Gb0205601",
    "img.taxon:3300042813"
  ],
  "gold_sample_identifiers": [
    "gold:Gb0191643"
  ],
  "insdc_biosample_identifiers": [
    "biosample:SAMN10864388"
  ],
  "sample_link": [
    "gold:Gs0135149"
  ],
  "samp_name": "ER_115",
  "igsn_biosample_identifiers": [
    "igsn:IEWFS0001"
  ]
}
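
The check this issue asks for could be sketched roughly as follows. This is only an illustration under two assumptions: that `ecosystem_type` equal to "Soil" is an acceptable indicator of a soil biosample (the question raised above), and that both `has_minimum_numeric_value` and `has_maximum_numeric_value` are expected inside `depth`. The function names are hypothetical, and the object is an abridged copy of the biosample above.

```javascript
// Depth slots a soil biosample is expected to carry (slot names taken from this issue).
const REQUIRED_DEPTH_SLOTS = ["has_minimum_numeric_value", "has_maximum_numeric_value"];

// Assumption: ecosystem_type === "Soil" identifies a soil biosample.
function isSoilBiosample(biosample) {
  return biosample.ecosystem_type === "Soil";
}

// Return the names of required depth slots that are absent or not numbers.
// typeof is used instead of a truthiness test because 0 is a valid depth.
function missingDepthSlots(biosample) {
  const depth = biosample.depth || {};
  return REQUIRED_DEPTH_SLOTS.filter((slot) => typeof depth[slot] !== "number");
}

// Abridged copy of the biosample shown above.
const er115 = {
  id: "nmdc:a6c3b0d2-6f2e-4715-84d3-666719d1f54b",
  ecosystem_type: "Soil",
  depth: {
    has_raw_value: "0.0",
    has_numeric_value: 0,
    has_unit: "meter",
    has_maximum_numeric_value: 0.05,
  },
};

const missing = isSoilBiosample(er115) ? missingDepthSlots(er115) : [];
console.log(er115.id, "missing depth slots:", missing);
```

For this document the only slot reported as missing is has_minimum_numeric_value, which matches the problem described at the top of the issue.
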
mslarae13 commented 1 year ago

After discussion during the squad sync, I realized that "has_value" is the minimum value and depth2 is the maximum value.

The metadata is there, but it needs to be stated more clearly as "has minimum value".
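
If that reading is right, making the minimum explicit could look something like the sketch below. It assumes that "has_value" corresponds to the `has_numeric_value` slot in the document above, and that the existing slots should be left in place rather than removed. Neither assumption is confirmed in this thread, and `fillMinimumDepth` is a hypothetical name, not an existing migration function.

```javascript
// Return a copy of a depth value with has_minimum_numeric_value filled in from
// has_numeric_value, but only when a maximum is present and no minimum is set yet.
// Assumption: has_numeric_value was being used to hold the minimum depth.
function fillMinimumDepth(depth) {
  const hasMin = typeof depth.has_minimum_numeric_value === "number";
  const hasMax = typeof depth.has_maximum_numeric_value === "number";
  const hasValue = typeof depth.has_numeric_value === "number";
  if (hasMin || !hasMax || !hasValue) {
    return { ...depth }; // nothing to infer; return an unchanged copy
  }
  if (depth.has_numeric_value > depth.has_maximum_numeric_value) {
    throw new RangeError("has_numeric_value exceeds has_maximum_numeric_value");
  }
  return { ...depth, has_minimum_numeric_value: depth.has_numeric_value };
}

// Depth value copied from the biosample above.
const originalDepth = {
  has_raw_value: "0.0",
  has_numeric_value: 0,
  has_unit: "meter",
  has_maximum_numeric_value: 0.05,
};

const migratedDepth = fillMinimumDepth(originalDepth);
console.log(migratedDepth);
```

The function returns a copy instead of mutating its input, so the original record stays available for comparison while the migration is reviewed.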

mslarae13 commented 3 days ago

@turbomam I think this is related to this. From the API, I see

results__depth__has_raw_value

And we WILL be keeping these 3 depth fields, yes?

Is this issue resolved? Is depth2 gone? It looks like it is gone from the schema, but I am still seeing it in... runtime? and documentation?