microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

Brodie Biosample names missing local names after merging DNA and RNA GOLD biosamples #340

Closed dehays closed 3 years ago

dehays commented 3 years ago

The primary issue here is that the biosample names (which originate from GOLD) are truncated in the search portal.

Eoin Brodie pointed out in a call yesterday that there is no way to differentiate between the samples in the UI, because the part of the biosample name that differs has been truncated.

For example, GOLD appears to build the biosample name by appending the local sample name to the end of the study name, e.g. "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_115"

But in most cases the Brodie biosamples display "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -" as the sample name.

A first and easier step to address this: display the entire biosample name.

In future, consider adopting a different sample naming scheme than GOLD's.

jbeezley commented 3 years ago

If your screen is wide enough the full name is shown. Maybe we need to make the name multi-line rather than truncating (with an ellipsis) when it doesn't fit?

jbeezley commented 3 years ago

Oh wait, I guess for the Brodie biosamples the name is like that in the database. I assume this is coming from upstream in the pipeline because nothing in the ingest does any truncation.

dehays commented 3 years ago

You're right @jbeezley - I see it in the Mongo documents:

{"_id": {"$oid": "602551d125261d62add15a31"},
 "name": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ",
 "description": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States",
 "lat_lon": {"has_raw_value": "38.9206 -106.9489",
             "latitude": 38.9206,
             "longitude": -106.9489},

...

I agree with making the display wrap across multiple lines so that long names are shown in full regardless of window width.

I'll move this to the ETL issues as I think that is where the truncation must be happening.

dehays commented 3 years ago

@wdduncan It appears that this biosample name truncation is happening in the ETL. The example above is truncated at 103 characters. (Maybe when you load from Oracle to your local DB?)
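As a quick check of the 103-character figure, here is a minimal sketch that measures the stored name from the Mongo document above, trailing space included. The variable names are illustrative only and are not taken from any NMDC code.

```python
# Values copied from the Mongo document shown earlier in this thread.
description = (
    "Soil microbial communities from the East River watershed "
    "near Crested Butte, Colorado, United States"
)
stored_name = description + " - "  # same text as the "name" field, trailing space included

# Length of the stored name, counting the trailing space.
name_length = len(stored_name)

# True when the stored name ends exactly with the " - " separator,
# i.e. nothing follows the study description except the separator.
ends_with_separator = stored_name.endswith(" - ")

print(name_length, ends_with_separator)
```

For this document the length is 103, and the stored name is the description followed only by the " - " separator.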

wdduncan commented 3 years ago

The JSON that I output has more data than what is shown above. Here is the JSON for one of the biosamples in Brodie's study (Gs0135149); note that the name ends with "- ER_DNA_115".

{
      "id": "gold:Gb0191643",
      "name": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_115",
      "description": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States",
      "type": "nmdc:Biosample",
      "collection_date": {
        "has_raw_value": "2017-03-07"
      },
      "lat_lon": {
        "has_raw_value": "38.9206 -106.9489",
        "latitude": 38.9206,
        "longitude": -106.9489
      }
}

The name matches what is on the GOLD portal for biosample Gb0191643 (see screenshot).

dehays commented 3 years ago

This is kinda yuck. @wdduncan - this is not truncation happening in the GOLD ETL as I had originally thought.

For the Brodie study, there are 53 biosample metadata documents in Mongo and as expected 53 biosamples on the search portal. Bill, your ETL produces nearly twice that number. This is because the RNA and DNA samples were merged into single source samples. @dwinston - the naming from that merge appears to be the shared part of the name but doesn't include the different part of the name. (Makes sense, there'd need to be a special rule to do something useful with the different parts of the names.) So two samples from GOLD, Gb0191643 and Gb0205601 with names "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_115" and "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_RNA_115", get merged to one sample igsn:IEWFS0001 with name "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -".

These look truncated but are really the result of that 2:1 biosample transformation. And this happens for each sample that has both a DNA and an RNA biosample in GOLD. (There are only three GOLD DNA biosamples for this study that had no corresponding RNA biosample, and those appear correctly in the portal: ...ER_DNA_379, ...ER_DNA_380 and ...ER_DNA_381.)

Something similar happens for the Organic Matter samples that have no corresponding GOLD biosample record; e.g. igsn:IEWFS000K, which ends up with name: "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -"
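To illustrate how keeping only the shared part of two names produces something that looks truncated, here is a minimal sketch. It assumes a word-by-word comparison; the function `shared_name` is hypothetical and is not necessarily how the actual merge was implemented.

```python
from itertools import takewhile


def shared_name(names):
    """Return the leading words that all names have in common (hypothetical rule)."""
    token_lists = [name.split() for name in names]
    # Walk the names word by word and stop at the first position where they differ.
    shared = [
        group[0]
        for group in takewhile(lambda g: len(set(g)) == 1, zip(*token_lists))
    ]
    return " ".join(shared)


STUDY = (
    "Soil microbial communities from the East River watershed "
    "near Crested Butte, Colorado, United States"
)
dna_name = STUDY + " - ER_DNA_115"  # GOLD biosample Gb0191643
rna_name = STUDY + " - ER_RNA_115"  # GOLD biosample Gb0205601

merged = shared_name([dna_name, rna_name])
print(merged)
```

Under this assumed rule the merged name keeps everything up to and including the "-" and drops the local names, which are the only part that tells the samples apart.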

Yet again - in this case, we know what happened, but there is no general solution beyond these specific samples. The local names for the samples were numbers 115 - 381. The DNA and RNA isolations got "ER_DNA_" and "ER_RNA_" prefixes for the local names for samples provided to JGI. EMSL got mostly the numbered samples - except for a set of samples used for metabolomics that had completely different naming.

I don't see any way for a transform to set appropriate sample names except as a one-off that has knowledge of the local naming schemes.
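A one-off of that kind might look like the sketch below. It is only an illustration under stated assumptions: the helper names, the regular expression, and the choice of "study name - sample number" as the merged name are all hypothetical, and it only covers the "ER_DNA_" / "ER_RNA_" scheme described above, not the EMSL metabolomics names.

```python
import re

# Assumed local-name prefixes for the DNA and RNA isolations sent to JGI.
LOCAL_PREFIX = re.compile(r"^ER_(?:DNA|RNA)_")


def split_gold_name(name):
    """Split a GOLD biosample name into (study name, local name) at the last ' - '."""
    study, sep, local = name.rpartition(" - ")
    if not sep:
        # No separator found: treat the whole string as the study name.
        return name, ""
    return study, local


def merged_sample_name(gold_names):
    """Build one name for a merged sample from its GOLD biosample names."""
    studies = set()
    sample_numbers = set()
    for name in gold_names:
        study, local = split_gold_name(name)
        studies.add(study)
        sample_numbers.add(LOCAL_PREFIX.sub("", local))
    if len(studies) != 1 or len(sample_numbers) != 1:
        raise ValueError("names do not describe the same source sample")
    return f"{studies.pop()} - {sample_numbers.pop()}"


STUDY = (
    "Soil microbial communities from the East River watershed "
    "near Crested Butte, Colorado, United States"
)
name_115 = merged_sample_name([STUDY + " - ER_DNA_115", STUDY + " - ER_RNA_115"])
print(name_115)
```

Raising an error when the names disagree keeps the one-off from silently inventing a name for samples that follow a different local scheme.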

jbeezley commented 3 years ago

Some do have that extra text, some don't. These are the entities in question:

nmdc> select id, name from biosample where name like '%- '                                                                     
+----------------+---------------------------------------------------------------------------------------------------------+
| id             | name                                                                                                    |
|----------------+---------------------------------------------------------------------------------------------------------|
| igsn:IEWFS0001 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0002 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0003 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0004 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0005 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0006 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0007 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0008 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0009 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000C | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000D | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000E | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000F | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000G | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000H | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000L | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000M | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000N | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000O | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000P | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000Q | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000R | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000S | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000T | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000U | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000V | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000W | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000X | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000Y | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000Z | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0010 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0011 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0012 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0013 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0014 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0015 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0016 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0017 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0018 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0019 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001A | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001B | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001C | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001D | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001E | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000I | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000K | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000B | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000A | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000J | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
+----------------+---------------------------------------------------------------------------------------------------------+

dehays commented 3 years ago

@jbeezley See my explanation - each merged RNA and DNA GOLD biosample and each EMSL-only sample ends up looking truncated.
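The same check as the query above can be run on biosample documents outside the database. This is a minimal sketch; the function name and the small in-memory list are assumptions for illustration, and unlike the SQL pattern it also tolerates a missing trailing space.

```python
def has_dangling_separator(name):
    """True if the name ends with the ' -' separator and no local name after it."""
    return name.rstrip().endswith(" -")


STUDY = (
    "Soil microbial communities from the East River watershed "
    "near Crested Butte, Colorado, United States"
)

# Illustrative records only; ids and names are taken from this thread.
biosamples = [
    {"id": "igsn:IEWFS0001", "name": STUDY + " - "},
    {"id": "igsn:IEWFS000K", "name": STUDY + " -"},
    {"id": "gold:Gb0191643", "name": STUDY + " - ER_DNA_115"},
]

affected_ids = [b["id"] for b in biosamples if has_dangling_separator(b["name"])]
print(affected_ids)
```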

wdduncan commented 3 years ago

@dehays glad to know it is not a GOLD ETL issue. But, I'm not sure what the right approach to take is.

dehays commented 3 years ago

@dwinston You addressed this in the changes you made while meeting with Bill and me last Thursday. Can you close this with the PR for those changes?

dwinston commented 3 years ago

documented in microbiomedata/nmdc-runtime/metadata-translation/notebooks/202106_curation_updates.ipynb