ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Include more metadata from genbank files in virus reports, e.g. `/note` and `/strain` #338

Closed by corneliusroemer 7 months ago

corneliusroemer commented 8 months ago

Quite frequently, valuable metadata is contained in the GenBank qualifier `/note`.

Unfortunately, this field seems to get lost on the way to `datasets download virus genome`.

Compare the metadata available in the `source` feature of the GenBank file:

FEATURES             Location/Qualifiers
     source          1..2408
                     /organism="Zaire ebolavirus"
                     /mol_type="genomic RNA"
                     /strain="Mayinga 1976"
                     /db_xref="taxon:186538"
                     /note="subtype: Zaire"

with what ends up in `datasets download virus genome taxon`:

{
  "accession": "U28077.1",
  "completeness": "PARTIAL",
  "isAnnotated": true,
  "length": 2408,
  "nucleotide": {
    "sequenceHash": "62B2211F"
  },
  "proteinCount": 2,
  "releaseDate": "1995-10-26T00:00:00Z",
  "sourceDatabase": "GenBank",
  "submitter": {
    "affiliation": "Anthony Sanchez, Special Pathogens Branch, Division of Viral and Rickettsial Diseases, Centers for Disease Control and Prevention, 1600 Clifton Road, Blgd. 15, Room SB611, Atlanta, GA 30333",
    "country": "USA",
    "names": [
      "Sanchez,A.",
      "Trappier,S.G.",
      "Mahy,B.W.",
      "Peters,C.J.",
      "Nichol,S.T."
    ]
  },
  "updateDate": "2002-08-28T00:00:00Z",
  "virus": {
    "organismName": "Zaire ebolavirus",
    "taxId": 186538
  }
}

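Valuable information is lost: the strain (`Mayinga 1976`), the molecule type (`genomic RNA`), and the note (`subtype: Zaire`) are all absent from the report.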

This is probably not even a very good example. I can think of more important notes, but I couldn't find one just now.

It would be nice if all of this metadata were passed through.

In fact, it might be a bug that molType is missing, as that is a field that should already be output per the schema here: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/data-reports/virus/
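
To make the gap concrete, here is a minimal Python sketch that parses the `source` qualifiers shown above and lists the ones whose values do not appear anywhere in the report. The helper names (`parse_qualifiers`, `leaf_values`, `missing_qualifiers`) are made up for this illustration, the report is trimmed to a few fields, and the parser only handles single-line `/key="value"` qualifiers, so treat it as a sketch rather than a general GenBank parser.

```python
import re

# The source feature quoted above (hyperlink markup removed).
GENBANK_SOURCE = '''\
     source          1..2408
                     /organism="Zaire ebolavirus"
                     /mol_type="genomic RNA"
                     /strain="Mayinga 1976"
                     /db_xref="taxon:186538"
                     /note="subtype: Zaire"
'''

# A trimmed copy of the virus report quoted above.
REPORT = {
    "accession": "U28077.1",
    "completeness": "PARTIAL",
    "length": 2408,
    "sourceDatabase": "GenBank",
    "virus": {"organismName": "Zaire ebolavirus", "taxId": 186538},
}


def parse_qualifiers(feature_text):
    """Return {qualifier: value} for single-line /key="value" qualifiers."""
    pattern = re.compile(r'^[ \t]*/(\w+)="([^"]*)"[ \t]*$', re.MULTILINE)
    return dict(pattern.findall(feature_text))


def leaf_values(obj):
    """Yield every scalar value in a nested dict/list structure as a string."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from leaf_values(value)
    elif isinstance(obj, list):
        for value in obj:
            yield from leaf_values(value)
    else:
        yield str(obj)


def missing_qualifiers(qualifiers, report):
    """Return the qualifiers whose value is not found verbatim in the report."""
    present = set(leaf_values(report))
    missing = {}
    for key, value in qualifiers.items():
        # db_xref values look like "taxon:186538"; compare only the ID part.
        needle = value.split(":", 1)[1] if key == "db_xref" else value
        if needle not in present:
            missing[key] = value
    return missing


missing = missing_qualifiers(parse_qualifiers(GENBANK_SOURCE), REPORT)
for key, value in missing.items():
    print(f"/{key}={value!r} is not in the report")
```

For this record, the sketch flags `/mol_type`, `/strain`, and `/note`, while `/organism` and the taxon ID are found under `virus`.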

olearyna commented 7 months ago

Hi corneliusroemer,

Thank you for your suggestions. We are currently reviewing your metadata requests in collaboration with the NCBI Virus team. We will resolve any issues on our end. However, some metadata requests might require coordination with the NCBI Virus team. I will update you once we start working on this.

All the best,

Nuala

Nuala A. O'Leary, PhD Product Owner, NCBI Datasets National Center for Biotechnology Information, NLM, NIH, DHHS

olearyna commented 7 months ago

Hi corneliusroemer,

I discussed your request with the NCBI Virus group. There are no current plans to pull data from the `/note` qualifier of the GenBank record, but they will look into it. Any updates they make will be picked up by NCBI Datasets. You can contact the NCBI Virus group through the general NCBI feedback form: https://support.nlm.nih.gov/support/create-case/.

Thanks, Nuala

dandaman commented 4 months ago

Any news on mapping `/mol_type` to "molType"? Or are there other ways to infer it from taxonomy data? I'd hate to be forced to download the GenBank format as well in the future...

olearyna commented 4 months ago

Hi dandaman,

We don't have moltype in the virus report yet, but you can get it from the taxonomy data report for any tax ID.

Here is the command using dataformat to get the tax ID from the virus report:

datasets summary virus genome accession U28077.1 --as-json-lines | dataformat tsv virus-genome --fields virus-tax-id --elide-header
186538

Here is the command to get the moltype from the taxonomy report using jq:

datasets summary taxonomy taxon 186538 | jq -r '.reports[].taxonomy.genomic_moltype'
ssRNA(-)
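
If it helps, the two steps can be chained in a script. Below is a minimal Python sketch that shells out to the same `datasets` and `dataformat` commands shown above and reads `genomic_moltype` from the taxonomy JSON. The function names and the injectable `runner` parameter are assumptions made for this example and are not part of the NCBI tooling. The sketch also assumes the taxonomy JSON has the `reports[].taxonomy.genomic_moltype` layout implied by the jq filter.

```python
import json
import subprocess


def run(cmd, stdin_text=None):
    """Run a command and return its stdout as text; raises on a non-zero exit."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout


def virus_tax_id(accession, runner=run):
    """Step 1: virus report -> tax ID, using the dataformat pipeline above."""
    report = runner(
        ["datasets", "summary", "virus", "genome", "accession", accession,
         "--as-json-lines"]
    )
    tsv = runner(
        ["dataformat", "tsv", "virus-genome", "--fields", "virus-tax-id",
         "--elide-header"],
        report,
    )
    return tsv.split()[0]


def genomic_moltypes(taxonomy_json):
    """Pull genomic_moltype out of a taxonomy report (same path as the jq filter)."""
    data = json.loads(taxonomy_json)
    return [
        r["taxonomy"]["genomic_moltype"]
        for r in data.get("reports", [])
        if "genomic_moltype" in r.get("taxonomy", {})
    ]


def moltype_for_accession(accession, runner=run):
    """Step 2: tax ID -> moltype via the taxonomy report."""
    tax_id = virus_tax_id(accession, runner)
    return genomic_moltypes(
        runner(["datasets", "summary", "taxonomy", "taxon", tax_id])
    )
```

With the CLI installed, calling `moltype_for_accession("U28077.1")` should follow the same path as the two commands above. The `runner` argument exists only so the parsing can be exercised without network access.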

Let me know if you have any questions.

Nuala

dandaman commented 4 months ago

Dear @olearyna,

that is perfect, thank you :-)

Best, Daniel