turbomam / biosample-xmldb-sqldb

Tools for loading NCBI Biosample into an XML database and then transforming that into a SQL database
MIT License

Request for PRJNA656268 metadata from Adam Martiny via Montana #16

Open turbomam opened 6 months ago

turbomam commented 6 months ago

See https://www.ncbi.nlm.nih.gov/bioproject/PRJNA656268

turbomam commented 6 months ago

Many BioSamples are annotated with their BioProject links through a structure like this:

<Link type="entrez" target="bioproject" label="PRJNA656268">656268</Link>

These links are extracted like this:

let $bp_link := $bs/Links/Link[@type='entrez' and @target='bioproject']

let $bp_ids := fn:normalize-space(
  string-join($bp_link,$delim))
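For anyone more comfortable in Python, here is a rough equivalent of that XQuery using only the standard library. It is an illustrative sketch, not code from this repo: the trimmed sample XML, the function name `bioproject_ids`, and the default `|||` delimiter are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical BioSample record containing only the elements of interest.
SAMPLE = """<BioSample>
  <Links>
    <Link type="entrez" target="bioproject" label="PRJNA656268">656268</Link>
    <Link type="url" label="example">https://example.org</Link>
  </Links>
</BioSample>"""


def bioproject_ids(biosample_xml: str, delim: str = "|||") -> str:
    """Join the text of all entrez/bioproject Link elements with `delim`."""
    root = ET.fromstring(biosample_xml)
    # Same filter as the XQuery predicate: type='entrez' and target='bioproject'.
    links = root.findall("./Links/Link[@type='entrez'][@target='bioproject']")
    joined = delim.join(link.text or "" for link in links)
    # Mimic fn:normalize-space: trim and collapse runs of whitespace.
    return " ".join(joined.split())


print(bioproject_ids(SAMPLE))
```

Only the numerical id (the element text) is returned here; the accession would come from the `label` attribute instead.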

The numerical BioProject identifiers (not the alphanumeric accession strings) are saved as the bp_id column in Postgres table non_attribute_metadata (which is also accessible from attributes_plus_view). We could switch to or add the BioProject accessions, but the numerical ids are displayed towards the upper right on BioProject pages, in a string like

Accession: PRJNA656268 ID: 656268

A single bp_id value may hold several ids, delimited by |||.
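One consequence of the delimiter is that a plain substring search for 656268 would also match a longer id such as 1656268. A minimal sketch of an exact match against the delimited field follows; the helper name `has_bioproject_id` is hypothetical, and it assumes the delimiter is exactly `|||` with no escaping.

```python
def has_bioproject_id(bp_id_field, wanted, delim="|||"):
    """Return True if `wanted` is one of the ids in a delimited bp_id value."""
    if not bp_id_field:
        # NULL or empty bp_id: the BioSample has no BioProject link recorded.
        return False
    ids = [part.strip() for part in bp_id_field.split(delim)]
    return wanted in ids


print(has_bioproject_id("12345|||656268", "656268"))
print(has_bioproject_id("1656268", "656268"))
```

The first call matches because 656268 is a whole entry; the second does not, even though the digits appear inside the longer id.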

I don't know how thoroughly the BioSamples are annotated like that, or whether we would need to load additional data from BioProject or SRA. Or maybe an efetch operation would be most efficient.

turbomam commented 6 months ago

But for now, a report of the attributes of BioSamples from PRJNA656268 can be generated once the database is regenerated on my workstation overnight, without requiring any other resources, by searching for 656268 in attributes_plus_view.bp_id.

It would probably be a good idea to load that into pandas and remove empty columns.
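A rough sketch of the column cleanup, assuming the query result is already in a DataFrame (for example via `pandas.read_sql`, not shown here because it needs a live connection). The function name and the example column names other than bp_id are made up for illustration, and "empty" is taken to mean all values are NULL or whitespace-only strings.

```python
import pandas as pd


def drop_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns in which every value is missing or a blank string."""
    # Treat whitespace-only strings as missing, only for deciding what to keep.
    blank_as_nan = df.replace(r"^\s*$", float("nan"), regex=True)
    keep = blank_as_nan.notna().any(axis=0)
    # Return the original values for the columns that have some content.
    return df.loc[:, keep]


# Hypothetical stand-in for rows selected from attributes_plus_view.
report = pd.DataFrame(
    {
        "bp_id": ["656268", "656268"],
        "depth": ["5 m", ""],
        "host": ["", " "],
        "altitude": [None, None],
    }
)

print(list(drop_empty_columns(report).columns))
```

Columns with at least one real value, like depth above, are kept as they are, including their blank cells.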