Open turbomam opened 6 months ago
Many BioSamples are annotated with their BioProject links through a structure like this:
<Link type="entrez" target="bioproject" label="PRJNA656268">656268</Link>
This is extracted as follows:
let $bp_link := $bs/Links/Link[@type='entrez' and @target='bioproject']
let $bp_ids := fn:normalize-space(
string-join($bp_link,$delim))
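For readers who do not use XQuery, the same selection can be sketched in Python with the standard library. This is only an illustration, not the code used in the pipeline: the function name `bioproject_ids` and the sample XML wrapper are hypothetical, the `|||` default is taken from the delimiter mentioned in this issue, and whitespace is stripped per link rather than on the joined string.

```python
import xml.etree.ElementTree as ET

DELIM = "|||"  # assumed value of $delim, based on the delimiter noted in this issue


def bioproject_ids(biosample_xml, delim=DELIM):
    """Return the text of all entrez/bioproject links, joined by delim."""
    bs = ET.fromstring(biosample_xml)
    # Same predicate as the XQuery: Link elements under Links with
    # type='entrez' and target='bioproject'
    links = bs.findall("./Links/Link[@type='entrez'][@target='bioproject']")
    # The element text is the numerical ID; the accession sits in @label
    return delim.join((link.text or "").strip() for link in links)


# Hypothetical minimal BioSample record built around the Link shown above
sample = (
    "<BioSample><Links>"
    '<Link type="entrez" target="bioproject" label="PRJNA656268">656268</Link>'
    '<Link type="url" label="example">https://example.org</Link>'
    "</Links></BioSample>"
)
print(bioproject_ids(sample))
```

Swapping `link.text` for `link.get("label")` would collect the accessions instead of the numerical IDs.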
The numerical BioProject identifiers (not the alphanumeric accession strings) are saved as the bp_id column in Postgres table non_attribute_metadata (which is also accessible from attributes_plus_view). We could switch to or add the BioProject accessions, but the numerical IDs are displayed towards the upper right on BioProject pages, in a string like
Accession: PRJNA656268 ID: 656268
The bp_id values may be delimited with |||.
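Because a bp_id value can hold several IDs, a plain substring search for 656268 could also hit an unrelated ID such as 6562689. A minimal sketch of exact matching, assuming the delimiter is literally `|||`; the helper names `split_bp_ids` and `has_bioproject` are hypothetical:

```python
DELIM = "|||"  # delimiter described above


def split_bp_ids(bp_id_value, delim=DELIM):
    """Split a possibly multi-valued bp_id string into individual IDs."""
    if not bp_id_value:
        return []
    return [part.strip() for part in bp_id_value.split(delim) if part.strip()]


def has_bioproject(bp_id_value, wanted, delim=DELIM):
    """True only if wanted matches one whole ID, not a substring of one."""
    return str(wanted) in split_bp_ids(bp_id_value, delim)


print(has_bioproject("111|||656268", 656268))  # exact match on the second ID
print(has_bioproject("6562689", 656268))       # substring only, so no match
```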
I don't know how thoroughly the BioSamples are annotated like that, or whether we would need to load additional data from BioProject or SRA. Or maybe an efetch operation would be most efficient.
But for now, a report of the attributes of BioSamples from PRJNA656268 can be generated once the database is regenerated on my workstation overnight, without requiring any other resources, by searching for 656268 in attributes_plus_view.bp_id.
It would probably be a good idea to load that into pandas and remove empty columns.
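A rough sketch of the pandas cleanup step, assuming the report has already been loaded into a DataFrame. Here "empty" is taken to mean that every value in the column is missing or an empty string, which is an assumption about how the view represents absent attributes; the function name and the toy data are made up for illustration.

```python
import pandas as pd


def drop_empty_columns(df):
    """Drop columns in which every value is missing or an empty string."""
    is_empty = df.isna() | (df == "")
    keep = ~is_empty.all(axis=0)
    return df.loc[:, keep]


# Toy stand-in for rows pulled from attributes_plus_view (values are invented)
report = pd.DataFrame(
    {
        "sample": ["s1", "s2"],
        "bp_id": ["656268", "111|||656268"],
        "depth": ["5", ""],      # partly filled, so it is kept
        "host": ["", None],      # entirely empty, so it is dropped
    }
)
print(list(drop_empty_columns(report).columns))
```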
See https://www.ncbi.nlm.nih.gov/bioproject/PRJNA656268