Despite its name, this repo does not use an XML database. See the older https://github.com/turbomam/biosample-basex for that.
See also attempts to load NCBI BioSample into MongoDB (from which tabular representations could also be extracted).
This repository provides an end-to-end pipeline to extract NCBI BioSample data from XML, transform it, and load it into a normalized PostgreSQL database for analytics and project-specific ETL, such as instantiating NMDC Biosample objects.
Don't forget to look through (or contribute to) the issues and TODOs below.
The key components are:

- `Makefile` - defines pipeline tasks and orchestration logic
- `biosample_xmldb_sqldb/biosample_xml_to_relational.py` - extracts BioSample data and loads it into normalized tables
- `biosample_xmldb_sqldb/streaming_pivot_bisample_id_chunks.py` - pivots attribute data into a wide format

The key scripts are:

- `biosample_xml_to_relational.py`, which populates the `ncbi_attributes_all_long` and `non_attribute_metadata` tables
- `streaming_pivot_bisample_id_chunks.py`, which pivots on `raw_id` to populate the `ncbi_attributes_harmonized_wide` table

The key tables populated are:

- `ncbi_attributes_all_long`: a long-format table with one row for each NCBI attribute of each Biosample.
- `ncbi_attributes_harmonized_wide`: a pivot of `ncbi_attributes_all_long`, with columns for each harmonized attribute and one row per Biosample. The values include units if available.
- `non_attribute_metadata`: one row of metadata per Biosample. These columns are populated from XML paths other than `BioSample/Attributes/Attribute`.

A view combines the two tables described above, avoiding the need to explicitly join them in analytical queries. It is assumed that most ETLs will use this view as input.
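As a quick orientation, the queries below show the kind of access these tables support. This is only a sketch: the table names, `harmonized_name`, and the `raw_id` key come from this README, but other column names are assumptions and may differ from the actual schema.

```sql
-- Most common harmonized attributes (one row per attribute per Biosample):
SELECT harmonized_name, COUNT(*) AS n
FROM ncbi_attributes_all_long
GROUP BY harmonized_name
ORDER BY n DESC
LIMIT 10;

-- Join the wide pivot to the per-Biosample metadata; raw_id is assumed
-- to be the key shared by the two tables:
SELECT w.*, m.*
FROM ncbi_attributes_harmonized_wide AS w
JOIN non_attribute_metadata AS m USING (raw_id)
LIMIT 5;
```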
The pipeline is orchestrated end-to-end via `make`, with a target for each phase.

First, install the Python dependencies:

```
poetry install
```

The full pipeline flow and additional options are documented in the Makefile.
Don't forget to run the scripts inside a `screen` session.
The main pipeline tasks are defined in the Makefile and can be run as:

```
make downloads/biosample_set.xml  # Download the BioSample XML dataset
make postgres-up                  # Start PostgreSQL container
make postgres-create              # Create DB and user
make postgres-load                # Load data from BioSample XML
make postgres-pivot               # Pivot attribute data
make postgres-post-pivot-view     # Create a view of the pivoted data
```
The pipeline uses a `local/.env` file for Postgres credentials and connection string. A template is provided.
The Makefile downloads all of NCBI's BioSample collection and unpacks it, using ~100 GB of disk space. The `max-biosamples` option provides support for a quick test/partial load mode. The entire build takes ~24 hours and requires an additional ~100 GB for the PostgreSQL tables and indices.
You will see the word "attributes" in two separate and unfortunately confusing contexts: "NCBI Attributes" refers to specific XML paths NCBI has used under BioSample records. "XML attributes" refers to the very basic feature of the XML language in which key/value pairs are asserted within a node's opening tag.
The BioSample XML structure therefore contains a distinction between:

- `<Attribute>` nodes in a Biosample's `<Attributes>` path, which populate the `ncbi_attributes_all_long` table
- XML attributes like `id`, `name`, `url`, etc. and nodes like `Id`, `Link`, etc., which populate the `non_attribute_metadata` table
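To make the distinction concrete, here is a hedged sketch of how the two kinds of data could be inspected once loaded. Only the table names, `raw_id`, and `harmonized_name` come from this README; the other column names, and the assumption that `raw_id` holds the numeric BioSample id, are illustrative.

```sql
-- NCBI Attributes become rows in the long table (assumed columns):
SELECT attribute_name, harmonized_name, value
FROM ncbi_attributes_all_long
WHERE raw_id = 40028294;

-- XML attributes and other non-Attribute paths become columns
-- of the per-Biosample metadata table:
SELECT *
FROM non_attribute_metadata
WHERE raw_id = 40028294;
```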
Run these commands from the root of the repo.
```
date && time grep -c '<BioSample' downloads/biosample_set.xml
Thu Feb 22 17:20:39 UTC 2024
37572120

real    6m41.464s
user    0m46.455s
sys     0m33.311s
```
```
tail -n 100 downloads/biosample_set.xml | grep '<BioSample'
<BioSample access="public" publication_date="2024-02-22T00:00:00.000" last_update="2024-02-22T01:55:09.056" submission_date="2024-02-22T01:55:09.056" id="40028294" accession="SAMN40028294">
<BioSample access="public" publication_date="2024-02-22T00:00:00.000" last_update="2024-02-22T02:28:19.950" submission_date="2024-02-22T02:22:05.510" id="40028511" accession="SAMN40028511">
```
```
time make postgres-load
2024-02-21 18:52:52,538 Processed 37,550,001 to 37,551,000 of 50,000,000 biosamples (75.10%)
2024-02-21 18:52:53,673 Done parsing biosamples
Elapsed time: 60675.36360359192 seconds
2024-02-21 18:52:55,738 Done parsing biosamples

real    1011m25.977s
user    970m59.386s
sys     1m50.840s
```
The entire build of a 105 GB / 35 million BioSample XML dataset takes approximately 2 days. The resulting Postgres database can be dumped with `pg_dump -Fc` to a roughly 4 GB file.
A dump of the database can be downloaded from https://portal.nersc.gov/project/m3513/biosample/?C=M;O=D
The direct link is https://portal.nersc.gov/project/m3513/biosample/biosample_postgres_20240226.dmp.gz
People who have NERSC credentials can run live SQL queries against an instance of this database maintained by the National Microbiome Data Collaborative:

- Open an SSH tunnel that forwards a local port (15432 in the connection string below) to the database host, using your NERSC key (e.g. `-i ~/.ssh/nersc`); you will be asked for your password and one-time code.
- Replace `<NERSC_USERNAME>` with the username NERSC assigned to you.
- Replace `<POSTGRES_PASSWORD>` with the appropriate password, obtained from @turbomam.
- Then connect with `psql postgres://biosample_guest:<POSTGRES_PASSWORD>@localhost:15432`
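Once connected, a simple sanity check might look like the following. This is only a sketch; it assumes the guest role can read the tables described earlier in this README.

```sql
-- How many Biosamples are loaded?
SELECT COUNT(*) FROM non_attribute_metadata;

-- Peek at a few long-format attribute rows:
SELECT * FROM ncbi_attributes_all_long LIMIT 5;
```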
Users visiting the NERSC-hosted database may see additional tables containing metadata about MIxS and EnvO terms. Documentation is pending. See issue #18.
This repo captures Biosample data from NCBI. Nominally, it (and EBI BioSamples) follow the Genomic Standards Consortium's MIxS standards. In reality, all three have significant differences in which attributes they expect for Biosamples of different types, and they take different approaches to validating the attribute values.
See also https://github.com/turbomam/biosample-xmldb-sqldb/issues
- Normalize the values of `harmonized_name` attributes. For example, converting all `depth`s into a single numeric column with a consistent unit. May benefit from quantulum3 (see the sketch after this list).
- `env_broad_scale`
- Align `model`, `package` and `package_name` to match MIxS Extensions and Checklists
- Load `bioproject.xml` or some SRA metadata?
- Convert `BioSample/Description/Comment/Table` paths into something that could go into Postgres
- Could `biosample_xml_to_relational.py` and `streaming_pivot_bisample_id_chunks.py` be interleaved?
- `|||` delimiter when searching columns like `bp_id`
- `sql/experimental-factor-fts-query.sql`
- Connection settings appear in `local/.env` and at the top of the `Makefile`
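As a rough illustration of the depth-normalization idea in the first TODO item, a Postgres query along these lines could pull a leading numeric token out of free-text `depth` values. This is only a sketch (the `value` column name is an assumption), and it ignores units and ranges, which is exactly why a quantity parser like quantulum3 looks attractive:

```sql
-- Extract a leading number from free-text depth values (assumed column: value).
-- Values like '10 m', '0.5-1.5 m', or 'surface' show why real unit-aware
-- parsing is needed; rows with no leading number yield NULL here.
SELECT raw_id,
       value AS depth_raw,
       (regexp_match(value, '^\s*(-?[0-9]+(?:\.[0-9]+)?)'))[1]::numeric AS depth_number
FROM ncbi_attributes_all_long
WHERE harmonized_name = 'depth'
LIMIT 20;
```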