turbomam / biosample-xmldb-sqldb

Tools for loading NCBI Biosample into an XML database and then transforming that into a SQL database
MIT License

Despite its name, this repo does not use an XML database. See the older https://github.com/turbomam/biosample-basex for that.

See also attempts to load NCBI BioSample into MongoDB (from which tabular representations could also be extracted).

biosample-xmldb-sqldb

This repository provides an end-to-end pipeline to extract NCBI BioSample data from XML, transform it, and load it into a normalized PostgreSQL database for analytics and project-specific ETL, such as instantiating NMDC Biosample objects.

Don't forget to look through (or contribute to) the issues and TODOs below.

Codebase and Architecture


The key scripts are:

biosample_xml_to_relational.py

streaming_pivot_bisample_id_chunks.py
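To illustrate how a loader like biosample_xml_to_relational.py can process a ~100 GB XML file without exhausting memory, here is a minimal streaming sketch using the standard library's `xml.etree.ElementTree.iterparse`. The function and dictionary keys are illustrative assumptions, not the repo's actual code; the real script also batches rows into PostgreSQL.

```python
import io
import xml.etree.ElementTree as ET

def stream_biosamples(xml_source):
    """Yield one dict per <BioSample> element without loading the whole file.

    Illustrative sketch only: the real loader writes rows to Postgres in batches.
    """
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "BioSample":
            yield {
                "accession": elem.get("accession"),
                "attributes": {
                    # Prefer the harmonized name when NCBI provides one
                    a.get("harmonized_name", a.get("attribute_name")): a.text
                    for a in elem.iter("Attribute")
                },
            }
            elem.clear()  # free the parsed subtree before the next record

# Tiny inline stand-in for downloads/biosample_set.xml
sample = io.StringIO(
    "<BioSampleSet>"
    "<BioSample accession='SAMN1'>"
    "<Attributes><Attribute attribute_name='depth'>5 m</Attribute></Attributes>"
    "</BioSample>"
    "</BioSampleSet>"
)
rows = list(stream_biosamples(sample))
print(rows[0]["accession"])            # SAMN1
print(rows[0]["attributes"]["depth"])  # 5 m
```

Calling `elem.clear()` after each record is what keeps memory flat: without it, the parser retains every completed `<BioSample>` subtree.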

Database Model

The key tables populated are:

ncbi_attributes_all_long

Long format table with one row for each NCBI attribute of each Biosample.

ncbi_attributes_harmonized_wide

A pivot of ncbi_attributes_all_long with columns for each harmonized attribute and one row per Biosample. The values include units if available.
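The long-to-wide pivot can be sketched in plain Python. The tuples below mirror the column roles described above (accession, harmonized attribute, value with units), but this is an illustration, not the logic of streaming_pivot_bisample_id_chunks.py, which works in chunks against Postgres.

```python
from collections import defaultdict

# Long-format rows: one row per (biosample, harmonized attribute), as in
# ncbi_attributes_all_long. Values keep their units where available.
long_rows = [
    ("SAMN1", "depth", "5 m"),
    ("SAMN1", "elev", "120 m"),
    ("SAMN2", "depth", "30 m"),
]

def pivot_wide(rows):
    """Collapse long rows into one mapping per biosample (the wide layout)."""
    wide = defaultdict(dict)
    for accession, attribute, value in rows:
        wide[accession][attribute] = value
    return dict(wide)

wide = pivot_wide(long_rows)
print(wide["SAMN1"])  # {'depth': '5 m', 'elev': '120 m'}
```

In the wide layout each harmonized attribute becomes a column, so biosamples lacking an attribute simply get a NULL (here, a missing key).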

non_attribute_metadata

One row of metadata per BioSample. These columns are populated from XML paths other than BioSample/Attributes/Attribute.

ncbi_attributes_harmonized_wide

This view combines the two tables described above, avoiding the need to explicitly join them in analytical queries. It is assumed that most ETLs will use this view as input.
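What the view provides can be mimicked in Python by merging the two per-biosample tables on accession. The row contents below are hypothetical examples, and the function name is illustrative; in the database this is a single SQL join baked into the view.

```python
# Hypothetical rows standing in for the two tables the view combines.
harmonized_wide = {"SAMN1": {"depth": "5 m"}}
non_attribute_metadata = {"SAMN1": {"title": "marine sediment", "taxonomy_id": "410658"}}

def joined_view(wide, metadata):
    """Merge the two per-biosample tables, as the SQL view's join does."""
    return {
        accession: {**metadata.get(accession, {}), **attrs}
        for accession, attrs in wide.items()
    }

view = joined_view(harmonized_wide, non_attribute_metadata)
print(view["SAMN1"]["depth"], view["SAMN1"]["title"])
```

An ETL consuming the view gets both harmonized attributes and record-level metadata in one row, with no join logic of its own.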

Overview

The pipeline proceeds in phases, one per Makefile target: download the XML, start and initialize PostgreSQL, load the long table, pivot to the wide table, and create the combined view.

Getting Started

The pipeline is orchestrated end-to-end via make, with targets for each stage.

Requirements

The full pipeline flow and additional options are documented in the Makefile.

Don't forget to run long-running stages inside a screen session.

The main pipeline tasks are defined in the Makefile and can be run as:

make downloads/biosample_set.xml # Download the BioSample XML dataset
make postgres-up                 # Start PostgreSQL container  
make postgres-create             # Create DB and user
make postgres-load               # Load data from BioSample XML    
make postgres-pivot              # Pivot attribute data
make postgres-post-pivot-view    # Create a view of the pivoted data

The pipeline uses a local/.env file for the Postgres credentials and connection string. A template is provided.

The Makefile downloads all of NCBI's BioSample collection and unpacks it, using ~100 GB of disk space. The max-biosamples option supports a quick test/partial load mode. The entire build takes ~24 hours and requires an additional ~100 GB for the PostgreSQL tables and indices.

Discussion of NCBI "Attributes" and XML "Attributes"

You will see mentions of the word attributes in two separate and unfortunately confusing contexts: "NCBI Attributes" refers to specific XML paths NCBI has used under the BioSample records. "XML attributes" refers to the very basic feature offered by the XML language, in which key/value pairs are asserted within a node's opening tag.

The BioSample XML structure contains a distinction between:

NCBI Attributes

XML Attributes
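The distinction is easy to see in code. In the sketch below (record contents are illustrative), `accession` and `id` are XML attributes asserted in the BioSample opening tag, while `geo_loc_name` is an NCBI Attribute: an `<Attribute>` element under `<Attributes>`.

```python
import xml.etree.ElementTree as ET

# A trimmed BioSample record. accession/id are XML attributes on the opening
# tag; the <Attribute> elements under <Attributes> are NCBI Attributes.
record = ET.fromstring(
    "<BioSample access='public' id='40028294' accession='SAMN40028294'>"
    "<Attributes>"
    "<Attribute attribute_name='geo_loc_name'>USA: California</Attribute>"
    "</Attributes>"
    "</BioSample>"
)

xml_attributes = record.attrib  # key/value pairs from the opening tag
ncbi_attributes = {
    a.get("attribute_name"): a.text for a in record.iter("Attribute")
}

print(xml_attributes["accession"])      # SAMN40028294
print(ncbi_attributes["geo_loc_name"])  # USA: California
```

The pipeline routes the two kinds of data differently: XML attributes feed non_attribute_metadata, while NCBI Attributes feed ncbi_attributes_all_long.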

Performance notes

Run these commands from the root of the repo.

date && time grep -c '<BioSample' downloads/biosample_set.xml

Thu Feb 22 17:20:39 UTC 2024
37572120

real    6m41.464s
user    0m46.455s
sys     0m33.311s

tail -n 100 downloads/biosample_set.xml | grep '<BioSample'

<BioSample access="public" publication_date="2024-02-22T00:00:00.000" last_update="2024-02-22T01:55:09.056" submission_date="2024-02-22T01:55:09.056" id="40028294" accession="SAMN40028294">
<BioSample access="public" publication_date="2024-02-22T00:00:00.000" last_update="2024-02-22T02:28:19.950" submission_date="2024-02-22T02:22:05.510" id="40028511" accession="SAMN40028511">

time make postgres-load

2024-02-21 18:52:52,538 Processed 37,550,001 to 37,551,000 of 50,000,000 biosamples (75.10%)
2024-02-21 18:52:53,673 Done parsing biosamples
Elapsed time: 60675.36360359192 seconds
2024-02-21 18:52:55,738 Done parsing biosamples

real    1011m25.977s
user    970m59.386s
sys     1m50.840s

The entire build of a 105 GB, ~35 million record BioSample XML dataset takes approximately 2 days. The resulting Postgres database can be dumped with pg_dump -Fc to a roughly 4 GB file.

Assets

A dump of the database can be downloaded from https://portal.nersc.gov/project/m3513/biosample/?C=M;O=D

The direct link is https://portal.nersc.gov/project/m3513/biosample/biosample_postgres_20240226.dmp.gz

People who have NERSC credentials can run live SQL queries against an instance of this database maintained by the National Microbiome Data Collaborative (NMDC).

Users visiting the NERSC-hosted database may see additional tables containing metadata about MIxS and EnvO terms. Documentation is pending; see issue #18.

Additional resources

This repo captures BioSample data from NCBI. Nominally, NCBI BioSample and EBI BioSamples both follow the Genomic Standards Consortium's MIxS standards. In reality, all three have significant differences in which attributes they expect for biosamples of different types, and they take different approaches to validating attribute values.

Compared to previous methods using the BaseX XML database

Advantages

Limitations

TODOs:

See also https://github.com/turbomam/biosample-xmldb-sqldb/issues

References