turbomam / biosample-basex

Using the Base-X XML database to discover structure in NCBI's Biosample database

Superseded by https://github.com/turbomam/biosample-xmldb-sqldb (which, despite its name, doesn't use an XML database)

biosample-basex

Using XQueries in a BaseX database to convert NCBI's BioSample database from XML to SQLite.

Querying the native BioSample database requires familiarity with, and access to, tools designed for XPath, XQuery, or XSLT. The SQLite database generated by this software can be queried with SQL, the most common query language, from Python programs or from a multitude of commercial, open-source, CLI, and GUI tools. No citation provided :-)
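As a sketch of how little tooling the SQLite route needs: the toy table below imitates a harmonized-attribute table (the table and column names here are invented for illustration; inspect the real schema with the sqlite3 CLI's .tables and .schema commands), and plain SQL answers a typical question about it.

```shell
# Build a toy table shaped roughly like a harmonized-attribute table
# (real table/column names may differ), then query it with plain SQL.
sqlite3 demo.db <<'SQL'
CREATE TABLE attribute (accession TEXT, harmonized_name TEXT, value TEXT);
INSERT INTO attribute VALUES
  ('SAMN1', 'geo_loc_name', 'USA'),
  ('SAMN2', 'geo_loc_name', 'Canada'),
  ('SAMN3', 'env_medium',   'soil');
-- How often does each harmonized attribute appear?
SELECT harmonized_name, COUNT(*) AS n
FROM attribute
GROUP BY harmonized_name
ORDER BY n DESC;
SQL
```

The same query runs unchanged from Python's sqlite3 module or a GUI tool like DBeaver.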

BioSample metadata background

When a scientist submits data (like taxonomic identification or gene expression) to NCBI, they are also required to submit metadata about the BioSamples from which that data was collected.

NCBI uses the term attribute for certain named information about the BioSamples. Submitters can tag this information with any attribute names they choose. NCBI curates the BioSamples, and attributes that appear to share the same meaning as controlled terms from standards like the GSC's MIxS are given harmonized names.

These BioSample attributes should not be confused with the XML markup construct that is also called an attribute. The BioSample attributes are actually XML elements (at the path BioSampleSet/BioSample/Attributes/Attribute), although @attribute_name and @harmonized_name are XML attributes.
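For illustration, a single BioSample's attributes might look like the fragment below (the accession and values are made up; the element and attribute names follow the path described above):

```xml
<BioSampleSet>
  <BioSample accession="SAMN00000001">
    <Attributes>
      <!-- Each BioSample "attribute" is an Attribute *element*; -->
      <!-- attribute_name and harmonized_name are XML *attributes* of it -->
      <Attribute attribute_name="geographic location"
                 harmonized_name="geo_loc_name">USA: California</Attribute>
      <Attribute attribute_name="sample type">soil</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
```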

Several other paths contain elements, and attributes of elements, that describe the BioSamples. For example, see the Path Index section of reports/biosample_set_1_info_index.txt

BaseX database limits

As of early January 2022, modeling the 22,786,924 BioSamples required 2,397,333,775 XML nodes. This repo uses the BaseX open-source XML database, which is feature-rich and performs well. However, BaseX has per-database limits, such as a maximum of 2,147,483,648 nodes per database. Therefore, NCBI's biosample_set.xml.gz is split into two chunks with sed and loaded into two BaseX databases. Queries written in the XQuery language are then executed over the two databases, generating various TSV files.
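The split can be illustrated on a toy document (the line numbers below are made up; the real pipeline computes the split point for the full biosample_set.xml). The key point is that the cut must land on a BioSample boundary, with the root element repaired in each half so both chunks remain well-formed XML.

```shell
# A miniature BioSampleSet: one root element wrapping BioSample records.
cat > toy.xml <<'XML'
<BioSampleSet>
<BioSample accession="S1"/>
<BioSample accession="S2"/>
<BioSample accession="S3"/>
<BioSample accession="S4"/>
</BioSampleSet>
XML

# Cut at a BioSample boundary and repair the root element in each half,
# so that both chunks are well-formed XML documents.
{ sed -n '1,3p' toy.xml; echo '</BioSampleSet>'; } > chunk1.xml
{ echo '<BioSampleSet>'; sed -n '4,6p' toy.xml; } > chunk2.xml
```

Each chunk can then be loaded into its own BaseX database, staying under the node limit.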

SQLite output

After running the XQueries, the generated TSVs are loaded into a SQLite database. SQLite was selected because the entire indexed, ready-to-use database can be shared as a single, compressible file, and because a plethora of compatible tools can use it without a persistent database server. Examples include Python's built-in sqlite3 module, the sqlite3 command-line client (which can be downloaded from https://www.sqlite.org/download.html or installed with the package managers included in many Unix-like operating systems), and graphical tools like DBeaver.
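The TSV-to-SQLite step can be sketched with the sqlite3 CLI's .import command (the file, table, and column names below are illustrative, not the repo's actual ones):

```shell
# A tiny tab-separated file with a header row.
printf 'accession\tharmonized_name\tvalue\nSAMN1\tgeo_loc_name\tUSA\n' > attribute.tsv

sqlite3 biosample.db <<'SQL'
.mode tabs
-- When the target table does not yet exist, .import creates it,
-- taking column names from the first row of the file.
.import attribute.tsv attribute
SELECT COUNT(*) FROM attribute;
SQL
```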

Tables in the SQLite database

Note: some column names, like id and temp, may be reserved words. Wrap these in double quotes when writing queries.
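For example, with the sqlite3 CLI (the table here is hypothetical; only the quoting convention matters):

```shell
sqlite3 quoting_demo.db <<'SQL'
-- TEMP is an SQLite keyword; double quotes mark "temp" as an identifier.
-- Quoting "id" too is harmless and avoids clashes in other tools.
CREATE TABLE sample_info ("id" TEXT, "temp" TEXT);
INSERT INTO sample_info VALUES ('SAMN00000001', '25 C');
SELECT "id", "temp" FROM sample_info;
SQL
```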

As of 2022-01-03

Usage

BaseX and SQLite are both open-source software and can be installed with Homebrew on Macs or apt-get on Ubuntu Linux machines. SQLite 3.32 or greater is required.

Downloading the BaseX .zip archive makes it a little more straightforward to increase the amount of memory allocated by the launch scripts. For example, on a 32 GB machine, basex/bin/basex might look like this:

#!/usr/bin/env bash

# Path to this script
FILE="${BASH_SOURCE[0]}"
while [ -h "$FILE" ] ; do
  SRC="$(readlink "$FILE")"
  FILE="$( cd -P "$(dirname "$FILE")" && \
           cd -P "$(dirname "$SRC")" && pwd )/$(basename "$SRC")"
done
MAIN="$( cd -P "$(dirname "$FILE")/.." && pwd )"

# Core and library classes
CP=$MAIN/BaseX.jar:$MAIN/lib/custom/*:$MAIN/lib/*:$CLASSPATH

# Options for virtual machine (can be extended by global options)
BASEX_JVM="-Xmx24g $BASEX_JVM"

# Run code
exec java -cp "$CP" $BASEX_JVM org.basex.BaseX "$@"

https://docs.basex.org/wiki/Main_Page

https://basex.org/download/

NERSC Cori-specific notes

make all takes roughly 8 hours.

After running make final_sqlite_gz_dest, the SQLite database will be available at https://portal.nersc.gov/project/m3513/biosample/