Superseded by https://github.com/turbomam/biosample-xmldb-sqldb (which, despite it's name, doesn't use an XML database)
Using XQueries in a BaseX database to convert NCBI's BioSample database from XML to SQLite.
Querying the native BioSample database requires familiarity with, and access to tools designed for XPath, XQuery or XSLT. The SQLite database generated by this software can be queried from Python programs or a multitude of commercial, open-source, CLI and GUI tools, using the most common query language. No citation provided :-)
When a scientist submits data (like taxonomic identification or gene expression) to NCBI, they are also required to submit metadata about the BioSamples from which that data was collected.
NCBI uses the term attribute to describe certain named information about the BioSamples. Submitters can tag this information about their BioSamples with any attribute names they choose. The NCBI performs curation of the BioSamples, and attributes that appear to share the same meaning as controlled terms from standards like the GSC's MIxS
are given harmonized names.
These BioSample attribute
s should not be confused with XML markup construct that is also called attribute
. In fact, the BioSample attributes are actually XML element
s (from the path BioSampleSet/BioSample/Attributes/Attribute
) although @attribute_name
and @harmonized_name
are XML attribute
s.
There are several other paths with element
s, and attribute
s of elements that describe the BioSamples. For example, see the Path Index section of reports/biosample_set_1_info_index.txt
As of early January, 2022, the number of XML nodes necessary to model the 22,786,924 BioSamples was 2,397,333,775. This repo uses the BaseX open-source XML database, which is feature rich and performs well. However, BaseX does have per-database limits, such as no more than 2,147,483,648 nodes per database. Therefore, NCBI's biosample_set.xml.gz is split into two chunks with sed
and then loaded in two BaseX databases. Queries written in the XQuery
language are then executed over the two databases, generating various TSV
files.
After running the XQueries, the generated TSV
s are loaded into a SQLite database. SQLite was selected because the entire, indexed and ready-to-use database can be shared as a single, compressible file and because of the plethora of compatible tools, which do not require setting up a persistent database server. Examples include Python's built-in sqlite module, the sqlite
command line client (which can be downloaded from https://www.sqlite.org/download.html or installed with package managers that are included in many Unix-like operating systems), or with a graphical tool like DBeaver.
Note: some column names like id
and temp
may be reserved words. These should be wrapped in double quotes when writing queries.
CREATE INDEX all_attribs_idx on all_attribs("harmonized_name", "raw_id", "attribute_name");
TEXT
unless otherwise specified):
INTEGER
)env_package
valuesMakefile
has a step to create a target/env_package_repair_new.tsv
. No tools are provided yet for automatically reconciling the new file with the previous curated file.TEXT
unless otherwise specified):
INTEGER
)TEXT
, in wide format. Also includes the BioSample raw_id
s, as INTEGER
s.raw_id
raw_id_idx
and env_package
. More could be added for better queries at the cost of disk footprint.harmonized_wide
. Serves as a place to collect values that have gone though some cleanup/repair process. Currently env_package
is the only column repaired by this repo, by way of the env_package_repair
table.raw_id_idx
DOI
s have been omitted , since they are only available for ~ 400 out of the 20+ million BioSamples.harmonized_wide
rows as some BioSamples don't have any attributes with harmonized names.TEXT
unless otherwise indicated):
biosample_set.xml
file. Built from the prefix "BIOSAMPLE:" and the primary_id
)INTEGER
, primary key)TEXT
, they are numerical BioProject IDs and should really be joined with the BioProject XML database to get BioProject accessions.)non_attribute_metadata
+ harmonized_wide
tablesAs of 2022-01-03
BaseX and SQLite can be installed with homebrew on Macs or apt-get
on Ubuntu Linux machines. They're both open-source software. SQLite3 3.32 or greater is required.
Downloading the BaseX .zip
archive makes it a little more straightforward to increase the amount of memory allocated by the launch scripts. For example, on a 32 GB machine, basex/bin/basex
might look like this:
#!/usr/bin/env bash
# Path to this script
FILE="${BASH_SOURCE[0]}"
while [ -h "$FILE" ] ; do
SRC="$(readlink "$FILE")"
FILE="$( cd -P "$(dirname "$FILE")" && \
cd -P "$(dirname "$SRC")" && pwd )/$(basename "$SRC")"
done
MAIN="$( cd -P "$(dirname "$FILE")/.." && pwd )"
# Core and library classes
CP=$MAIN/BaseX.jar:$MAIN/lib/custom/*:$MAIN/lib/*:$CLASSPATH
# Options for virtual machine (can be extended by global options)
BASEX_JVM="-Xmx24g $BASEX_JVM"
# Run code
exec java -cp "$CP" $BASEX_JVM org.basex.BaseX "$@"
https://docs.basex.org/wiki/Main_Page
make all
takes roughly 8 hours.
cd /global/cfs/cdirs/m3513/endurable/biosample
git clone git@github.com:turbomam/biosample-basex.git
wget
the latest BaseX zip archive
wget https://files.basex.org/releases/9.6.4/BaseX964.zip
unzip BaseX964.zip
basex/bin/basex
BASEX_JVM="-Xmx96g $BASEX_JVM"
cp biosample-basex/template.env biosample-basex/.env
biosample-basex/.env
BASEXCMD = ../basex/bin/basex
del_from
should be roughly half the number of BioSamples. The suggested value of 12500001 should be reasonable but could be revised after downloading and unpacking the BioSample XML file. Run something like tail -n 50 target
at the shell prompt, note the id
attribute of the last <BioSample>
, and set del_from
to an integer close to one half of that last id
value. screen
. If you get disconnected, it may be tricky to log back into the same cori node because of load balancing, but scripts started before the disconnection should run to completion. It doesn't hurt to force your client computer to stay awake with something like caffeine
.module load python
pandas
available. On other systems, users should create a venv
virtual environment, enter it, and run pip install -r requirments.txt
biosample-basex/
make all
After running make final_sqlite_gz_dest
, the SQLite will be available at https://portal.nersc.gov/project/m3513/biosample/