victorlin opened this issue 2 years ago
Thank you for writing this up, @victorlin! For the CLI, we should consider continuing to support separate `--metadata` and `--sequences` arguments vs. adding a `--database` argument. The pandas `read_csv` API that gets called behind the scenes to load metadata already supports a nice generic interface where any valid file path or URL can be passed as input. This interface allows us to run `augur filter` like:
```sh
augur filter \
  --metadata s3://nextstrain-data/metadata.tsv.gz \
  --output-metadata filtered_metadata.tsv

augur filter \
  --metadata https://data.nextstrain.org/metadata.tsv.gz \
  --output-metadata filtered_metadata.tsv
```
This kind of interface could be extended to support databases (and TSVs, etc.) like so:
```sh
augur filter \
  --metadata metadata.sqlite3 \
  --output-metadata filtered_metadata.tsv

augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --output-metadata filtered_metadata.tsv
```
Obviously, the issue becomes more complicated if we expect sequences and metadata to be in the same database to maintain internal consistency. The interface above could become frustrating, confusing, and/or inconsistent if users had to specify database paths multiple times like:
```sh
augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --sequences "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --output-metadata filtered_metadata.tsv \
  --output-sequences filtered_sequences.fasta

# Valid but potentially confusing if databases are not consistent.
augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --sequences sequences.sqlite3 \
  --output-metadata filtered_metadata.tsv \
  --output-sequences filtered_sequences.fasta
```
Whatever interface we choose, we should consider how/whether to abstract the type of input data from how it is being stored. For example, `--database` could become `--data` if it represents both sequences and metadata.
@victorlin, @tsibley, and I discussed this a bit previously, but we should also consider what internal interface we want for `metadata` and `sequences`, so we can represent metadata records in a way that doesn't couple the records to the storage medium. For example, do we want to always represent metadata records internally as pandas DataFrames (for the functionality that comes with them) even if the data are stored in a database? Or do we want to define our own `Metadata` or `MetadataRecord` classes to provide access to records? These don't have to be too complicated; they could be implemented as dataclasses.
We currently pass around pandas DataFrames internally, which strongly couples our internal use of metadata to how they are stored. It is easy to imagine changing all of this code to work with database query sets instead, such that all of our code becomes strongly coupled to the database implementation. We might consider using factories that implement methods to generate `Metadata` and `Sequence` records. Thinking about these abstractions now could free us in the future to switch underlying implementations or implement internal workarounds to issues with the APIs of the tools we depend on (e.g., BioPython, pandas, GFF parsers, etc.).
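As a rough illustration of that idea, here is a minimal sketch of storage-agnostic record classes and a factory; the names are hypothetical, not an existing augur API:

```python
import csv
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """One metadata row, independent of how it is stored."""
    id: str                                     # e.g. the strain name
    fields: dict = field(default_factory=dict)  # remaining columns


@dataclass
class SequenceRecord:
    """One sequence, independent of FASTA/database storage."""
    id: str
    seq: str


class TsvMetadataSource:
    """Example factory yielding MetadataRecord objects from a TSV file.

    A database-backed factory could expose the same records() method,
    keeping callers decoupled from the storage medium.
    """

    def __init__(self, path: str, id_column: str = "strain"):
        self.path = path
        self.id_column = id_column

    def records(self):
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                yield MetadataRecord(id=row[self.id_column], fields=row)
```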
Update: after experimenting with loading sequence data into the database (06e56981432be868939b9d40cadb0c529e99e38d), it turned out to be unnecessary overhead for `augur filter`, so it will come later. More info in Slack thread.
@huddlej
I like the idea of expanding `--metadata` to handle databases. Though we are postponing sequences-in-database for now, this will come up eventually, and your point about having both `--metadata` and `--sequences` is important to address. I don't think we should support two separate databases; that would complicate things, and we'd want to `JOIN` on sequence IDs within a single database.
This sequence of features might work:
1. `--metadata <sqlite_file>`
   - `<sqlite_file>` is a SQLite3 database file, detected by filename ending in `.sqlite3`.
   - It contains a table `metadata`, which has rows/columns comparable to currently supported metadata files.
2. `--output-metadata <sqlite_file>` as the counterpart to (1).
3. `--data <sqlite_file>`
   - `<sqlite_file>` is the same thing as in (1).
   - A table `metadata`, same thing as in (1).
   - A table `sequences`, which contains 2 columns:
     - `id`: sequence identifier (comparable to `Bio.SeqRecord.id`)
     - `seq`: sequence data (comparable to `Bio.SeqRecord.seq`)
4. `--output-data <sqlite_file>` as the counterpart to (3).
as the counterpart to (4).we should also consider what internal interface we want for
metadata
andsequences
, so we can represent metadata records in a way that doesn't couple the records to the storage medium
The work in #854 replaces the internal logic in `augur filter` (which relies on passing around DataFrames) with reliance on intermediate tables that have pre-defined names. With this approach, the record ID column (e.g. `strain`) is used for `JOIN`s to pull info from other intermediate tables and create new tables, similar to what's being done currently.
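For readers unfamiliar with the pattern, a rough sketch of what such a `JOIN`-based step could look like; the table names here are illustrative, not the exact ones used in #854:

```python
import sqlite3

con = sqlite3.connect("augur_filter_intermediate.sqlite3")

# Build a new intermediate table by joining existing ones on the
# record ID column (here, `strain`).
con.execute("""
    CREATE TABLE metadata_filtered AS
    SELECT metadata.*
    FROM metadata
    JOIN passed_filters USING (strain)
""")
con.commit()
con.close()
```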
I'm not familiar with other `augur` subcommands yet, but maybe this approach can be used elsewhere.
I also thought about dataclasses/sqlalchemy in #854 but decided not to pursue the additional complexity. Would be happy to reconsider during review or as a next step. This would also help in supporting non-SQLite databases, since #854 is using raw SQL queries/commands with SQLite-specific syntax.
A couple thoughts, re-reading thru this now. I generally concur with most things above.
+1 for supporting remote URLs (`s3://`, `https://`, etc.), but though the interfaces overlap, I see implementing this as separate/distinct from supporting a SQLite database as input. Remote vs. local is an I/O-layer transport detail (e.g. implementable with `fsspec`), distinct from the format of the input (e.g. CSV vs. SQLite).
+1 for a first step of detecting local SQLite files given to `--metadata`, though I would recommend using magic bytes (e.g. via the libmagic library or otherwise) in the SQLite header rather than file extensions, as these vary (`.sqlite3`, `.db`, nothing, etc.).
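Detection by header is straightforward; a minimal sketch, relying on the 16-byte magic string from the documented SQLite file format:

```python
SQLITE3_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def is_sqlite3_file(path: str) -> bool:
    """Detect a SQLite 3 database by its header rather than its extension."""
    with open(path, "rb") as f:
        return f.read(16) == SQLITE3_MAGIC
```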
I just wrote a comment summarizing the current state of things on `augur filter --engine sqlite`: https://github.com/nextstrain/augur/pull/854#issuecomment-1302820016
Thinking more broadly, it might not be a bad idea to begin work on the database loading command in parallel to refining #854. This allows people to get a feel for how SQL queries work on metadata/sequences. This could look something like:
```sh
augur db load \
  --metadata metadata.tsv \
  --sequences sequences.fasta \
  --to-sqlite3 data.db

sqlite3 data.db "SELECT * FROM metadata WHERE region == 'europe'" > metadata-subset.tsv
```
EDIT: Started on this in #1094.
@tsibley and I just discussed this in our 1:1 chat. Next steps:
### Context

The increasing amount of data has shed light on limitations of the current file-based approach (#789). There have been solutions implemented, mostly in `augur filter`, to improve memory usage and optimize I/O speed. There have also been some discussions of using a database to better handle large datasets, but no clear solution yet. More info in this Nextstrain team google doc.

### Assessment
Comparing a database approach against the current usage of pandas and Bio.SeqIO:

- A single database file could replace the separate `metadata.tsv` and `sequences.fasta`.
- `.tsv` and `.fasta` files would continue to be supported (no regression in current interface).

### Proposed iterative solution
1. Use a database file within `augur filter` (#854)
   - Internal to `augur filter`.
2. Support populating/loading an existing database file
   - Desired usage: `augur filter --metadata` accepts a SQLite database file. If present, it will skip the metadata/sequence loading step.
   - Analogous to `augur index` for sequences.
3. Use `sqlite3` in other Augur commands that read tabular data
   - Reuse the approach from `augur filter`.
4. Pass database files between Augur commands (?)
   - *far out, likely to change*
   - Add `--output-data` to `augur filter`.
   - Add `--data` and/or `--output-data` to other Augur commands.
5. Support remote databases (?)
   - *far out, likely to change*
### Steps with unclear placement in solution
### Relevant past and present work

- #854
- #1094
- #1242