nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Support databases #866

Open · victorlin opened this issue 2 years ago

victorlin commented 2 years ago

Context

The increasing amount of data has exposed limitations of the current file-based approach (#789). Some solutions have been implemented, mostly in augur filter, to improve memory usage and I/O speed. There have also been discussions of using a database to better handle large datasets, but no clear solution yet. More info is in this Nextstrain team Google Doc.

Assessment

Comparing a database approach against the current usage of pandas and Bio.SeqIO:

Proposed iterative solution

1. Use a database file within augur filter (#854)

2. Support populating/loading an existing database file

Desired usage:

augur load \
  --metadata metadata.tsv \
  --output-metadata data.sqlite3
augur filter --metadata data.sqlite3 ...
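
A minimal sketch of what such a load step could do internally, assuming pandas plus the standard-library sqlite3 module (the metadata table name matches the schema discussed below; everything else is illustrative):

import sqlite3

import pandas as pd

def load_metadata(metadata_path, db_path, table="metadata"):
    """Read a metadata TSV with pandas and write it to a SQLite table."""
    metadata = pd.read_csv(metadata_path, sep="\t", dtype=str)
    with sqlite3.connect(db_path) as connection:
        # Replace any existing table so repeated loads are idempotent.
        metadata.to_sql(table, connection, if_exists="replace", index=False)

load_metadata("metadata.tsv", "data.sqlite3")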

3. Use sqlite3 in other Augur commands that read tabular data

4. Pass database files between Augur commands (?)

far out, likely to change

augur filter --data data.sqlite3 ... --output-data data-filtered.sqlite3
...
augur refine --data data-filtered.sqlite3 ...
...

5. Support remote databases (?)

far out, likely to change

augur filter --database "postgresql://user:pass@some.postgresql.endpoint:5432/database" ...

Steps with unclear placement in solution

Relevant past and present work

huddlej commented 2 years ago

Thank you for writing this up, @victorlin! For the CLI, we should consider whether to continue supporting separate --metadata and --sequences arguments vs. adding a --database argument. The pandas read_csv API that gets called behind the scenes to load metadata already supports a nice generic interface where any valid file path or URL can be passed as input. This interface allows us to run augur filter like:

augur filter \
  --metadata s3://nextstrain-data/metadata.tsv.gz \
  --output-metadata filtered_metadata.tsv

augur filter \
  --metadata https://data.nextstrain.org/metadata.tsv.gz \
  --output-metadata filtered_metadata.tsv
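
Behind the scenes, both invocations reduce to the same pandas call; a sketch:

import pandas as pd

# pandas handles the remote I/O and gzip decompression transparently: the
# same call works for local paths, HTTPS URLs, and (with s3fs installed)
# S3 URLs.
metadata = pd.read_csv("https://data.nextstrain.org/metadata.tsv.gz", sep="\t")
metadata = pd.read_csv("s3://nextstrain-data/metadata.tsv.gz", sep="\t")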

This kind of interface could be extended to support databases (and TSVs, etc.) like so:

augur filter \
  --metadata metadata.sqlite3 \
  --output-metadata filtered_metadata.tsv

augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --output-metadata filtered_metadata.tsv
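
One way the dispatch could look, as a sketch; the three reader helpers are hypothetical names, not existing Augur functions:

from urllib.parse import urlparse

def read_metadata(path):
    """Pick a reader based on what the --metadata argument looks like."""
    scheme = urlparse(path).scheme
    if scheme in {"postgresql", "mysql"}:
        return read_from_database_url(path)   # hypothetical helper
    if path.endswith(".sqlite3"):
        return read_from_sqlite_file(path)    # hypothetical helper
    # Local tabular files and https:// or s3:// URLs fall through to the
    # existing pandas-based reader, which handles remote paths itself.
    return read_from_tabular_file(path)       # hypothetical helper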

Obviously the issue is complicated if we expect sequences and metadata to be in the same database to maintain internal consistency. The interface above could become frustrating, confusing, and/or inconsistent if users had to specify database paths multiple times like:

augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --sequences "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --output-metadata filtered_metadata.tsv \
  --output-sequences filtered_sequences.fasta

# Valid but potentially confusing if databases are not consistent.
augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --sequences sequences.sqlite3 \
  --output-metadata filtered_metadata.tsv \
  --output-sequences filtered_sequences.fasta

Whatever interface we choose, we should consider how/whether to abstract the type of input data from how it is being stored. For example, --database could become --data if it represents both sequences and metadata.

@victorlin, @tsibley, and I discussed this a bit previously, but we should also consider what internal interface we want for metadata and sequences, so we can represent metadata records in a way that doesn't couple the records to the storage medium. For example, do we want to always represent metadata records internally as pandas DataFrames (for the functionality that comes with them) even if the data are stored in a database? Or do we want to define our own Metadata or MetadataRecord classes to provide access to records? These don't have to be too complicated; they could be implemented as dataclasses.

We currently pass around pandas DataFrames internally, which strongly couples our internal use of metadata to how they are stored. It is easy to imagine changing all of this code to work with database query sets instead, such that all of our code becomes strongly coupled to the database implementation. We might consider using factories that implement methods to generate Metadata and Sequence records. Thinking about these abstractions now could free us in the future to switch underlying implementations or implement internal workarounds for issues with the APIs of the tools we depend on (e.g., BioPython, pandas, GFF parsers, etc.).
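
For illustration, one possible shape for this, as a sketch; the strain ID column, the metadata table name, and the function names are all assumptions:

import sqlite3
from dataclasses import dataclass

@dataclass
class MetadataRecord:
    id: str
    fields: dict

def records_from_dataframe(frame):
    """Yield records from a pandas DataFrame backend."""
    for row in frame.to_dict(orient="records"):
        yield MetadataRecord(id=row["strain"], fields=row)

def records_from_sqlite(connection):
    """Yield records from a SQLite backend."""
    connection.row_factory = sqlite3.Row
    for row in connection.execute("SELECT * FROM metadata"):
        fields = dict(row)
        yield MetadataRecord(id=fields["strain"], fields=fields)

Callers would iterate over MetadataRecord objects either way, so swapping the storage backend wouldn't touch the filtering logic.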

victorlin commented 2 years ago

Update: after experimenting with loading sequence data into the database (06e56981432be868939b9d40cadb0c529e99e38d), it turns out to be unnecessary overhead for augur filter and will come later. More info in Slack thread.

victorlin commented 2 years ago

@huddlej

I like the idea of expanding --metadata to handle databases. Though we are postponing sequences-in-database for now, this will come up eventually, and your point about having both --metadata and --sequences is important to address. I don't think we should support two separate databases; that would complicate things, and we'd want to JOIN on sequence IDs within a single database.

This sequence of features might work:

  1. Support --metadata <sqlite_file>
    • <sqlite_file> is a SQLite3 database file, detected by filename ending in .sqlite3
    • The database must contain a table metadata, which has rows/columns comparable to currently supported metadata files.
  2. Support --output-metadata <sqlite_file> as the counterpart to (1).
  3. Support loading sequences into database.
  4. Support --data <sqlite_file>
    • <sqlite_file> is the same thing as in (1).
    • The database must contain a table metadata, same thing as in (1).
    • The database must contain a table sequences, which contains two columns (see the schema sketch after this list):
      • id: sequence identifier (comparable to Bio.SeqRecord.id)
      • seq: sequence data (comparable to Bio.SeqRecord.seq)
        • This should support compression of some sort, though we want to be careful about how to support/determine various compression methods.
  5. Support --output-data <sqlite_file> as the counterpart to (4).
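
A purely illustrative sketch of the schema implied by steps 1-4 (column types and the strain ID column are assumptions):

import sqlite3

with sqlite3.connect("data.sqlite3") as connection:
    connection.execute("""
        CREATE TABLE IF NOT EXISTS metadata (
            strain TEXT PRIMARY KEY
            -- remaining columns mirror the input metadata file
        )
    """)
    connection.execute("""
        CREATE TABLE IF NOT EXISTS sequences (
            id  TEXT PRIMARY KEY,  -- comparable to Bio.SeqRecord.id
            seq BLOB               -- sequence data, possibly compressed
        )
    """)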

we should also consider what internal interface we want for metadata and sequences, so we can represent metadata records in a way that doesn't couple the records to the storage medium

The work in #854 replaces the internal logic in augur filter (which relies on passing DataFrames around) with intermediate tables that have pre-defined names:

https://github.com/nextstrain/augur/blob/df83e1ba31cfb8290ff0316da6ab6755f490c661/augur/filter_support/db/sqlite.py#L18-L25

With this approach, the record ID column (e.g. strain) is used for JOINs to pull info from other intermediate tables and create new tables. This is similar to what's being done currently:

https://github.com/nextstrain/augur/blob/b9e0c7908106fda0d48a3e06d01e035f1c328384/augur/filter.py#L1400-L1404
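
Schematically, the pattern looks like this; passed_filters stands in for one of the pre-defined intermediate tables, and the name is hypothetical:

import sqlite3

connection = sqlite3.connect("data.sqlite3")
# Derive a new table by joining on the record ID column.
connection.execute("""
    CREATE TABLE metadata_filtered AS
    SELECT metadata.*
    FROM metadata
    JOIN passed_filters USING (strain)
""")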

I'm not familiar with other augur subcommands yet, but maybe this approach can be used elsewhere.

I also thought about dataclasses/sqlalchemy in #854 but decided not to pursue the additional complexity. Would be happy to reconsider during review or as a next step. This would also help in supporting non-SQLite databases, since #854 is using raw SQL queries/commands with SQLite-specific syntax.

tsibley commented 2 years ago

A couple of thoughts, re-reading through this now. I generally concur with most of the above.

+1 for supporting remote URLs (s3://, https://, etc.). Though the interfaces overlap, I see implementing this as separate and distinct from supporting a SQLite database as input: remote vs. local is an I/O-layer transport detail (e.g., implementable with fsspec), distinct from the format of the input (e.g., CSV vs. SQLite).
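
A sketch of that transport/format split using fsspec:

import fsspec
import pandas as pd

# The parsing code receives an ordinary file-like object regardless of
# whether the bytes come from local disk, HTTPS, or S3.
with fsspec.open("s3://nextstrain-data/metadata.tsv.gz", compression="gzip") as handle:
    metadata = pd.read_csv(handle, sep="\t")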

+1 for a first step of detecting local SQLite files given to --metadata, though I would recommend using magic bytes (e.g., via the libmagic library or otherwise) in the SQLite header rather than file extensions, as these vary (.sqlite3, .db, no extension at all, etc.).
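
The header check itself is tiny; a sketch without libmagic:

def is_sqlite3(path):
    """Check the 16 magic bytes that every SQLite 3 database file starts with."""
    with open(path, "rb") as handle:
        return handle.read(16) == b"SQLite format 3\x00"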

victorlin commented 1 year ago

I just wrote a comment summarizing the current state of things on augur filter --engine sqlite: https://github.com/nextstrain/augur/pull/854#issuecomment-1302820016

Thinking more broadly, it might not be a bad idea to begin work on the database loading command in parallel with refining #854. That would let people get a feel for how SQL queries work on metadata/sequences. This could look something like:

augur db load \
  --metadata metadata.tsv \
  --sequences sequences.fasta \
  --to-sqlite3 data.db

sqlite3 data.db "SELECT * FROM metadata WHERE region == 'europe'" > metadata-subset.tsv

EDIT: Started on this in #1094.

victorlin commented 6 months ago

@tsibley and I just discussed this in our 1:1 chat. Next steps:

  1. Continue #1242
    1. Create separate internal databases for input metadata table and intermediate tables.
    2. Test with large datasets.
  2. Pick back up on CSV↔SQLite subcommand (#1094).
  3. Look into adding sequences (with compression).