nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Support databases #866

Open · victorlin opened this issue 2 years ago

victorlin commented 2 years ago

Context

The increasing amount of data has exposed limitations of the current file-based approach (#789). Some solutions have been implemented, mostly in augur filter, to improve memory usage and I/O speed. There have also been discussions of using a database to better handle large datasets, but no clear solution yet. More info is in this Nextstrain team Google Doc.

Assessment

Comparing a database approach against the current usage of pandas and Bio.SeqIO:

Proposed iterative solution

1. Use a database file within augur filter (#854)

2. Support populating/loading an existing database file

Desired usage:

augur load \
  --metadata metadata.tsv \
  --output-metadata data.sqlite3
augur filter --metadata data.sqlite3 ...
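
A minimal sketch of what such a load step could do internally, assuming pandas plus the standard-library sqlite3 module (the metadata table name matches the schema discussed below; everything else is illustrative):

import sqlite3

import pandas as pd

def load_metadata(metadata_path, db_path, table="metadata"):
    """Read a metadata TSV with pandas and write it to a SQLite table."""
    metadata = pd.read_csv(metadata_path, sep="\t", dtype=str)
    with sqlite3.connect(db_path) as connection:
        # Replace any existing table so repeated loads are idempotent.
        metadata.to_sql(table, connection, if_exists="replace", index=False)

load_metadata("metadata.tsv", "data.sqlite3")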

3. Use sqlite3 in other Augur commands that read tabular data

4. Pass database files between Augur commands (?)

far out, likely to change

augur filter --data data.sqlite3 ... --output-data data-filtered.sqlite3
...
augur refine --data data-filtered.sqlite3 ...
...

5. Support remote databases (?)

far out, likely to change

augur filter --database "postgresql://user:pass@some.postgresql.endpoint:5432/database" ...

Steps with unclear placement in solution

Relevant past and present work

huddlej commented 2 years ago

Thank you for writing this up, @victorlin! For the CLI, we should consider whether to continue supporting separate --metadata and --sequences arguments vs. adding a --database argument. The pandas read_csv API that gets called behind the scenes to load metadata already supports a nice generic interface where any valid file path or URL can be passed as input. This interface allows us to run augur filter like:

augur filter \
  --metadata s3://nextstrain-data/metadata.tsv.gz \
  --output-metadata filtered_metadata.tsv

augur filter \
  --metadata https://data.nextstrain.org/metadata.tsv.gz \
  --output-metadata filtered_metadata.tsv
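
Behind the scenes, both invocations reduce to the same pandas call; a sketch:

import pandas as pd

# pandas handles the remote I/O and gzip decompression transparently: the
# same call works for local paths, HTTPS URLs, and (with s3fs installed)
# S3 URLs.
metadata = pd.read_csv("https://data.nextstrain.org/metadata.tsv.gz", sep="\t")
metadata = pd.read_csv("s3://nextstrain-data/metadata.tsv.gz", sep="\t")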

This kind of interface could be extended to support databases (and TSVs, etc.) like so:

augur filter \
  --metadata metadata.sqlite3 \
  --output-metadata filtered_metadata.tsv

augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --output-metadata filtered_metadata.tsv
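
One way the dispatch could look, as a sketch; the three reader helpers are hypothetical names, not existing Augur functions:

from urllib.parse import urlparse

def read_metadata(path):
    """Pick a reader based on what the --metadata argument looks like."""
    scheme = urlparse(path).scheme
    if scheme in {"postgresql", "mysql"}:
        return read_from_database_url(path)   # hypothetical helper
    if path.endswith(".sqlite3"):
        return read_from_sqlite_file(path)    # hypothetical helper
    # Local tabular files and https:// or s3:// URLs fall through to the
    # existing pandas-based reader, which handles remote paths itself.
    return read_from_tabular_file(path)       # hypothetical helper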

Obviously the issue is complicated if we expect sequences and metadata to be in the same database to maintain internal consistency. The interface above could become frustrating, confusing, and/or inconsistent if users had to specify database paths multiple times like:

augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --sequences "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --output-metadata filtered_metadata.tsv \
  --output-sequences filtered_sequences.fasta

# Valid but potentially confusing if databases are not consistent.
augur filter \
  --metadata "postgresql://user:pass@some.postgresql.endpoint:5432/database" \
  --sequences sequences.sqlite3 \
  --output-metadata filtered_metadata.tsv \
  --output-sequences filtered_sequences.fasta

Whatever interface we choose, we should consider how/whether to abstract the type of input data from how it is being stored. For example, --database could become --data if it represents both sequences and metadata.

@victorlin, @tsibley, and I discussed this a bit previously, but we should also consider what internal interface we want for metadata and sequences, so we can represent metadata records in a way that doesn't couple the records to the storage medium. For example, do we want to always represent metadata records internally as pandas DataFrames (for the functionality that comes with them) even if the data are stored in a database? Or do we want to define our own Metadata or MetadataRecord classes to provide access to records? These don't have to be too complicated; they could be implemented as dataclasses.

We currently pass around pandas DataFrames internally, which strongly couples our internal use of metadata to how they are stored. It is easy to imagine changing all of this code to work with database query sets instead, such that all of our code becomes strongly coupled to the database implementation. We might consider using factories that implement methods to generate Metadata and Sequence records. Thinking about these abstractions now could free us in the future to switch underlying implementations or implement internal workarounds for issues with the APIs of the tools we depend on (e.g., BioPython, pandas, GFF parsers, etc.).
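
For illustration, one possible shape for this, as a sketch; the strain ID column, the metadata table name, and the function names are all assumptions:

import sqlite3
from dataclasses import dataclass

@dataclass
class MetadataRecord:
    id: str
    fields: dict

def records_from_dataframe(frame):
    """Yield records from a pandas DataFrame backend."""
    for row in frame.to_dict(orient="records"):
        yield MetadataRecord(id=row["strain"], fields=row)

def records_from_sqlite(connection):
    """Yield records from a SQLite backend."""
    connection.row_factory = sqlite3.Row
    for row in connection.execute("SELECT * FROM metadata"):
        fields = dict(row)
        yield MetadataRecord(id=fields["strain"], fields=fields)

Callers would iterate over MetadataRecord objects either way, so swapping the storage backend wouldn't touch the filtering logic.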

victorlin commented 2 years ago

Update: after experimenting with loading sequence data into the database (06e56981432be868939b9d40cadb0c529e99e38d), it turns out to be unnecessary overhead for augur filter and will come later. More info in Slack thread.

victorlin commented 2 years ago

@huddlej

I like the idea of expanding --metadata to handle databases. Though we are postponing sequences-in-database for now, this will come up eventually, and your point about having both --metadata and --sequences is important to address. I don't think we should support two separate databases; that would complicate things, and we'd want to JOIN on sequence IDs within a single database.

This sequence of features might work:

  1. Support --metadata <sqlite_file>
    • <sqlite_file> is a SQLite3 database file, detected by filename ending in .sqlite3
    • The database must contain a table metadata, which has rows/columns comparable to currently supported metadata files.
  2. Support --output-metadata <sqlite_file> as the counterpart to (1).
  3. Support loading sequences into database.
  4. Support --data <sqlite_file>
    • <sqlite_file> is the same thing as in (1).
    • The database must contain a table metadata, same thing as in (1).
    • The database must contain a table sequences, which contains two columns (see the schema sketch after this list):
      • id: sequence identifier (comparable to Bio.SeqRecord.id)
      • seq: sequence data (comparable to Bio.SeqRecord.seq)
        • This should support compression of some sort, though we want to be careful about how to support/determine various compression methods.
  5. Support --output-data <sqlite_file> as the counterpart to (4).
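
A purely illustrative sketch of the schema implied by steps 1-4 (column types and the strain ID column are assumptions):

import sqlite3

with sqlite3.connect("data.sqlite3") as connection:
    connection.execute("""
        CREATE TABLE IF NOT EXISTS metadata (
            strain TEXT PRIMARY KEY
            -- remaining columns mirror the input metadata file
        )
    """)
    connection.execute("""
        CREATE TABLE IF NOT EXISTS sequences (
            id  TEXT PRIMARY KEY,  -- comparable to Bio.SeqRecord.id
            seq BLOB               -- sequence data, possibly compressed
        )
    """)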

we should also consider what internal interface we want for metadata and sequences, so we can represent metadata records in a way that doesn't couple the records to the storage medium

The work in #854 replaces the internal logic in augur filter (which relies on passing DataFrames around) with intermediate tables that have pre-defined names:

https://github.com/nextstrain/augur/blob/df83e1ba31cfb8290ff0316da6ab6755f490c661/augur/filter_support/db/sqlite.py#L18-L25

With this approach, the record ID column (e.g. strain) is used for JOINs to pull info from other intermediate tables and create new tables. This is similar to what's being done currently:

https://github.com/nextstrain/augur/blob/b9e0c7908106fda0d48a3e06d01e035f1c328384/augur/filter.py#L1400-L1404
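
Schematically, the pattern looks like this; passed_filters stands in for one of the pre-defined intermediate tables, and the name is hypothetical:

import sqlite3

connection = sqlite3.connect("data.sqlite3")
# Derive a new table by joining on the record ID column.
connection.execute("""
    CREATE TABLE metadata_filtered AS
    SELECT metadata.*
    FROM metadata
    JOIN passed_filters USING (strain)
""")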

I'm not familiar with other augur subcommands yet, but maybe this approach can be used elsewhere.

I also thought about dataclasses/sqlalchemy in #854 but decided not to pursue the additional complexity. Would be happy to reconsider during review or as a next step. This would also help in supporting non-SQLite databases, since #854 is using raw SQL queries/commands with SQLite-specific syntax.

tsibley commented 2 years ago

A couple of thoughts, re-reading through this now. I generally concur with most of the above.

+1 for supporting remote URLs (s3://, https://, etc.). Though the interfaces overlap, I see implementing this as separate and distinct from supporting a SQLite database as input: remote vs. local is an I/O-layer transport detail (e.g., implementable with fsspec), distinct from the format of the input (e.g., CSV vs. SQLite).
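
A sketch of that transport/format split using fsspec:

import fsspec
import pandas as pd

# The parsing code receives an ordinary file-like object regardless of
# whether the bytes come from local disk, HTTPS, or S3.
with fsspec.open("s3://nextstrain-data/metadata.tsv.gz", compression="gzip") as handle:
    metadata = pd.read_csv(handle, sep="\t")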

+1 for a first step of detecting local SQLite files given to --metadata, though I would recommend using magic bytes (e.g., via the libmagic library or otherwise) in the SQLite header rather than file extensions, as these vary (.sqlite3, .db, no extension at all, etc.).
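
The header check itself is tiny; a sketch without libmagic:

def is_sqlite3(path):
    """Check the 16 magic bytes that every SQLite 3 database file starts with."""
    with open(path, "rb") as handle:
        return handle.read(16) == b"SQLite format 3\x00"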

victorlin commented 1 year ago

I just wrote a comment summarizing the current state of things on augur filter --engine sqlite: https://github.com/nextstrain/augur/pull/854#issuecomment-1302820016

Thinking more broadly, it might not be a bad idea to begin work on the database loading command in parallel with refining #854. That would let people get a feel for how SQL queries work on metadata/sequences. This could look something like:

augur db load \
  --metadata metadata.tsv \
  --sequences sequences.fasta \
  --to-sqlite3 data.db

sqlite3 data.db "SELECT * FROM metadata WHERE region == 'europe'" > metadata-subset.tsv

EDIT: Started on this in #1094.

victorlin commented 6 months ago

@tsibley and I just discussed this in our 1:1 chat. Next steps:

  1. Continue #1242
    1. Create separate internal databases for input metadata table and intermediate tables.
    2. Test with large datasets.
  2. Pick back up on CSV↔SQLite subcommand (#1094).
  3. Look into adding sequences (with compression).