merge: Support sequences

nextstrain / augur

Pipeline components for real-time phylodynamic analysis

https://docs.nextstrain.org/projects/augur/

GNU Affero General Public License v3.0

268 stars 129 forks source link

merge: Support sequences #1579

Open tsibley opened 2 months ago

tsibley commented 2 months ago

augur merge should support input --sequences and --output-sequences.

Features:

Sequences with same id are coalesced right-to-left, as with metadata rows.
Duplication handling within FASTA file? Compare to duplication handling within metadata file (i.e. no dups allowed).

Prior art:

ncov deduplicates sequences via scripts/sanitize_sequences.py using a rule that's only called when multiple inputs are involved (paralleling how scripts/combine_metadata.py is used).

See previous discussion in the PR which introduced augur merge.

Possible solutions

1631
1601

tsibley commented 2 months ago

Assigning to @victorlin per planning doc commentary.

victorlin commented 1 month ago

The proposed workaround:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
cat a.fasta b.fasta > all.fasta
augur filter --metadata c.tsv --sequences all.fasta --output-sequences c.fasta

does not work in the case of an entry existing in both a and b. augur merge will de-duplicate while cat keeps both entries, so augur filter would not properly handle (and should error on) the duplicates in all.fasta.

A proper workaround would be using seqkit rmdup in place of cat:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
seqkit rmdup b.fasta a.fasta > c.fasta
augur filter --metadata c.tsv --sequences c.fasta --output-metadata merged.tsv --output-sequences merged.fasta

noting that the order of b.fasta a.fasta is necessary to keep consistent with augur merge behavior of coalescing right-to-left. The purpose of augur filter is then to remove any entries present in sequences that are missing from metadata (and vice versa).

The workaround is still not great since the metadata and sequence files are loosely coupled. augur merge could more tightly couple this by requiring the NAME=FILE input format. Take these hypothetical user errors and warnings as an example:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences b=b.fasta a=a.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta c=c.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta \
  --output-metadata c.tsv
  --output-sequences c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

tsibley commented 1 month ago

Hmm. I'm not sure about these example errors. I know the original thinking was to pair metadata and sequences via the names given, but upon further reflection, I'm not sure we should require it.

# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

Why require a matched order? That seems unfriendly to me, like asking the user to do unnecessary tedium for the computer's sake. I'd think to either

a. require named sequence inputs and make the order of --metadata be the authoritative one regardless of order in --sequences, or b. don't require named sequence inputs and make the order of --sequences be the order to merge sequences, regardless of the order of --metadata. Names, if given, would allow for stricter warnings/validation of the metadata matching sequences.

Of the two, (b) would be better, I think.

If (a) and we want to support invocations without --metadata (i.e. --sequences only), then the order given to --sequences can matter in that specific case. For (b), this isn't an issue.

# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata. With strictly named pairs of metadata/sequence files, that wouldn't be possible and a stub metadata file would have to be fabricated for the sequences to avoid this error.

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

These warnings seem good to me in general, but may need tweaking if behaviour (a) or (b) above is chosen instead.

victorlin commented 1 month ago

Before debating requirement of a matched order we should first settle on whether to require named sequence inputs.

The only way to have these warnings

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

is if the different inputs could be paired together (e.g. via named sequence inputs). Without this pairing, there is not much benefit to allowing --metadata and --sequences in the same command. I think at that point, it would be more clear if separate commands are used to signify that no pairing/validation is happening:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
augur merge --sequences b.fasta a.fasta --output-sequences c.fasta

How about (c) require named sequence inputs when used with metadata? I'll provide an example below.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata.

This is reasonable, thanks for providing the example. As an extension: suppose there are datasets A (a.tsv, a.fasta) and B (b.tsv, b.fasta, b_corrected.fasta). We could support all in one command:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

but that wouldn't do any validation between metadata/sequences. It might as well be separate commands:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --output-metadata merged.tsv
augur merge \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-sequences merged.fasta

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b_corrected_merged.fasta is missing from b.csv. It will not be present in any output.

tsibley commented 1 month ago

Some specific thoughts below, but the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

The only way to have these warnings […] is if the different inputs could be paired together (e.g. via named sequence inputs).

Couldn't we have the warnings list all files if not paired by name?

# WARNING: Sequence `XXX` in a.tsv is missing from all sequence files.
# WARNING: Sequence `YYY` in b.fasta is missing from all metadata tables.

Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta b=b_corrected.fasta cross_input_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

It might as well be separate commands:

Except that as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

victorlin commented 1 month ago

Couldn't we have the warnings list all files if not paired by name? Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

Sure, both of those seem reasonable.

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

https://github.com/nextstrain/augur/blob/f1d65fb733a16e7acffef51676f9571bd8e83e16/tests/functional/filter/cram/subsample-max-sequences-with-probabilistic-sampling-warning.t#L19-L25

as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

👍 good discussion of possibilities so far. Some potential questions for feedback:

Is it beneficial to validate that an entry is present in both files of an input pair? Example: warn if an entry in a.tsv is missing from a.fasta
Is it beneficial to aggregate input types and validate that an entry is present in both types? Example: warn if an entry in a.tsv + b.tsv is missing from a.fasta + b.fasta

In this example, would it be beneficial or confusing if SEQ2 in c.tsv came from b.tsv while in c.fasta it came from a.fasta?

# a: SEQ1, SEQ2
# b:       SEQ2, SEQ3

augur merge \
  --metadata a=a.tsv b=b.tsv \
  --sequences b=b.fasta a=a.fasta \
  --output-metadata c.tsv
  --output-sequences c.fasta

tsibley commented 1 month ago

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

Yeah, I know that's how augur filter works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

Ah, I meant in this example of yours:

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

where b_corrected_merged.fasta is the extra disk space needed.

victorlin commented 1 month ago

Yeah, I know that's how augur filter works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

That's a good point. My mind was stuck in augur filter land but it sounds reasonable to keep filtering out of augur merge. In that case, is there any point in having paired validation between input types? If not, the simplest approach would be to implement metadata and sequence merge support to be mutually exclusive. This would avoid the complexities that arise when allowing them to be used together such as which order to coalesce and the extent of paired validation.

In the future, if paired validation becomes necessary in workflows, we can consider adding support for both metadata+sequences in a single command as an additional feature.

victorlin commented 1 month ago

To summarize, there are two different approaches:

1631
1601

(1) is much simpler to implement: the bulk of it is an alias to seqkit rmdup.

(2) needs work to first define how much cross-checking to do (this was discussed above). Depending on the amount of cross-checking, it may be necessary to:

split the metadata merging logic into database import / merge output
read the sequence IDs into a database table
add validation across metadata and sequence tables

The prototypes incorporate all of the above and should be functional enough to help decide which approach we want to take, at least initially.

victorlin commented 1 month ago

Thinking through this again, we could allow all these scenarios in the same implementation:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c.fasta b.fasta a.fasta
# WARNING: Sequence inputs are unnamed. Skipping validation between metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c=c.fasta b=b.fasta a=a.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (c,b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
# Alternatively, use unnamed sequence inputs to skip validation between
# metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv c=c.tsv \
  --sequences a=a.fasta b=b.fasta c=c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

I'll gather feedback on this interface in Slack.

nextstrain / augur

merge: Support sequences #1579

Possible solutions

1631

1601

1631

1601