Open tsibley opened 2 months ago
Assigning to @victorlin per planning doc commentary.
The proposed workaround:
augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
cat a.fasta b.fasta > all.fasta
augur filter --metadata c.tsv --sequences all.fasta --output-sequences c.fasta
does not work in the case of an entry existing in both a
and b
. augur merge
will de-duplicate while cat
keeps both entries, so augur filter
would not properly handle (and should error on) the duplicates in all.fasta
.
A proper workaround would be using seqkit rmdup
in place of cat
:
augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
seqkit rmdup b.fasta a.fasta > c.fasta
augur filter --metadata c.tsv --sequences c.fasta --output-metadata merged.tsv --output-sequences merged.fasta
noting that the order of b.fasta a.fasta
is necessary to keep consistent with augur merge
behavior of coalescing right-to-left. The purpose of augur filter
is then to remove any entries present in sequences that are missing from metadata (and vice versa).
The workaround is still not great since the metadata and sequence files are loosely coupled. augur merge
could more tightly couple this by requiring the NAME=FILE
input format. Take these hypothetical user errors and warnings as an example:
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences b=b.fasta a=a.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b.fasta c=c.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b.fasta \
--output-metadata c.tsv
--output-sequences c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.
Hmm. I'm not sure about these example errors. I know the original thinking was to pair metadata and sequences via the names given, but upon further reflection, I'm not sure we should require it.
# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a). # Please update one to match the other, noting that when an entry is in multiple # inputs, only the entry in the last input will be kept.
Why require a matched order? That seems unfriendly to me, like asking the user to do unnecessary tedium for the computer's sake. I'd think to either
a. require named sequence inputs and make the order of --metadata
be the authoritative one regardless of order in --sequences
, or
b. don't require named sequence inputs and make the order of --sequences
be the order to merge sequences, regardless of the order of --metadata
. Names, if given, would allow for stricter warnings/validation of the metadata matching sequences.
Of the two, (b) would be better, I think.
If (a) and we want to support invocations without --metadata
(i.e. --sequences
only), then the order given to --sequences
can matter in that specific case. For (b), this isn't an issue.
# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.
It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata. With strictly named pairs of metadata/sequence files, that wouldn't be possible and a stub metadata file would have to be fabricated for the sequences to avoid this error.
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output. # WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.
These warnings seem good to me in general, but may need tweaking if behaviour (a) or (b) above is chosen instead.
Before debating requirement of a matched order we should first settle on whether to require named sequence inputs.
The only way to have these warnings
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.
is if the different inputs could be paired together (e.g. via named sequence inputs). Without this pairing, there is not much benefit to allowing --metadata
and --sequences
in the same command. I think at that point, it would be more clear if separate commands are used to signify that no pairing/validation is happening:
augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
augur merge --sequences b.fasta a.fasta --output-sequences c.fasta
How about (c) require named sequence inputs when used with metadata? I'll provide an example below.
It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata.
This is reasonable, thanks for providing the example. As an extension: suppose there are datasets A (a.tsv
, a.fasta
) and B (b.tsv
, b.fasta
, b_corrected.fasta
). We could support all in one command:
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a.fasta b.fasta b_corrected.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
but that wouldn't do any validation between metadata/sequences. It might as well be separate commands:
augur merge \
--metadata a=a.tsv b=b.csv \
--output-metadata merged.tsv
augur merge \
--sequences a.fasta b.fasta b_corrected.fasta \
--output-sequences merged.fasta
(c) would allow for this:
# Create single FASTA file for B
augur merge \
--sequences b.fasta b_corrected.fasta \
--output-sequences b_corrected_merged.fasta
# Merge with paired validation
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b_corrected_merged.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b_corrected_merged.fasta is missing from b.csv. It will not be present in any output.
Some specific thoughts below, but the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.
The only way to have these warnings […] is if the different inputs could be paired together (e.g. via named sequence inputs).
Couldn't we have the warnings list all files if not paired by name?
# WARNING: Sequence `XXX` in a.tsv is missing from all sequence files.
# WARNING: Sequence `YYY` in b.fasta is missing from all metadata tables.
Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b.fasta b=b_corrected.fasta cross_input_corrected.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge
.
It might as well be separate commands:
Except that as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.
Couldn't we have the warnings list all files if not paired by name? Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?
Sure, both of those seem reasonable.
Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for
augur merge
.
Yes. That's how augur filter
works currently, and I'm assuming we want to stay consistent:
as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.
If augur merge --sequences a.fasta b.fasta
is a wrapper for seqkit rmdup b.fasta a.fasta
, I don't see any overhead with separate commands. But this could change with implementation.
the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.
👍 good discussion of possibilities so far. Some potential questions for feedback:
a.tsv
is missing from a.fasta
a.tsv
+ b.tsv
is missing from a.fasta
+ b.fasta
In this example, would it be beneficial or confusing if SEQ2
in c.tsv
came from b.tsv
while in c.fasta
it came from a.fasta
?
# a: SEQ1, SEQ2
# b: SEQ2, SEQ3
augur merge \
--metadata a=a.tsv b=b.tsv \
--sequences b=b.fasta a=a.fasta \
--output-metadata c.tsv
--output-sequences c.fasta
Yes. That's how
augur filter
works currently, and I'm assuming we want to stay consistent:
Yeah, I know that's how augur filter
works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out by augur merge
.
If
augur merge --sequences a.fasta b.fasta
is a wrapper forseqkit rmdup b.fasta a.fasta
, I don't see any overhead with separate commands. But this could change with implementation.
Ah, I meant in this example of yours:
# Create single FASTA file for B
augur merge \
--sequences b.fasta b_corrected.fasta \
--output-sequences b_corrected_merged.fasta
# Merge with paired validation
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b_corrected_merged.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
where b_corrected_merged.fasta
is the extra disk space needed.
Yeah, I know that's how
augur filter
works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out byaugur merge
.
That's a good point. My mind was stuck in augur filter
land but it sounds reasonable to keep filtering out of augur merge
. In that case, is there any point in having paired validation between input types? If not, the simplest approach would be to implement metadata and sequence merge support to be mutually exclusive. This would avoid the complexities that arise when allowing them to be used together such as which order to coalesce and the extent of paired validation.
In the future, if paired validation becomes necessary in workflows, we can consider adding support for both metadata+sequences in a single command as an additional feature.
To summarize, there are two different approaches:
(1) is much simpler to implement: the bulk of it is an alias to seqkit rmdup
.
(2) needs work to first define how much cross-checking to do (this was discussed above). Depending on the amount of cross-checking, it may be necessary to:
The prototypes incorporate all of the above and should be functional enough to help decide which approach we want to take, at least initially.
Thinking through this again, we could allow all these scenarios in the same implementation:
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences c.fasta b.fasta a.fasta
# WARNING: Sequence inputs are unnamed. Skipping validation between metadata and sequences.
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences c=c.fasta b=b.fasta a=a.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (c,b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
# Alternatively, use unnamed sequence inputs to skip validation between
# metadata and sequences.
augur merge \
--metadata a=a.tsv b=b.csv c=c.tsv \
--sequences a=a.fasta b=b.fasta c=c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.
I'll gather feedback on this interface in Slack.
augur merge
should support input--sequences
and--output-sequences
.Features:
Prior art:
scripts/sanitize_sequences.py
using a rule that's only called when multiple inputs are involved (paralleling howscripts/combine_metadata.py
is used).See previous discussion in the PR which introduced
augur merge
.Possible solutions
1631
1601