nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License
214 stars 58 forks source link

nextclade run --output-columns-selection throws error for seqName and includes index even though I don't want index #1459

Closed AngieHinrichs closed 3 months ago

AngieHinrichs commented 3 months ago

I expected that --output-columns-selection would allow me to specify the exact columns that I want (no index, just seqName and some others), but when I tried this option

                --output-columns-selection seqName,clade,totalSubstitutions,totalDeletions,totalInsertions,totalMissing,totalNonACGTNs,substitutions,deletions,insertions,aaSubstitutions,aaDeletions,aaInsertions,missing,unknownAaRanges,nonACGTNs

nextclade run threw this error:

Error: 
   0: Output columns selection: unknown column or category name 'seqName'.

It listed possible columns starting with clade. I think it should include index and seqName in the list of possible columns.

When I removed seqName from the option, there was no error but I got both index and seqName in the output TSV. seqName is good but I don't want index.

ivan-aksamentov commented 3 months ago

I don't remember why I decided to force include them, might need to reconsider.

Seq names are not guaranteed to be unique, so it's impossible to exactly map inputs to outputs (e.g. input fasta and TSV) or outputs to outputs (e.g. output fasta and TSV) just by name alone. That's the point for having index. Without it the output contains an incomplete information.

But I guess it makes sense that you should be able to opt-out from any column, including index and name. For example, if you deduplicated inputs in advance, then the names can be treated as unique identifiers and you won't need an index.

AngieHinrichs commented 3 months ago

It would usually be a mistake to omit seqName, and I see why index would be useful given duplicated sequence names but I try to avoid those by using accessions/IDs. Sometimes it's nice to look out for the user asking for something that doesn't make sense, but in this case I would prefer a one-to-one mapping between the columns I ask for and the columns I get. 🙂

AngieHinrichs commented 3 months ago

Another side effect of forcing index and seqName is that it makes the output different in a subtle way for sequences that can't be aligned. When --output-columns-selection is not given, sequences that can't be aligned have the empty string in all columns except for warnings, so I can check for empty string in the seqName column to discard those lines of output.

However, when using --output-columns-selection, index and seqName have real values even if all other selected columns have the empty string. Then in order to know which lines to discard, I need to check for empty string in some column that should always have an integer, because empty string is a valid value for some columns.

ivan-aksamentov commented 3 months ago

When --output-columns-selection is not given, sequences that can't be aligned have the empty string in all columns except for warnings, so I can check for empty string in the seqName column to discard those lines of output [...] when using --output-columns-selection, index and seqName have real values even if all other selected columns have the empty string

Wow! That's definitely a bug! You should not rely on empty seqName cell. If seqName column is present (currently it always is, but we agreed above that we will change that), then it should always contain sequence name. And --output-columns-selection should only affect presence of the column (again, it does not right now, but I'll change that), not its contents.

The idea is that Nextclade always outputs all the data it has for a particular sample, no matter at which point there's an error in the analysis pipeline. When alignment fails there is nothing much to output - there's an index, seqName and error, so that what's the row supposed to contain.

I think if you are trying to filter out sequences that failed for one reason or the other, then there should be another mechanism. I thought that populated "errors" column should be that. Can you try and see if it works for you? If not, we can discuss addition of another mechanism (please open another issue if you need this mechanism). But I insist that this is not related to the --output-columns-selection in any way and accidentally missing seqName is not a good indicator.

AngieHinrichs commented 3 months ago

Well, I tried to make a little test case for the missing-seqName behavior but now I am unable to reproduce it, so... false alarm! Sorry about that.

I see your point about using the errors column to filter out results -- but it is somewhat verbose (When processing sequence #0 'U05330.1': When calculating seed matches: Unable to align: seed alignment covers 4.02% of the query sequence, which is less than expected 5.00% (configurable using 'min seed cover' CLI flag or dataset property). This is likely due to low quality of the provided sequence, or due to using incorrect reference sequence.)and I was hoping to make a very small version of the output using --output-columns-selection. I guess with a little awk or grep it could still be very small. :)

ivan-aksamentov commented 3 months ago

@AngieHinrichs

I see your point about using the errors column to filter out results -- but it is somewhat verbose. [..] I guess with a little awk or grep it could still be very small. :)

I don't think you should grep or awk that! Just check that the error cell is not empty. If there's something there, then it's a sign that this entire row is a failure and needs to be discarded. There will never be anything useful in this row - only index, seqName and error message - because that's all we have. The message there is only suited for human consumption. So, depending on your use-case, if there's a human looking at your derived result, then you could gather and display the errors to them, but we cannot support stability of these messages, so they are not to be parsed or relied on.

For non-fatal errors we have "warnings" column. If there were non-fatal errors, but there's still useful information e.g. alignment succeeded and we could call nuc mutations, but all translations failed, then "errors" cell will be empty, but "warnings" will contain something. Again, messages are only for human consumption. These have no stability guarantees.

We cannot omit entire errored rows, or just indices or seqNames, like you suggested because:

Let me know if checking for empty "error" cells works for you, or if you require another solution. In which case, ideally, please propose this solution or explain your use-case in more details.

AngieHinrichs commented 3 months ago

It's fine and I certainly don't want to break anybody else's use case.

I do want to circle back to the original request: treating --output-columns-selection as an exact specification of columns in output, so if I pass in --output-columns-selection seqName,clade then there is no error unknown column or category name 'seqName' and I get only those columns of output, not index.

ivan-aksamentov commented 3 months ago

Yes, I will change this thing soon(ish). Just need to finish something else first.

(If you have capacity - feel free to PR)

ivan-aksamentov commented 3 months ago

Implemented in https://github.com/nextstrain/nextclade/pull/1472.

ivan-aksamentov commented 3 months ago

Released in 3.7.0

AngieHinrichs commented 3 months ago

Thanks Ivan!