rust-bio / rust-bio-tools

A set of command line utilities based on Rust-Bio.
MIT License
182 stars 24 forks source link

Improved error messages for input data #52

Open lczech opened 4 years ago

lczech commented 4 years ago

Hi there, I'd suggest to improve some error messages:

https://github.com/rust-bio/rust-bio-tools/blob/d2a77501aaf058d8c2ae0ee56c2fc352617b4783/src/bcf/to_txt.rs#L205

The error simply states that

there is no flag type for format

and it took me quite a while digging through rust-htslib to figure out that it actually wants to say something like

Format "XX" was requested for inclusion in vcf-to-txt, but that format field does not appear in the input vcf file(s).

where "XX" is some format annotation shortcut for vcf files. The same probably applies for the info fields as well.

I'd provide a PR, but I don't know rust well enough to gauge whether the error message can occur in other circumstances as well, where my improved message would not be fitting.

All the best Lucas

lczech commented 4 years ago

Addendum: For other types of missing fields, the errors get even more confusing. The above happens when calling rbt vcf-to-txt -g --fmt DP without having the format field DP in the vcf file.

However, if instead the AD field is ill formatted, I get a rust kernel panic:

thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/bcf/to_txt.rs:212:51

The behaviour can be reproduced with these files (produced by freebayes), and calling

cat bcftools-view.bcf | rbt vcf-to-txt -g --fmt DP AD --info ANN

using rust-bio-tools =0.6.0, as well as rust-bio-tools =0.10.0

The field is present in the file, so I don't know exactly what is causing the issue here... Would be nice to have a fix for that, because otherwise, I cannot use rbt with freebayes.

By the way, these errors occurred while adapting the dna-seq-gatk-variant-calling workflow to other types of callers (bcftools call, and freebayes), which seem to produce vcf files that confuse rbt. The error occurs in the stats rule for converting vcf to txt.

All the best and thanks in advance Lucas

johanneskoester commented 3 years ago

Thanks a lot for reporting. Indeed, error messages should be improved here. This is now much easier with the thiserror and anyhow crates. I currently don't have the capacity, but it is certainly something we would either be happy to get a PR on, or that we will start looking at as soon as possible.

essut commented 2 years ago

Hi @lczech,

I tried to reproduce your error message (there is no flag type for format) by using the test file in tests/test.vcf and running the command rbt vcf-to-txt -g --fmt DP. However, rbt now prints the missing DP tag as empty instead of spitting out the error you described. Can you test on your file that emits this error and see if the fix helps in debugging or not? Thank you.

essut commented 2 years ago

I can reproduce the error by supplying a INFO tag to --fmt, e.g. rbt vcf-to-txt --fmt T < tests/test.vcf where tag "T" is in the INFO field. As @lczech noted, the error message might not be specific to these instances.

I made a pull request (#243) to better inform the users of the error that occurred (at least now the offending tag is printed out instead of a generic message) and suggested the users to check their inputs. Hopefully @johanneskoester can chime in and give their view on this issue.

essut commented 2 years ago

On the error about the "AD" tag, I just found out the reason why rust-bio-tools panicked.

The "AD" tag is defined as ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">. Number=R informs the parsing library (rust-htslib) to expect two values in "AD".

However, the samples with missing values in vcf files generated from freebayes are encoded with .:.:.:.:.:.:.:. instead of .:.,.:.:.:.:.:.:.. The error thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/bcf/to_txt.rs:212:51 is caused by rbt trying to access this non-existent second value.

Here is the file successfully parsed after changing .:.:.:.:.:.:.:. to .:.,.:.:.:.:.:.:.:

VARIANT VARIANT VARIANT VARIANT VARIANT A       B
CHROM   POS     REF     ALT     QUAL    AD      AD
21      5038037 T       C       0.44    1       2
21      5038042 A       C       13.59   1       2
21      5082586 G       A       0.25    -2147483648     2
21      5118139 T       C       0.99    -2147483648     2
21      5119448 A       C       5.96    1       2
...

Maybe if the tag length is Alleles and an out of bounds error occurred, we can assume this is a missing value? Or maybe the upstream rust-htslib can be made to interpret a single missing value to represent the missing values for the available alleles in cases with Number=R?

As a side note, I noticed that the value for the "AD" tag is filled with the last allele value in each sample. Looking at the code, assume that this is intended? Also, the missing value is encoded as -2147483648, but I thought it should be encoded as (a blank space). Is this intended as well?