poseidon-framework / poseidon-hs

A toolset to work with modular genotype databases in the Poseidon format
https://poseidon-framework.github.io/#/trident
MIT License

Improve the error messages in case of a genotype data parsing failure #224

Open nevrome opened 1 year ago

nevrome commented 1 year ago

At the moment an issue in the genotype data is always reduced to

Issues in genotype data parsing:
SeqFormatException "Error while parsing: not enough input. Error occurred when trying to parse this chunk: \"...\""

That often does not help to identify and solve the underlying issue, because it omits the package and SNP (and possibly the individual) in which the problem occurred. If such an error comes up in a big forge, debugging becomes a search for a needle in a haystack. The short snippet of the relevant chunk in the error message above can also be pointless when the genotype data is in a binary format.

I wonder if there is a way to include additional, crucial information in this error message.

stschiff commented 1 year ago

We can definitely add a size-check for the genotype data, which I'm happy to include in our validation pipeline.

nevrome commented 1 year ago

It has become clear that a size-check is not possible. But independent of that I was hoping for more.

I hope we could get an error message that looks something like this:

Issues in genotype data parsing:
Can not parse SNP A in line B of file C of package D for individual E.
Error occurred when trying to parse this chunk: "this is not the SNP you're searching for"

Is this science fiction with our current implementation?

stschiff commented 1 year ago

OK, so going back to size checks: Since we do have the snpSet (1240K, HumanOrigins, Other) in the YAML file, we should actually be able to give a size-check warning after all, at least in cases where it's either 1240K or HumanOrigins. We can hardcode the expected number of SNPs for these categories and then use the number of individuals to compute an expected byte size of the *.bed or the *.geno files. Of course, I think a mismatch between the expected and the actual file size should not yield a hard error, because technically one could imagine some packages simply dropping SNPs which are uncovered (the schema doesn't forbid this, and our forging technology can handle this explicitly). But at least we can spit out a warning.

I'll work on that.
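To make the arithmetic behind such a size check concrete, here is a minimal sketch in Python. It is not code from poseidon-hs. It assumes a SNP-major PLINK *.bed file (3 header bytes, then 2 bits per genotype, padded to whole bytes per SNP) and an unpacked EIGENSTRAT *.geno text file (one character per individual plus a single newline per SNP). All function names are hypothetical, and the SNP count for a given snpSet is passed in as a parameter rather than hardcoded.

```python
PLINK_BED_HEADER_BYTES = 3  # magic number (2 bytes) + mode byte (1 byte)


def expected_bed_size(n_snps: int, n_individuals: int) -> int:
    """Expected byte size of a SNP-major PLINK .bed file."""
    # 4 genotypes fit into one byte; each SNP block is padded to whole bytes.
    bytes_per_snp = (n_individuals + 3) // 4
    return PLINK_BED_HEADER_BYTES + n_snps * bytes_per_snp


def expected_geno_size(n_snps: int, n_individuals: int) -> int:
    """Expected byte size of an unpacked EIGENSTRAT .geno text file.

    Assumes one character per individual and a one-byte line terminator.
    """
    return n_snps * (n_individuals + 1)


def size_check_warning(path: str, actual_size: int, expected_size: int):
    """Return a warning string on mismatch, or None if the sizes agree."""
    if actual_size == expected_size:
        return None
    return (
        f"Warning: {path} has {actual_size} bytes, "
        f"but {expected_size} bytes were expected for its snpSet"
    )
```

In practice the actual size would come from something like `os.path.getsize(path)`. The function returns a warning string instead of raising, which matches the idea that a mismatch should not be a hard error.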

stschiff commented 1 year ago

> It has become clear that a size-check is not possible. But independent of that I was hoping for more.
>
> I hope we could get an error message that looks something like this:
>
> Issues in genotype data parsing:
> Can not parse SNP A in line B of file C of package D for individual E.
> Error occurred when trying to parse this chunk: "this is not the SNP you're searching for"
>
> Is this science fiction with our current implementation?

I think it's not science fiction. My sequence-formats parsers can provide all that information, it's just a matter of having all the data ready to create that error message, which might involve some refactoring here and there. I'll look into it.
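As a rough illustration of what "having all the data ready" could mean, the following Python sketch wraps a low-level parse failure with the package, file, line and SNP in which it occurred. It is not the sequence-formats or poseidon-hs implementation. The names `GenotypeParseError`, `parse_geno_line` and `parse_geno` are hypothetical, and the input is assumed to be unpacked EIGENSTRAT-style lines with one character (0, 1, 2 or 9) per individual.

```python
class GenotypeParseError(Exception):
    """Parse failure enriched with the context needed to locate it."""

    def __init__(self, package, file, line_no, snp_id, chunk, reason):
        self.package = package
        self.file = file
        self.line_no = line_no
        self.snp_id = snp_id
        self.chunk = chunk
        self.reason = reason
        super().__init__(
            "Issues in genotype data parsing:\n"
            f"Can not parse SNP {snp_id} in line {line_no} "
            f"of file {file} of package {package}: {reason}.\n"
            f"Error occurred when trying to parse this chunk: {chunk!r}"
        )


def parse_geno_line(line, n_individuals):
    """Low-level parser: knows nothing about packages, files or SNP names."""
    if len(line) < n_individuals:
        raise ValueError("not enough input")
    if len(line) > n_individuals:
        raise ValueError("too much input")
    for char in line:
        if char not in "0129":
            raise ValueError(f"unexpected character {char!r}")
    return [int(char) for char in line]


def parse_geno(package, file, geno_lines, snp_ids, n_individuals):
    """Parse all lines, attaching context to any low-level failure."""
    rows = []
    for line_no, (snp_id, line) in enumerate(zip(snp_ids, geno_lines), start=1):
        try:
            rows.append(parse_geno_line(line.rstrip("\n"), n_individuals))
        except ValueError as err:
            # Re-raise with everything the caller knows about the location.
            raise GenotypeParseError(
                package, file, line_no, snp_id, line[:20], str(err)
            ) from err
    return rows
```

The low-level parser stays unaware of its surroundings, and the caller that iterates over SNPs adds the context. The affected individual could be reported in the same way from the column index at which the parser fails.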