harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Interval creation errors #54

Closed tsackton closed 2 years ago

tsackton commented 2 years ago

In the interval creation code, we have a sanity check that ensures that a given interval set is the expected length. That is, we check that intervals in the interval list cover 100% of the bases in the list of ATGCmers from the Picard output. If this does not happen, we fail with an error and exit the pipeline.

However, we give no obvious clue as to what to do if this check does fail. Probably the correct answer is that if the number of missing base pairs is small, this doesn't really matter. In the one case where I have observed this, it arose because a scaffold starts with a T followed by a stretch of ~5000 Ns. That T is probably an error, to be honest, and in any case it is not going to matter at all if no SNPs are called there.
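A minimal sketch of what a more tolerant check might look like: warn when only a handful of bases are missing, and fail only above a threshold. The function name `check_interval_coverage`, its parameters, and the default tolerance of 10 bp are all hypothetical and are not taken from the snpArcher code.

```python
import warnings


def check_interval_coverage(expected_bp, covered_bp, max_missing_bp=10):
    """Compare the bases covered by the interval list to the expected total.

    Hypothetical helper: names and the default tolerance are assumptions.
    Returns the number of missing bases if it is within tolerance.
    """
    missing = expected_bp - covered_bp
    if missing < 0:
        # Covering more bases than exist points to overlapping intervals.
        raise ValueError(
            f"Intervals cover {-missing} more bp than expected; "
            "check for overlapping intervals."
        )
    if missing > max_missing_bp:
        raise ValueError(
            f"Intervals are missing {missing} bp (tolerance {max_missing_bp} bp)."
        )
    if missing > 0:
        # Small gaps (e.g. a stray base next to a long run of Ns) only warn.
        warnings.warn(
            f"Intervals are missing {missing} bp; continuing because this is "
            f"within the tolerance of {max_missing_bp} bp."
        )
    return missing
```

With something like this, the single stray T described above would produce a warning that names the number of missing bases rather than stopping the pipeline.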

Relatedly, the interval creation code probably shouldn't fail if the max BP per interval value is too low; it should just fall back to the smallest value that is usable.
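One way the fallback could work, as a rough sketch. It assumes that individual ATGCmers are never split across intervals, so the smallest usable maximum is the length of the longest ATGCmer; that assumption, the function name `usable_max_bp`, and its parameters are hypothetical and not taken from the existing code.

```python
import warnings


def usable_max_bp(requested_max_bp, segment_lengths):
    """Return a max-bp-per-interval value that can actually be satisfied.

    Assumption (not from the pipeline): each segment must fit whole inside
    one interval, so the limit cannot be below the longest segment length.
    """
    if not segment_lengths:
        raise ValueError("No segments supplied; cannot build intervals.")
    smallest_usable = max(segment_lengths)
    if requested_max_bp < smallest_usable:
        # Raise the limit instead of failing, and tell the user what changed.
        warnings.warn(
            f"Requested max bp per interval ({requested_max_bp}) is too low; "
            f"using {smallest_usable} instead."
        )
        return smallest_usable
    return requested_max_bp
```

For example, `usable_max_bp(100, [50, 200, 30])` returns 200 with a warning, while `usable_max_bp(500, [50, 200, 30])` leaves the requested value unchanged.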

Ultimately we should probably rewrite most or all of this code so that the pipeline 'just works' and figures out the best set of intervals it can create. Posting this issue to track that task, although it is not high priority at the moment.

tsackton commented 2 years ago

Should be closed by #61