hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0
20 stars 4 forks source link

Warn on vcf errors #222

Closed jburos closed 7 years ago

jburos commented 7 years ago

In one of our cohorts, we have a record with an empty snpeff file. In this case, the VCF is empty because there are no variants identified, however it could happen for any reason.

Currently, there is an error reading in this file. For now, I would like the option of warning on that load so that the other VCFs can be read in.

Current limitations of this approach are:

  1. Once results are cached, warnings no longer printed. Ideally we should skip caching results for these patients if a warning is encountered.
  2. The warn_on_failure should really be an option, configurable at the level of the cohort & function call.

Leaving this PR as [WIP] until these two issues are addressed.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-2.5%) to 54.451% when pulling 6d2b47fd4e32fff9ee44e566adc2dc41b29f323d on warn-on-vcf-errors into 31545c4b03d9a1edd9cc543ac1f863a140e88cd8 on master.

armish commented 7 years ago

@jburos: thanks a lot for this workaround!

Thinking about this a bit more: since this is a data issue (that happened to be introduced due to a snpeff bug) and since we don't normally expect this to happen; maybe it is easier to not try to handle this case at all within cohorts. I know that we are currently trying to get things going for RCC, but I think it is fair for cohorts to fail fast and hard on misformatted input files.

How about we manually fix this file for now (that is, make it a proper VCF but with no variants in it) and I will address this issue in biokepi so that this bug never manifests itself on the cohorts level?

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.02%) to 56.968% when pulling 7048d44f23163ba5c39388f1c45b29c1521be259 on warn-on-vcf-errors into 31545c4b03d9a1edd9cc543ac1f863a140e88cd8 on master.

jburos commented 7 years ago

@armish that's great -- if you can fix it on the biokepi level then I can avoid using this branch. Mainly I wanted to get things working quickly & painlessly :)

Having the option to skip over & report on bad imports might be a useful feature, but not a high priority. This work-around is more of a hack & isn't robust enough to warrant being part of cohorts master just yet.

armish commented 7 years ago

All right - after an hour of digging into VCF standards and the commonly used parsing libraries looks like I was too optimistic about having a proper empty VCF. Apparently, none of the parsers expect to read in an empty VCF and they all throw StopIteration once they are at the end of the file with no variants in the memory.

Do you think it makes sense to just drop that file from the metadata since we know that it doesn't have any information in it or would it break some part of the cohorts?

armish commented 7 years ago

Also: heads up that I rebased your branches since my GCloud experimentation somehow sneaked into the other ones and was making tests fail on your branches. Don't forget to git pull -X theirs to accept those changes from the remote :spiral_notepad:

jburos commented 7 years ago

Thanks @armish --

  1. I've been using your gcloud utilities in my branch of the rcc project, so personally I've kept them in this branch just cause I didn't expect to merge it into master (and i need one "complete" branch to reference for that project). Thinking now I will create a temporary "rcc-analysis" branch just to keep those merges separate from the ones intended for master?

  2. I'm OK with either removing that file from the index, or catching & reporting on that error. Either way. If this is the output from snpeff, and if it failed, then maybe the failure should be handled by biokepi? otherwise if this is indeed the "empty snp file" then it seems reasonable to catch only that StopIteration error & return a value of 0 for number of variants.

Either way, welcome your feedback / thoughts on how to best address both of the above!

jburos commented 7 years ago

Ah, crap. messed it up when I pushed after running git pull -X theirs. Didn't mean to do that! I'll fix.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-3.7%) to 53.291% when pulling d21a082fe4e54e1cfa7bfde26eed3feeee50cc33 on warn-on-vcf-errors into 31545c4b03d9a1edd9cc543ac1f863a140e88cd8 on master.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-1.3%) to 55.688% when pulling 620c9e9f6ec76cbcfaf2a05dd4030b97823b266d on warn-on-vcf-errors into 31545c4b03d9a1edd9cc543ac1f863a140e88cd8 on master.

armish commented 7 years ago

Sorry for the hassle that rebase caused you - didn't realize you were using the gcloud part, apologies. Looks like you were already reverted it back but I have the original state of the repo in case that didn't solve the basing issue.

Re: snpeff: you are right - if the snpeff run had failed (and existed with a non-zero status), the pipeline would have stopped there. So this looks like successful run without any variants and I think your suggestion to catch the StopIteration and return an empty list would the best solution here.

Let me know if there are parts I can take over to give you more time with the clinical part.

jburos commented 7 years ago

No problem - thanks @armish for merging your gcloud tools into master. This will make things easier for me. I will rebase again, :) thus incorporating those edits into this working branch.

Otherwise, what do you think we should do with this branch-in-progress? LMK if you think it's OK to merge as-is or if we should address other issues.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.03%) to 53.291% when pulling ded579f98f6bc22db25a243b4a310487d9e70646 on warn-on-vcf-errors into f9d5fc7fefc7f96cadbc3a87cb9a5cfad0b55409 on master.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.03%) to 53.291% when pulling ded579f98f6bc22db25a243b4a310487d9e70646 on warn-on-vcf-errors into f9d5fc7fefc7f96cadbc3a87cb9a5cfad0b55409 on master.