Closed jburos closed 7 years ago
@jburos: thanks a lot for this workaround!
Thinking about this a bit more: since this is a data issue (that happened to be introduced due to a snpeff bug) and since we don't normally expect this to happen; maybe it is easier to not try to handle this case at all within cohorts. I know that we are currently trying to get things going for RCC, but I think it is fair for cohorts to fail fast and hard on misformatted input files.
How about we manually fix this file for now (that is, make it a proper VCF but with no variants in it) and I will address this issue in biokepi so that this bug never manifests itself on the cohorts level?
@armish that's great -- if you can fix it on the biokepi level then I can avoid using this branch. Mainly I wanted to get things working quickly & painlessly :)
Having the option to skip over & report on bad imports might be a useful feature, but not a high priority. This work-around is more of a hack & isn't robust enough to warrant being part of cohorts master just yet.
All right - after an hour of digging into VCF standards and the commonly used parsing libraries, it looks like I was too optimistic about having a proper empty VCF. Apparently, none of the parsers expect to read in an empty VCF and they all throw `StopIteration` once they are at the end of the file with no variants in memory.
Do you think it makes sense to just drop that file from the metadata since we know that it doesn't have any information in it or would it break some part of the cohorts?
Also: heads up that I rebased your branches since my GCloud experimentation somehow sneaked into the other ones and was making tests fail on your branches. Don't forget to `git pull -X theirs` to accept those changes from the remote :spiral_notepad:
Thanks @armish --
I've been using your gcloud utilities in my branch of the rcc project, so personally I've kept them in this branch just cause I didn't expect to merge it into master (and i need one "complete" branch to reference for that project). Thinking now I will create a temporary "rcc-analysis" branch just to keep those merges separate from the ones intended for master?
I'm OK with either removing that file from the index, or catching & reporting on that error. Either way. If this is the output from snpeff, and if it failed, then maybe the failure should be handled by biokepi? otherwise if this is indeed the "empty snp file" then it seems reasonable to catch only that StopIteration error & return a value of 0 for number of variants.
Either way, welcome your feedback / thoughts on how to best address both of the above!
Ah, crap. Messed it up when I pushed after running `git pull -X theirs`. Didn't mean to do that! I'll fix.
Sorry for the hassle that rebase caused you - didn't realize you were using the gcloud part, apologies. Looks like you have already reverted it back, but I have the original state of the repo in case that didn't solve the basing issue.
Re: snpeff: you are right - if the snpeff run had failed (and exited with a non-zero status), the pipeline would have stopped there. So this looks like a successful run without any variants, and I think your suggestion to catch the `StopIteration` and return an empty list would be the best solution here.
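A minimal sketch of that idea, not the actual cohorts code: `strict_load`, `load_variants_or_empty`, and `write_temp_vcf` are hypothetical names, and `strict_load` only imitates a parser that raises `StopIteration` on a file with no records. It assumes that `StopIteration` from the loader always means "valid header, zero variants".

```python
import os
import tempfile


def strict_load(path):
    """Hypothetical stand-in for a parser that peeks at the first record."""
    with open(path) as handle:
        records = (line.rstrip("\n") for line in handle if not line.startswith("#"))
        first = next(records)  # raises StopIteration if there are no records
        return [first] + list(records)


def load_variants_or_empty(path, loader=strict_load):
    """Return the loader's variants, or [] if it hits StopIteration."""
    try:
        return loader(path)
    except StopIteration:
        # Assumption: StopIteration here means "valid header, zero variants".
        return []


def write_temp_vcf(body):
    """Write a small VCF-like file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".vcf")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    return path


header = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
empty_path = write_temp_vcf(header)
one_path = write_temp_vcf(header + "1\t100\t.\tA\tT\t.\tPASS\t.\n")

empty_variants = load_variants_or_empty(empty_path)
one_variant = load_variants_or_empty(one_path)

os.remove(empty_path)
os.remove(one_path)
```

Catching only `StopIteration` keeps every other parse failure loud, which fits the "fail fast on misformatted input" point raised earlier in the thread.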
Let me know if there are parts I can take over to give you more time with the clinical part.
No problem - thanks @armish for merging your gcloud tools into master. This will make things easier for me. I will rebase again, :) thus incorporating those edits into this working branch.
Otherwise, what do you think we should do with this branch-in-progress? LMK if you think it's OK to merge as-is or if we should address other issues.
In one of our cohorts, we have a record with an empty snpeff file. In this case, the VCF is empty because there are no variants identified, however it could happen for any reason.
Currently, there is an error reading in this file. For now, I would like the option of warning on that load so that the other VCFs can be read in.
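As a rough illustration of the warn-and-continue option, here is a sketch under stated assumptions: `load_all` and its `skip_errors` flag are hypothetical names rather than part of cohorts, and `loader` stands for whatever function reads a single VCF.

```python
import warnings


def load_all(paths, loader, skip_errors=True):
    """Load every path; on failure, warn and keep going if skip_errors is set."""
    loaded, failed = {}, {}
    for path in paths:
        try:
            loaded[path] = loader(path)
        except Exception as err:  # StopIteration is a subclass of Exception
            if not skip_errors:
                raise
            warnings.warn(f"Skipping {path}: {err!r}")
            failed[path] = err
    return loaded, failed


def fake_loader(path):
    """Hypothetical loader used only to exercise load_all."""
    if path == "bad.vcf":
        raise ValueError("malformed")
    return ["variant"]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loaded, failed = load_all(["a.vcf", "bad.vcf", "b.vcf"], fake_loader)
```

Returning the failures alongside the loaded data lets the caller report which files were skipped instead of silently dropping them.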
Current limitations of this approach are:
Leaving this PR as [WIP] until these two issues are addressed.