Closed kerrimalone closed 4 years ago
As I understand it, this is a bug in the final VCFs, which should have identical VCF records for all samples, except for the genotype column.
What do you mean by "final"? The very end of the pipeline, or the per-sample VCFs after regenotyping? When the pipeline merges to make one wide VCF, it uses the "debug" VCF for each regenotyped sample. The "final" VCF from each per-sample regenotyping run will also be there, and should be ignored.
ah, so let me clarify. you have a final per-sample vcf, and then a debug combined regenotyped thing. then you try to make a wide one, and that dies/fails. but if it did work, you'd obviously have the same records for all samples. an alternative is to back off from making a wide vcf, and instead simply make a single vcf per sample which has the same records as you would have had in the wide one
i think i assumed the final per-sample one came after the debug one. danger of the word final!
Sorry, wasn't clear. I think you get it. But to be explicit, these are the stages of the pipeline:
From a user's point of view, if the pipeline didn't crash on large numbers of samples, then all you would see is the final, wide VCF file made in stage 5. The intermediate files, which include those from stage 4, are in Nextflow's work directory and ideally should be thought of as temporary and deleted at the end of the pipeline.
The problem at the moment is that because stage 5 crashes, if you want the same alleles in all samples, you need to take the debug VCFs from stage 4.
Hope that makes sense!
it does make sense. one more question - in your merge step, if it did not crash, would it remove the alleles that are never seen in any of the samples?
so i think i need to either
oh joy this is still open
In reference to this issue: https://github.com/iqbal-lab-org/minos/issues/10
The list of ALT alleles in the final VCFs output by the multi-sample pipeline differs between files.
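For anyone wanting to check whether a set of per-sample VCFs is affected, here is a minimal sketch in plain Python. It does not use any minos code; the function names `read_site_alleles` and `diff_alleles` are made up for illustration, and it assumes uncompressed, tab-separated VCF text with one record per site. It reports every site where the ALT list is not identical across samples, including sites missing from some files.

```python
import io


def read_site_alleles(handle):
    """Map (CHROM, POS, REF) -> tuple of ALT alleles for one VCF stream."""
    sites = {}
    for line in handle:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        sites[(chrom, pos, ref)] = tuple(alt.split(","))
    return sites


def diff_alleles(sites_by_sample):
    """Return sites whose ALT list is not the same in every sample.

    A sample that lacks the site entirely is recorded as None.
    """
    all_keys = set()
    for sites in sites_by_sample.values():
        all_keys.update(sites)
    mismatches = {}
    for key in sorted(all_keys):
        alts = {name: sites.get(key) for name, sites in sites_by_sample.items()}
        if len(set(alts.values())) > 1:
            mismatches[key] = alts
    return mismatches


if __name__ == "__main__":
    # Toy in-memory VCFs (hypothetical data, not real pipeline output).
    header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample\n"
    vcf_a = header + "ref\t10\t.\tA\tC,G\t.\tPASS\t.\tGT\t1/1\n"
    vcf_b = header + "ref\t10\t.\tA\tC\t.\tPASS\t.\tGT\t1/1\n"
    samples = {
        "sample_a": read_site_alleles(io.StringIO(vcf_a)),
        "sample_b": read_site_alleles(io.StringIO(vcf_b)),
    }
    for site, alts in diff_alleles(samples).items():
        print(site, alts)
```

To run it on real files, replace the `io.StringIO(...)` objects with `open(path)` handles, one per VCF being compared.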