multi sample pipeline: ALT allele list in final VCF differ between samples/files

kerrimalone commented 5 years ago

In reference to this issue: https://github.com/iqbal-lab-org/minos/issues/10

The list of ALT alleles in the final VCFs output by the multi-sample pipeline differs between files. For example:

REFGEN POS    .   AG  AT  .   .   KMER=7  DP:GT:COV:GT_CONF:GT_CONF_PERCENTILE 47:0/1:35,12:89.47:2.44

REFGEN POS    .   AG  AT,GG,GT    .   .   KMER=7  DP:GT:COV:GT_CONF:GT_CONF_PERCENTILE 12:0/0:12,0,0,0:123.74:15.31
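
A quick way to spot this kind of mismatch is to compare the ALT column of the two records directly. The following is a minimal sketch, not part of minos: the function names are made up for illustration, and it splits on any whitespace only because the records above were pasted with spaces rather than tabs.

```python
def alt_alleles(vcf_line):
    """Return (chrom, pos, ref, alts) for one VCF record line.

    Splits on any whitespace, which is an assumption that suits the
    pasted records; a real VCF file is tab-separated.
    """
    fields = vcf_line.split()
    chrom, pos, _id, ref, alt = fields[:5]
    alts = [] if alt == "." else alt.split(",")
    return chrom, pos, ref, alts


def compare_alts(line_a, line_b):
    """Return the ALT alleles found only in line_a and only in line_b."""
    a = alt_alleles(line_a)
    b = alt_alleles(line_b)
    if a[:3] != b[:3]:
        raise ValueError("records describe different sites")
    only_a = sorted(set(a[3]) - set(b[3]))
    only_b = sorted(set(b[3]) - set(a[3]))
    return only_a, only_b


# The two records quoted in this issue.
sample1 = "REFGEN POS . AG AT . . KMER=7 DP:GT:COV:GT_CONF:GT_CONF_PERCENTILE 47:0/1:35,12:89.47:2.44"
sample2 = "REFGEN POS . AG AT,GG,GT . . KMER=7 DP:GT:COV:GT_CONF:GT_CONF_PERCENTILE 12:0/0:12,0,0,0:123.74:15.31"

only_1, only_2 = compare_alts(sample1, sample2)
print(only_1, only_2)  # [] ['GG', 'GT']
```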

iqbal-lab commented 5 years ago

As I understand it, this is a bug in the final VCFs: they should have identical records for all samples, except for the genotype column.

martinghunt commented 5 years ago

What do you mean by "final"? The very end of the pipeline, or the per-sample VCFs after regenotyping? When the pipeline merges to make one wide VCF, it uses the "debug" VCF for each regenotyped sample. The "final" VCF in each per-sample regenotyped run will also be there, and should be ignored.

iqbal-lab commented 5 years ago

Ah, so let me clarify. You have a final per-sample VCF, and then a debug combined regenotyped thing. Then you try to make a wide one, and that fails. If it did work, you would obviously have the same records for all samples. An alternative is to back off from making a wide VCF and simply make a single VCF per sample that has the same records as the wide one would have had.

iqbal-lab commented 5 years ago

I think I thought the final per-sample one came after the debug one. Danger of the word final!

martinghunt commented 5 years ago

Sorry, wasn't clear. I think you get it. But to be explicit, these are the stages of the pipeline:

1. Split every VCF into 'small' and 'large' variants (large variants are ignored).
2. Cluster all small variants from all VCF files, making a single new VCF file.
3. Build gramtools graph of small variants from VCF in 2.
4. Independently run minos on each sample, using the build from 3 as the reference graph. Important: this is a straight run of minos adjudicate. This means that each minos output directory has the debug VCF file with all alleles, and the final VCF file with unseen alleles removed.
5. Merge all per-sample debug VCF files (because they have the same alleles) into one wide VCF file.

From a user's point of view, if the pipeline didn't crash on large numbers of samples, then all you would see is the final, wide VCF file made in stage 5. The intermediate files, which include those from stage 4, are in Nextflow's work directory and ideally should be thought of as temporary and deleted at the end of the pipeline.

The problem at the moment is that because 5 crashes, if you want the same alleles in all samples, then you want to take the debug VCFs from 4.

Hope that makes sense!
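
To make the merge in stage 5 concrete, here is a rough sketch of why it relies on every per-sample debug VCF having the same sites and alleles. This is not the pipeline's actual merge code: `merge_wide` and its input layout (a dict of sample name to a list of 10-field records) are made up for illustration, and taking INFO from the first sample is an assumption.

```python
def merge_wide(per_sample):
    """Merge per-sample records into rows of one wide VCF.

    per_sample: dict of sample name -> list of records, where each record
    is a list of the 10 VCF fields (hypothetical layout for this sketch).
    All samples must have the same sites, alleles and FORMAT, in the same order.
    """
    names = sorted(per_sample)
    first = per_sample[names[0]]
    if any(len(per_sample[n]) != len(first) for n in names):
        raise ValueError("samples have different numbers of records")

    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    rows = [header + names]
    for i, rec in enumerate(first):
        row = rec[:9]  # shared columns, taken from the first sample
        for name in names:
            other = per_sample[name][i]
            # CHROM..ALT and FORMAT must agree, otherwise the genotype
            # indices in different columns would point at different alleles.
            if other[:5] != rec[:5] or other[8] != rec[8]:
                raise ValueError(f"record {i}: sample {name} does not match")
            row.append(other[9])
        rows.append(row)
    return rows


# Toy input with made-up values, not real pipeline output.
toy = {
    "s1": [["ref", "100", ".", "AG", "AT,GG,GT", ".", ".", "KMER=7", "GT:COV", "0/1:35,12,0,0"]],
    "s2": [["ref", "100", ".", "AG", "AT,GG,GT", ".", ".", "KMER=7", "GT:COV", "0/0:12,0,0,0"]],
}
for row in merge_wide(toy):
    print("\t".join(row))
```

Feeding this the per-sample "final" VCFs instead would trip the ALT check as soon as one sample had an unseen allele removed, which is the mismatch reported in this issue.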

iqbal-lab commented 5 years ago

It does make sense. One more question: in your merge step, if it did not crash, would it remove the alleles that are never seen in any of the samples?

So I think I need to either:

1. fix the merge step so it does not crash (and potentially remove all alleles that never occur), or
2. parse all the debug VCFs and remove all the alleles that never occur, leaving one VCF per sample, all with the same sites and alleles.
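
The second option could look roughly like the sketch below for a single site. It is only an illustration under several assumptions: `prune_site` and `used_alleles` are made-up names, each record is a list of the 10 VCF fields, every sample has the same REF and ALT at the site (as in the debug VCFs), and COV holds one value per allele including REF. Any other per-allele FORMAT field would need the same treatment.

```python
import re


def used_alleles(records):
    """Allele indices called in the GT of at least one sample at this site."""
    used = {0}  # always keep REF
    for fields in records:
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        for idx in re.split(r"[/|]", sample["GT"]):
            if idx != ".":
                used.add(int(idx))
    return used


def prune_site(records):
    """Drop ALT alleles that no sample calls, and renumber GT and COV to match.

    records: the same site from every sample's debug VCF, each a list of
    the 10 VCF fields (hypothetical layout for this sketch).
    """
    if any(r[:5] != records[0][:5] for r in records):
        raise ValueError("records do not share the same site and alleles")
    alts = records[0][4].split(",")
    keep = sorted(used_alleles(records))
    remap = {old: new for new, old in enumerate(keep)}
    new_alt = ",".join(alts[i - 1] for i in keep if i > 0) or "."

    pruned = []
    for fields in records:
        keys = fields[8].split(":")
        sample = dict(zip(keys, fields[9].split(":")))
        sep = "|" if "|" in sample["GT"] else "/"
        calls = re.split(r"[/|]", sample["GT"])
        sample["GT"] = sep.join("." if c == "." else str(remap[int(c)]) for c in calls)
        if "COV" in sample:
            cov = sample["COV"].split(",")
            sample["COV"] = ",".join(cov[i] for i in keep)
        new_sample = ":".join(sample[k] for k in keys)
        pruned.append(fields[:4] + [new_alt] + fields[5:9] + [new_sample])
    return pruned


# Toy records with made-up values, not real pipeline output.
s1 = ["ref", "100", ".", "AG", "AT,GG,GT", ".", ".", "KMER=7", "GT:COV", "0/1:35,12,0,0"]
s2 = ["ref", "100", ".", "AG", "AT,GG,GT", ".", ".", "KMER=7", "GT:COV", "0/0:12,0,0,0"]
for rec in prune_site([s1, s2]):
    print("\t".join(rec))
```

After pruning, both toy records carry the single ALT allele `AT`, so the per-sample files agree on sites and alleles without needing the wide merge.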

iqbal-lab commented 5 years ago

oh joy this is still open

multi sample pipeline: ALT allele list in final VCF differ between samples/files #69