iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Better first time construction of PRG string from VCF #148

Closed bricoletc closed 4 years ago

bricoletc commented 4 years ago
  1. We currently have one way of building a PRG from a VCF: the Perl script https://github.com/iqbal-lab-org/gramtools/blob/master/gramtools/utils/vcf_to_linear_prg.pl . It has the following caveats:

    1. If records overlap, it takes only the first record; all subsequent overlapping records are ignored.
    2. If records are adjacent, it enumerates all combinations of their alleles. For example, I made 10 SNP records, all adjacent (consecutive POS), with 3,3,4,4,4,4,4,4,2,4 alleles respectively. This generates a single record whose number of alleles is the product of these numbers: 294912. We should stop before this.
    3. If FILTER is not PASS, it drops the record. Should we keep this behaviour @iqbal-lab ?

    The first of these issues can be dealt with by running the vcf_clusterer module on the input VCF, and running build on its output.
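    To make the second caveat concrete, here is a minimal Python sketch of the allele count produced by merging adjacent records, plus one possible way of "stopping before this": closing a merged group once its combination count would exceed a cap. The function names and the cap value are hypothetical and are not taken from the Perl script; whether split groups can sit back to back in the PRG is a separate question this sketch does not address.

```python
from math import prod


def combined_allele_count(allele_counts):
    """Number of alleles obtained by enumerating every combination
    of the alleles of a run of adjacent records."""
    return prod(allele_counts)


def split_by_limit(allele_counts, limit):
    """Greedily group adjacent records, starting a new group whenever
    adding the next record would push the combination count above `limit`.
    `limit` is a hypothetical cap, not an existing gramtools option."""
    groups, current = [], []
    for n in allele_counts:
        if current and prod(current) * n > limit:
            groups.append(current)
            current = []
        current.append(n)
    if current:
        groups.append(current)
    return groups


counts = [3, 3, 4, 4, 4, 4, 4, 4, 2, 4]
print(combined_allele_count(counts))            # 294912
print(split_by_limit(counts, limit=1000))       # [[3, 3, 4, 4, 4], [4, 4, 4, 2, 4]]
```

    With a cap of 1000, the ten records end up in two groups of 576 and 512 combinations instead of a single site with 294912 alleles.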

iqbal-lab commented 4 years ago

Yes, drop non-PASS records.

leoisl commented 4 years ago

I am wondering if the responsibility for building PRGs from any source (MSAs, VCFs, or any other input format) should be moved from the tools (gramtools, pandora, etc.) to the make_prg repo. I will restart working on make_prg next week, as it is the memory bottleneck in the pandora denovo pipeline in some cases, but since we are focusing on the pandora paper, I will only consider MSAs as input for now.

A bad way to make it accept VCFs in the new make_prg implementation would be transforming VCFs to MSAs (I guess this is doable) before running it.

Where do you think we should attack this problem?

leoisl commented 4 years ago

BTW, the new make_prg implementation should accept VCFs directly as soon as possible, but this first implementation will take only MSAs.

bricoletc commented 4 years ago

I agree make_prg should be its own library because pandora and gramtools both need it.

I have made a Python utility for VCF to PRG string conversion which is gramtools specific for now (0-level nesting).

My only concern with going from VCF to MSA and then from MSA to PRG string is: if you ask for nesting level 0, what do you get? Does no clustering happen, so that the module just enumerates all the alternatives at each variant site? I am hoping you get the same as my Python utility.
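For reference, a rough sketch of what "enumerating all the alternatives at each variant site" with 0-level nesting could look like. This is not the actual utility: the function name, the input tuples, and the marker scheme (an odd number opens a site, the following even number separates alleles and closes the site) are assumptions made for illustration, and records are assumed to be sorted, non-overlapping, and 0-based.

```python
def linear_prg(ref, records):
    """Build a flat (non-nested) PRG string.

    ref     -- reference sequence as a string
    records -- sorted, non-overlapping (pos, ref_allele, alt_alleles) tuples,
               with 0-based pos (hypothetical input format)
    """
    out = []
    cursor = 0   # next reference position not yet written
    marker = 5   # assumed first site marker; odd numbers open sites
    for pos, ref_allele, alts in records:
        out.append(ref[cursor:pos])            # invariant sequence before the site
        alleles = [ref_allele] + list(alts)    # reference allele first, then alternatives
        sep = str(marker + 1)                  # even number separates and closes
        out.append(str(marker) + sep.join(alleles) + sep)
        cursor = pos + len(ref_allele)
        marker += 2
    out.append(ref[cursor:])                   # trailing invariant sequence
    return "".join(out)


records = [(1, "C", ["T"]), (3, "T", ["G", "A"])]
print(linear_prg("ACGTA", records))  # A5C6T6G7T8G8A8A
```

If the VCF to MSA to PRG route at nesting level 0 produces the same string as a direct conversion like this one, the two approaches are interchangeable for gramtools.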

bricoletc commented 4 years ago