iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

pandora vcf question #260

Closed smb20200615 closed 3 years ago

smb20200615 commented 3 years ago

Are there any tools to convert the pandora VCF to a regular VCF? I tried using vcf2scoary (https://github.com/AdmiralenOla/Scoary/blob/master/scoary/vcf2scoary.py) to get a pangenome variant matrix from the pandora VCF, but it didn't work, perhaps because of differences in the VCF formatting. I would really appreciate your guidance on how to use the output of pandora with existing downstream analysis tools.

iqbal-lab commented 3 years ago

Hi there. Can you be a bit more specific about what you mean by a "regular VCF"? What downstream tools do you have in mind?

smb20200615 commented 3 years ago

Hello, I am mostly interested in the pangenome variant matrix, so I would like to pass the VCF from pandora to scripts such as https://github.com/AdmiralenOla/Scoary/blob/master/scoary/vcf2scoary.py, which would generate a matrix of variants per genome. This is not currently working. Also, would you recommend any tools for finding the regions of the genome with the most variation?

leoisl commented 3 years ago

Hello,

I can confirm that https://github.com/AdmiralenOla/Scoary/blob/master/scoary/vcf2scoary.py fails with pandora VCFs with the following error message:

Traceback (most recent call last):
  File "vcf2scoary.py", line 224, in <module>
    main()
  File "vcf2scoary.py", line 167, in main
    line = next(lines)
_csv.Error: field larger than field limit (131072)

As a control, the script worked fine on a single-sample snippy VCF. Since this seems to be a useful use case, I will debug the error and provide a script that fixes pandora VCFs so that vcf2scoary.py accepts them. I will keep you posted!

Cheers

leoisl commented 3 years ago

Dear @smb20200615

I think this only needed a small fix to the Scoary script, assuming you hit the same error as I did. I fixed it in a branch of the Scoary repo (please see the new script here), and was then able to run vcf2scoary.py on a VCF produced by pandora when comparing 20 samples (from the paper). Could you please let me know if this fix also works for you?

The error arose because pandora created a very long VCF record, longer than the field limit of 131072 reported in the traceback. This is a side effect of representing graph variants in the VCF format. Some regions of the graph are very dense, and to describe and genotype them linearly, as VCF requires, we have to merge and linearise all variants present in any sample in the graph into a single VCF record, which can become very long.
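
For readers who hit the same `_csv.Error` in their own scripts, here is a minimal sketch of one common way around it: raising the field size limit of Python's `csv` module before reading. This is not necessarily the exact change made in the Scoary branch; the helper name `raise_csv_field_limit` and the toy record are made up for illustration.

```python
import csv
import io
import sys


def raise_csv_field_limit():
    """Raise the csv field size limit as far as the platform allows.

    Hypothetical helper: on some platforms sys.maxsize does not fit in a
    C long, so the value is shrunk until csv.field_size_limit accepts it.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit //= 10


# Toy VCF data line whose ALT allele is longer than the default limit (131072).
long_alt = "A" * 200_000
record = "\t".join(["geneA", "1", ".", "C", long_alt, ".", ".", ".", "GT", "1"])

new_limit = raise_csv_field_limit()

# With the raised limit, the long record is read without "_csv.Error".
rows = list(csv.reader(io.StringIO(record + "\n"), delimiter="\t"))
```

Note that `csv.field_size_limit` changes a module-wide setting, so it only needs to be called once, before the first read.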

mbhall88 commented 3 years ago

Another thing you could try is trimming unused ALT alleles and normalising the VCF using bcftools. In my experience this generally reduces the size of a lot of alleles.

An example from one of my pipelines:

$ bcftools view -a -O v <pandora vcf> | bcftools norm -c e -f <pandora vcf ref fasta> -o pandora.norm.vcf
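
To check whether trimming and normalising actually shortened the records, a small script along these lines could be run on the VCF before and after. This is only a sketch: the function name `longest_allele_per_record` and the example lines are hypothetical, and it assumes plain-text (uncompressed) VCF lines.

```python
def longest_allele_per_record(vcf_lines):
    """Yield (chrom, pos, longest_allele_length) for each VCF data line."""
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # ALT may hold several comma-separated alleles; "." means no ALT.
        alleles = [ref] + [a for a in alt.split(",") if a != "."]
        yield chrom, pos, max(len(a) for a in alleles)


# Made-up example lines; with a real file, pass an open file handle instead.
example = [
    "##fileformat=VCFv4.3\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "geneA\t10\t.\tACGT\tA,ACGTTTTT\t.\t.\t.\tGT\t1\n",
    "geneB\t5\t.\tC\tT\t.\t.\t.\tGT\t0\n",
]
result = list(longest_allele_per_record(example))
```

Sorting the result by the third element shows which records are most likely to exceed a downstream tool's field limit.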

smb20200615 commented 3 years ago

Thank you so much for your super helpful guidance as always. Worked well for me!