moeinzadeh / Ranbow

9 stars 1 forks source link

Modified VCF contains no phased variants #4

Open schrins opened 3 years ago

schrins commented 3 years ago

Hi,

I would like to run ranbow to phase a short part of a potato genome. The input I have is a set of Hifi/CCS reads, a VCF file (which was not created from the CCS reads, but from another dataset of high-coverage Illumina reads) and a reference genome. I tried to follow the steps from the main page as best as I could and ranbow was able to compute a phasing. However, what I would like to have is a modified VCF file, which contains the phasing information of the computed phasing. As far as I understand, the modVCF step is supposed to do this, but it only changes the genotypes in the VCF file, it does not change them into phased variants (separated with a "|" instead of a "/"). Is this the expected behavior? Also, the process took more than an hour on a small test region of length ~30kb and about 1900 variants, which seemed quite long (but maybe the algorithm was not designed for CCS reads).

The commands I ran are the following: python2.7 ranbow.py hap -mode index -par chr4.70Mx.small.txt -processorIndex 0 python2.7 ranbow.py hap -mode hap -par chr4.70Mx.small.txt -processorIndex 0 python2.7 ranbow.py hap -mode modVCF -par chr4.70Mx.small.txt -processorIndex 0

The used parameter file basically contains the paths to the data. I attached the test data and parameter file in case you want to reproduce the behavior [1]. The reference genome (Solyntus V1.1) is taken from a publication [2] and publicly available as download [3]. I have to admit, that I had to make a change to the ranbow code before executing it: My input VCFs do not have cigar strings and I didn't find a proper tool to just add them without re-calling all the variants (which I could not do anyways, because I do not have the original reads for this, just the resulting VCF). Therefore I added a small "algorithm" to compute proper cigar strings on the fly based on the allele sequences for the corresponding variant. I attached the modified file, where my inserted code can be found from lines 625-656.

Kind regards, Sven

[1] test_instance.zip [2] https://doi.org/10.1534/g3.120.401550 [3] https://www.plantbreeding.wur.nl/Solyntus/index.html

moeinzadeh commented 2 years ago

Dear Sven, Thanks for the detailed explanation. I am working on this issue and come back to you soon. Best, Hossein

moeinzadeh commented 2 years ago

Dear Sven (@schrins) ,

My input VCFs do not have cigar strings and I didn't find a proper tool to just add them without re-calling all the variants (which I could not do anyways, because I do not have the original reads for this, just the resulting VCF). Therefore I added a small "algorithm" to compute proper cigar strings on the fly based on the allele sequences for the corresponding variant. I attached the modified file, where my inserted code can be found from lines 625-656.

Thank you for writing the script for generating Cigar string in VCF file. I integrate your code. Now if the cigar will be available in vcf file, ranbow uses that and if not ranbow uses your code. Thanks again.

As far as I understand, the modVCF step is supposed to do this, but it only changes the genotypes in the VCF file, it does not change them into phased variants (separated with a "|" instead of a "/"). Is this the expected behavior?

The modVCF was design to detect inflated variants of variant caller. It uses the haplotype information to check if a called variant was correct or not. I didn't add the genotype column to the vcf file because of a technical issue for haplotyping with illumina data. You can use the "ranbow.single.hap" file if you want to see the haplotypes in coded allele version (0,1,2,...).

Also, the process took more than an hour on a small test region of length ~30kb and about 1900 variants, which seemed quite long (but maybe the algorithm was not designed for CCS reads).

There was a parameter ( WinLen) hard coded in the code which you can use when you use CCS reads. You can decrease this parameter to 6 in order to speed up haplotyping. I made this parameter available for users now. For example:


python ranbow.py hap -mode hap  -par Sven_issue4_data2/params.text -WinLen 6
Maximum insert length is  35000
(StSOLv1.1ch04, 1, 72008707, StSOLv1.1ch04)
Run ranbow - single mode
StSOLv1.1ch04   readBAM 0:00:07.483036  Ranbow_single   0:08:04.877853  single2file     0:00:03.381847

Before that this parameter was 8. Decrease of this parameter reduce the search space and make the algorithm faster. Maybe with adding very minor error, however I assume not for CCS reads.

Best, Hossein