jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html
Other
402 stars 200 forks source link

Walk_together cannot handle multiple mutation on a same position #315

Open Wan-Yifei opened 5 years ago

Wan-Yifei commented 5 years ago

Hi,

I want to process two vcf files: file#01:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  OGV-FL1-1_S1_mkdup.bam
chr1    131281  .       C       T       100     PASS    DP=152  GT:GQ:AD:DP:VF:NL:SB    0/1:100:75,77:152:0.507:20:-100.0000
chr1    131310  .       G       C       100     PASS    DP=256  GT:GQ:AD:DP:VF:NL:SB    0/1:100:191,65:256:0.254:20:-3.0607
chr1    131315  .       T       C       100     PASS    DP=305  GT:GQ:AD:DP:VF:NL:SB    0/1:100:113,192:305:0.630:20:-60.1999

And the file#2:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  OGV-FL1-1_S1_mkdup.bam
chr1    131281  .       C       G       100     PASS    DP=152  GT:GQ:AD:DP:VF:NL:SB    0/1:100:75,77:152:0.507:20:-100.0000
chr1    131281  .       C       T       100     PASS    DP=152  GT:GQ:AD:DP:VF:NL:SB    0/1:100:75,77:152:0.507:20:-100.0000
chr1    131310  .       G       C       100     PASS    DP=256  GT:GQ:AD:DP:VF:NL:SB    0/1:100:191,65:256:0.254:20:-3.0607
chr1    131315  .       T       C       100     PASS    DP=305  GT:GQ:AD:DP:VF:NL:SB    0/1:100:113,192:305:0.630:20:-60.1999

My code is:

## f_1 is the reader of file#1
## f_2 is the reader of file#2

get_key = lambda r: (r.CHROM, r.POS, r.REF, r.ALT)
walker = utils.walk_together(f_1, f_2, get_key)

test = [record for record in walker]

The test raised an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/local/lib/python3.2/dist-packages/vcf/utils.py", line 45, in walk_together
    min_k = min(next_idx_to_k.values())   # move on to next contig
TypeError: unorderable types: _Substitution() < _Substitution()

I also tried with default key:

walker = utils.walk_together(f_1, f_2)

test = [record for record in walker]

There is no error but the output looks weird:

>>>print(test)
>>>
##pseudo-output
[[f1_line1 ,f2_line1 ],
[None, f1_line2] ,
[f1_line3, f2_line3]]

As you can see, the line2 of file2 is as same as the line1 of file1. But it was combined with None. Could you please help me figure out this issue?

I think the reasonable output would be:

[[f1_line1 ,f2_line2 ],
[None, f2_line1] ,
[f1_line3, f2_line3]]

Thank you so much!