sbslee / fuc

Frequently used commands in bioinformatics
https://sbslee-fuc.readthedocs.io
MIT License
48 stars 5 forks source link

[VCF] Question on usage #63

Closed ironb25 closed 2 years ago

ironb25 commented 2 years ago

Follow up question after resolving this(https://github.com/sbslee/fuc/issues/62) on the usage.

1) I want to find common variants, and unique to each of the vcf. (2 vcf files from mutect both) 2) Secondly, I want to concatenate vcf calls from strelka and mutect, and remove duplicates.

Could you suggest which methods to use for these?

thanks, Rohan

sbslee commented 2 years ago

Because I'm currently out of town and don't have my computer with me, please excuse me for not being able to provide the fullest answers. But I will try my best :) I should be back by Sunday (in Korean time), so if below doesn't get you where you need, please let me know and I will try to provide additional support as soon as I get back.

For the first question, there are a number of ways to achieve the goal, but I suggest to use the pyvcf.VcfFrame.to_variants() method:

from fuc import pyvcf

vf1 = pyvcf.VcfFrame.from_file('in1.vcf')
vf2 = pyvcf.VcfFrame.from_file('in2.vcf')

variants1 = vf1.to_variants()
variants2 = vf2.to_variants()

common_variants = list(set(variants1).intersection(variants2))
unique_variants1 = [x for x in variants1 if x not in variants2]
unique_variants2 = [x for x in variants2 if x not in variants1]

For the second question, again, there are a number of ways but I recommend using the pyvcf.merge() method with the collapse=True option:

merged_vf = pyvcf.merge([vf1, vf2], how='outer')
merged_vf.to_file('out.vcf')

This will merge two VcfFrame objects while removing duplicate records.

ironb25 commented 2 years ago

The answer to the first question worked , thank you!

Regarding second question on merging variants from 2 callers, when doing outer join, it does not work.

 File "/home/ubuntu/miniconda3/envs/fuc/lib/python3.9/site-packages/fuc/api/pyvcf.py", line 5277, in <listcomp>
    key=lambda col: [CONTIGS.index(x) if isinstance(x, str)
ValueError: 'MT' is not in list
sbslee commented 2 years ago

@ironb25,

Thanks for the patience, I'm back! Glad to hear the first question is resolved.

For the second question, the problem occurred because currently the pyvcf.VcfFrame.sort method expects you to use either M or chrM for indicating the mitochondria contig (you are using MT). Clearly, this needs to be fixed because people need to be able to use pyvcf.VcfFrame.sort regardless of what contigs their input VCF file has. I will update this for the next release 0.34.0 -- it should be an easy fix. In the meantime, can you send me the two VCF files you are trying to merge? This way, I can make sure there are no other surprises for you. But it's totally fine if you can't send them to me.

ironb25 commented 2 years ago

I shared the file in the email.

On Mon, Jun 6, 2022 at 6:44 AM sbslee @.***> wrote:

@ironb25 https://github.com/ironb25,

Thanks for the patience, I'm back! Glad to hear the first question is resolved.

For the second question, the problem occurred because currently the pyvcf.VcfFrame.sort method expects you to use either M or chrM for indicating the mitochondria contig (you are using MT). Clearly, this needs to be fixed because people need to be able to use pyvcf.VcfFrame.sort regardless of what contigs their input VCF file has. I will update this for the next release 0.34.0 -- it should be an easy fix. In the meantime, can you send me the two VCF files you are trying to merge? This way, I can make sure there are no other surprises for you. But it's totally fine if you can't send them to me.

— Reply to this email directly, view it on GitHub https://github.com/sbslee/fuc/issues/63#issuecomment-1147312248, or unsubscribe https://github.com/notifications/unsubscribe-auth/APQ4JY2VW5NNJKPLYT2EELLVNXJCBANCNFSM5XWVFWDA . You are receiving this because you were mentioned.Message ID: @.***>

sbslee commented 2 years ago

@ironb25,

I updated the pyvcf.VcfFrame.sort method to handle custom contigs such as MT. This resolves your second question. I sent you the output VCF after merging.

FYI, I noticed that the output VCF has no genotype data (e.g. 0/1) for the samples NORMAL and TUMOR. When I looked into this more closely, I found that the strelka2.vcf file does not have the GT field at all. Was this intended?

sbslee commented 2 years ago

FYI, you can try out the updated fuc yourself with the following:

$ git clone https://github.com/sbslee/fuc
$ cd fuc
$ git checkout 0.34.0-dev
$ pip install -e .
sbslee commented 2 years ago

Since the two original issues have been resolved, I will close this issue. Please feel free to re-open it if necessary.