starskyzheng / panpop

Application of pan-genome for population
MIT License
87 stars 8 forks source link

Command that merging outcomes of multiple SV callers #18

Closed lovelycatZ closed 6 months ago

lovelycatZ commented 6 months ago

Hello Zheng,

Thank you for providing a very nice tool. Currently, I am curious about how to merge outcomes of multiple SV callers. How to provide some vcf files to the standalone PART script? I noticed that the --in_vcf option can support only one file.

Thank you for your soon reply!

starskyzheng commented 6 months ago

Hi! You can follow this issue (#15). Bcftools need to be used.

lovelycatZ commented 6 months ago

Hello Zheng,

Thank you for your soon reply! It works well for after using BCFTOOLS to merge outcomes of multiple SV callers. But the final result file named 3.final.vcf.gz had no records, and there was only a header section. While I used the SURVIVOR to merge and I got some records. Why?

starskyzheng commented 6 months ago

Can you provid vcf files?

lovelycatZ commented 6 months ago

vcfs.zip There are 4 vcf files from 4 CNV callers.

starskyzheng commented 6 months ago

Your VCF file is not in common format. GT should be 0/0, 0/1, 1/1, 0/2, et al. However, GT in 21B01474858_cn.MOPS.vcf.gz was one number, which cannot be recognized by PanPop or SURVIVOR. GT even not exists in file 21B01474858_ExomeDepth.vcf.gz, 21B01474858_GATK_gCNV.vcf.gz, and 21B01474858_Manta.vcf.gz. You may manully add GT to those VCF files and rerun PanPop or SURVIVOR.

lovelycatZ commented 6 months ago

Yes, you are right. Actually, the output of some callers is not in VCF format (such as TSV format), so manual conversion is taken. But what if I can't infer whether a CNV is homozygous or heterozygous? Does the GT tag have to exist in the VCF files to run the PART standalone script?

starskyzheng commented 6 months ago

GT must exists to use PART script. But I dont know how to deal with CNVs, unless you can transformat it into normal vcf file. Perhaps the CNV=0 means a deletion of GT=1/1, CNV=1 means a deletion of GT=0/1, CNV=3 means a insertion of GT=0/1?

lovelycatZ commented 6 months ago

OK, I'll try to provide GT tag, thanks for your convenient time!

lovelycatZ commented 6 months ago

This time there are a lot of records in the final VCF file, but it seems like all of them are not merged. Can you give me some advice according to the data? data.zip

starskyzheng commented 6 months ago

There were no sequence info in VCF files. REF and ALT (colunm 4 and 5) need to be completed. Many be you can refer to https://github.com/starskyzheng/panpop/blob/main/scripts/long_caller_parser.pl.

lovelycatZ commented 6 months ago

Do you mean I have to have the details for column 4 and 5? But if a SV is DEL or DUP, it make no sense to present sequence in REF or ALT. And what does this script do?

starskyzheng commented 6 months ago

For DUP or DEL, the information of REF and ALT can be determed by other columns. But in common VCF format, REF and ALT must be specific. Some of sv-callers also generate VCFs without REF and ALT info, so we write this script as transformation. But you may do some modification to apply to your dataset. DUP is a specific type of insertion.

lovelycatZ commented 6 months ago

Thank you for your time! I noticed that this script was customized for the internal callers and it's a little hard for me to fully understand script/long_caller_parser.pl as I am not familiar with PERL. This error that all the records were not merged was due to no sequences were provided for REF and ALT, right? What if I add the sequences? Is this step should be performed before or after the bcftools merge process? Would you please provide a VCF file that was processed by script/long_caller_parser.pl so I could make the correct format?

starskyzheng commented 6 months ago

yes, this process must be done before bcftools merge. VCF file could like this: https://samtools.github.io/hts-specs/VCFv4.2.pdf.

lovelycatZ commented 6 months ago

In some cases, a DEL or DUP has a length of more than 1000bp, should we also add detailed sequences for REF and ALT?

starskyzheng commented 6 months ago

yes.

lovelycatZ commented 6 months ago

Hello Zheng,

I'm sorry to have bothered you so long.

After I added the detailed sequences for REF, I got the results that seemed to be a little different than before. But there were still the same warnings as before in the process:

Done reading ref fasta read 195 contig / total 3099922541 bp
Warn! aln software(muscle) exit status(2)! redo! No. 1
Warn! aln software(muscle) exit status(2)! redo! No. 1
Warn! no alt[0] from aln software(HAlignC)! redo! No. 1
Warn! aln software(famsaP) exit status(1)! redo! No. 1
Warn! aln software(mafft) exit status(1)! redo! No. 1
Warn! aln software(muscle) exit status(2)! redo! No. 1
0 not in alts: $VAR1 = [];
 at /data/xiangxd/software/panpop/scripts/../lib/realign_alts.pm line 333.
    realign_alts::alt_alts_to_muts(undef, 3, ARRAY(0x2b76d14944a0)) called at /data/xiangxd/software/panpop/scripts/../lib/realign_alts.pm line 311
    realign_alts::process_alts(ARRAY(0x2b77a5d9f070), 3, 12452, undef, ARRAY(0x2b76d14944a0)) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 495
    main::process_line_new(ARRAY(0x2b77a5d9f070), HASH(0x2b77a59f9890), "chr16", 152760) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 446
    main::process_lines_new(ARRAY(0x2b77a531e2e8), "chr16", 152760, 165211) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 254
    main::mce_run(MCE=HASH(0x2b76d1074b70)) called at /data/xiangxd/perl5/lib/perl5/MCE/Core/Worker.pm line 489
    MCE::_worker_do() called at /data/xiangxd/perl5/lib/perl5/MCE/Core/Worker.pm line 593
    MCE::_worker_loop() called at /data/xiangxd/perl5/lib/perl5/MCE/Core/Worker.pm line 714
    MCE::_worker_main() called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 2057
    MCE::_dispatch() called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 2101
    MCE::_dispatch_child() called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 692
    MCE::spawn(MCE=HASH(0x2b76d1074b70)) called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 991
    MCE::run() called at /data/xiangxd/perl5/lib/perl5/MCE/Flow.pm line 429
    MCE::Flow::run(CODE(0x2b76d147bb78), CODE(0x2b76d1467330)) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 193
Warn! aln software(muscle) exit status(2)! redo! No. 1
Warn! aln software(muscle) exit status(2)! redo! No. 1
Warn! aln software(muscle) exit status(2)! redo! No. 1
All Done

This time, I provided 3 VCF files which were from 3 different callers. But when I added more VCF files to the PART standalone script, this warning came out and the running program just got stuck:

0 not in alts: $VAR1 = [];
 at /data/xiangxd/software/panpop/scripts/../lib/realign_alts.pm line 333.
    realign_alts::alt_alts_to_muts(undef, 3, ARRAY(0x2b76d14944a0)) called at /data/xiangxd/software/panpop/scripts/../lib/realign_alts.pm line 311
    realign_alts::process_alts(ARRAY(0x2b77a5d9f070), 3, 12452, undef, ARRAY(0x2b76d14944a0)) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 495
    main::process_line_new(ARRAY(0x2b77a5d9f070), HASH(0x2b77a59f9890), "chr16", 152760) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 446
    main::process_lines_new(ARRAY(0x2b77a531e2e8), "chr16", 152760, 165211) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 254
    main::mce_run(MCE=HASH(0x2b76d1074b70)) called at /data/xiangxd/perl5/lib/perl5/MCE/Core/Worker.pm line 489
    MCE::_worker_do() called at /data/xiangxd/perl5/lib/perl5/MCE/Core/Worker.pm line 593
    MCE::_worker_loop() called at /data/xiangxd/perl5/lib/perl5/MCE/Core/Worker.pm line 714
    MCE::_worker_main() called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 2057
    MCE::_dispatch() called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 2101
    MCE::_dispatch_child() called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 692
    MCE::spawn(MCE=HASH(0x2b76d1074b70)) called at /data/xiangxd/perl5/lib/perl5/MCE.pm line 991
    MCE::run() called at /data/xiangxd/perl5/lib/perl5/MCE/Flow.pm line 429
    MCE::Flow::run(CODE(0x2b76d147bb78), CODE(0x2b76d1467330)) called at /data/xiangxd/software/panpop/bin/../scripts/realign.pl line 193

In addition to that, there were too many ./. GT tag in 3.final.vcf, is that correct? Here's attached a subsample and reduced version.

Many thanks! 3.final.zip

starskyzheng commented 6 months ago

Your input VCF is in wrong format. For deletion, REF is the SV sequence, ALT is * or the first base of REF For DUP or insertion, REF is one base of reference genome of this position of mutation. ALT is the SV sequence.

lovelycatZ commented 6 months ago

Great thanks for your help! I've successfully run both two steps of the PART standalone script and got the result without any errors and warnings. I noticed that the INFO tag ZORIPOS=XXX;ZORIEND=XXX;ZLEN=XXX was within some records but not all. And some GT tag was "./." How can we interpret the result? part.zip

starskyzheng commented 6 months ago

As for SVs you can use bin/Fill_DP.pl. However, your VCF is for CNVs, I don't know how to deal with it.

lovelycatZ commented 6 months ago

I have built the conda enviroment according to the guide, but it get wrong for perl bin/Fill_DP.pl: Can't locate zzIO.pm in @INC (you may need to install the zzIO module) (@INC contains: /data/xiangxd/perl5/lib/perl5/x86_64-linux-thread-multi /data/xiangxd/perl5/lib/perl5 /data/xiangxd/software/miniconda3/envs/panpop/lib/site_perl/5.26.2/x86_64-linux-thread-multi /data/xiangxd/software/miniconda3/envs/panpop/lib/site_perl/5.26.2 /data/xiangxd/software/miniconda3/envs/panpop/lib/5.26.2/x86_64-linux-thread-multi /data/xiangxd/software/miniconda3/envs/panpop/lib/5.26.2 .) at bin/Fill_DP.pl line 26. BEGIN failed--compilation aborted at bin/Fill_DP.pl line 26. I can't even find the zzIO module on the PREL website.

starskyzheng commented 6 months ago

Fixed. Please try again. (bffe3f0) However, I don't think this script can handle your VCF properly.

lovelycatZ commented 6 months ago

In fact, I just want to know which records in the VCF file were merged by PART and which are not.

starskyzheng commented 6 months ago

Sorry, PanPop do not support this function.

starskyzheng commented 6 months ago

If no more activity, this issue will be closed.