smehringer / SViper

Swipe your Structural Variants called on long (ONT/PacBio) reads with short exact (Illumina) reads.
BSD 3-Clause "New" or "Revised" License

Able to deal with the VCF file produced by PBSV? #14

Closed JunpengShi closed 4 years ago

JunpengShi commented 5 years ago

Dear Svenja,

Is SViper able to deal with the VCF format produced by PBSV, as follows:

    chr1 25350 pbsv.DEL.1 CATGTCATCAGAGTGGGGCTGAAGCAGCCCG C . PASS SVTYPE=DEL;END=25380;SVLEN=-30 GT:AD:DP:SAC 0/1:34,11:45:18,16,8,3
    chr1 26780 pbsv.DEL.2 TAGAGTCTTGAGCAAAATCTG T . PASS SVTYPE=DEL;END=26800;SVLEN=-20 GT:AD:DP:SAC 0/1:22,27:49:14,8,15,12
    chr1 27914 pbsv.DEL.3 CGAACGGACGAACGATCGAAT C . PASS SVTYPE=DEL;END=27934;SVLEN=-20 GT:AD:DP:SAC 0/1:19,13:32:9,10,10,3
    chr1 113278 pbsv.INS.44 G GTGATGATTAATGGGATGAGG . PASS SVTYPE=INS;END=113278;SVLEN=20 GT:AD:DP:SAC 0/1:4,14:18:1,3,7,7
    chr1 128621 pbsv.INS.45 C CCGCCGTGCCGCCATTATCGCCG . PASS SVTYPE=INS;END=128621;SVLEN=22 GT:AD:DP:SAC 0/1:4,3:7:2,2,1,2

I noticed SViper's requirement for the VCF file (tags instead of sequences, e.g. <DEL>). Is it required for all SVs to have a tag such as <DEL> or <INS>? Do you plan to add support for the VCF file from PBSV?
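To make the difference between the two ALT styles concrete, here is a minimal Python sketch. It is not SViper code, and the names `SV`, `is_symbolic`, `parse_info` and `sv_from_record` are hypothetical. It assumes a single ALT allele per record and shows how type, end and length could be derived either from the INFO column (tag-style records) or from the REF/ALT sequences (sequence-resolved records such as the PBSV ones quoted above).

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int      # 1-based position of the anchor base
    svtype: str   # "DEL" or "INS"
    end: int
    svlen: int    # negative for deletions, as in the PBSV records above


def is_symbolic(alt: str) -> bool:
    """True for tag-style ALT alleles such as <DEL> or <INS>."""
    return alt.startswith("<") and alt.endswith(">")


def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a key -> value dict (flags map to '')."""
    return dict(item.partition("=")[::2] for item in info.split(";") if item)


def sv_from_record(line: str) -> SV:
    """Derive SV type, end and length from one whitespace-separated record."""
    fields = line.split()
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    info = parse_info(fields[7])

    if is_symbolic(alt):
        # Tag-style record: type comes from the tag, coordinates from INFO.
        end = int(info["END"])
        # Assumption: without SVLEN, fall back to the span (negative for DEL).
        svlen = int(info["SVLEN"]) if "SVLEN" in info else pos - end
        return SV(chrom, pos, alt[1:-1], end, svlen)

    # Sequence-resolved record: REF and ALT share the first (anchor) base.
    svlen = len(alt) - len(ref)
    svtype = "DEL" if svlen < 0 else "INS"
    end = pos + len(ref) - 1
    return SV(chrom, pos, svtype, end, svlen)
```

Applied to the first record above (`pbsv.DEL.1`), the sequence branch yields type DEL, end 25380 and length -30, which agrees with the values PBSV already writes into INFO.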

Best regards, Junpeng

smehringer commented 5 years ago

Hi @JunpengShi,

Thanks for your interest in SViper! Unfortunately, this is currently not supported, but I can try to add the feature soon if you need it.

Best, Svenja

JunpengShi commented 5 years ago

It would be great if the PBSV format could be added to SViper.

Let me know if you need test files such as a PBSV VCF file, long-read BAM files or Illumina BAM files.

Thank you very much!

smehringer commented 5 years ago

Hi @JunpengShi,

I added a branch with a first draft of an implementation.

It would be great if you have a small test data set that I could test the code with!

Best, Svenja

JunpengShi commented 5 years ago

Hi Svenja,

I have sent you the test files by email because of GitHub's file size limit.

The title is "Test files for the PBSV feature of SViper" and my email address is shijunpeng@cau.edu.cn.

Best, Junpeng

smehringer commented 5 years ago

Hi @JunpengShi,

I have updated the feature branch. It works on your example file now. Can you check out the branch, try the feature and report back to me before I merge it into master?

Thanks, Svenja

JunpengShi commented 5 years ago

Hi Svenja,

Perfect! I have installed the PBSV feature branch locally. It works with the complete data set, which contains >300,000 insertions and deletions.

By the way, I found some small bugs that might be easily addressed to further improve this branch.

  1. The following header lines are malformed: they are missing the closing ">", which causes errors when loading the file into genome browsers like IGV:

    FILTER=<ID=FAIL5,Description="The variant was polished away.

    FILTER=<ID=FAIL6,Description="The variant reference name does not exist in the short read BAM file.

    FILTER=<ID=FAIL7,Description="The variant reference name does not exist in the long read BAM file.

  2. The coordinates of some SVs change after polishing, which leaves some regions unsorted even though the original coordinates were sorted in the PBSV output.

  3. Some variant QUAL values are smaller than 0, which also causes errors when loading into IGV. Since QUAL is Phred-scaled, i.e. -10*log10(P(call is wrong)), it should never be negative in VCF format. Maybe you could add some adjustment of the QUAL values in SViper?
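Points 1 and 3 can be checked on any output file before loading it into a browser. The following Python sketch is only an illustration, not the fix that went into SViper: the function names are hypothetical, records are assumed to be tab-separated, structured meta lines are assumed to start with `##` and end with `>` as in the VCF specification, and raising negative QUAL values to 0 is just one possible adjustment.

```python
import re

# Structured meta lines look like ##FILTER=<ID=...,Description="...">.
STRUCTURED_META = re.compile(r"^##\w+=<")


def malformed_header_lines(lines):
    """Return structured ## meta lines that do not end with the closing '>'."""
    return [ln for ln in lines
            if STRUCTURED_META.match(ln) and not ln.rstrip().endswith(">")]


def clamp_qual(record: str) -> str:
    """Return the tab-separated record with a negative QUAL raised to 0."""
    fields = record.rstrip("\n").split("\t")
    qual = fields[5]  # QUAL is the sixth VCF column
    if qual != "." and float(qual) < 0:
        fields[5] = "0"
    return "\t".join(fields)
```

For example, `malformed_header_lines(open("polished.vcf"))` (file name hypothetical) would list header lines like the three FILTER lines quoted in point 1, and `clamp_qual` could be mapped over the record lines before writing them back out.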

Thank you for your great efforts to improve SViper. Junpeng

smehringer commented 4 years ago

Hi @JunpengShi,

Thank you for your feedback! I addressed 1. and 3. in the branch since they were easy fixes.

I'll keep 2. in mind and maybe provide an option that sorts the VCF file again. For now, though, it is also convenient to have every variant on the same line as before when comparing the breakpoints before and after polishing to a gold set.
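Until such an option exists, the polished file can be re-sorted as a separate step. The Python sketch below is one hedged way to do it and is not part of SViper; `sort_vcf_lines` is a hypothetical helper that assumes tab-separated records, keeps the header as is, and orders contigs by their `##contig` header lines, falling back to order of first appearance.

```python
import re


def sort_vcf_lines(lines):
    """Return header lines unchanged, followed by records sorted by contig and POS."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    # Contig rank: ##contig header lines first, then order of first appearance.
    order = {}
    for ln in header:
        m = re.match(r"##contig=<ID=([^,>]+)", ln)
        if m:
            order.setdefault(m.group(1), len(order))
    for ln in records:
        order.setdefault(ln.split("\t", 1)[0], len(order))

    def key(ln):
        chrom, pos = ln.split("\t", 2)[:2]
        return order[chrom], int(pos)

    # sorted() is stable, so records with equal keys keep their relative order.
    return header + sorted(records, key=key)
```

Sorting a copy keeps the unsorted file available for the line-by-line comparison against the gold set. In practice an existing tool such as `bcftools sort` does the same job.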

Best, Svenja

smehringer commented 4 years ago

Feel free to reopen this or open another issue if anything else comes up!