stat-lab / EvalSVcallers

Evaluate the performances (precision and recall) of structural variation (SV) callers
32 stars 13 forks source link

convert_SV_callers_vcf.pl fail to convert my delly and lumpy sv.out #17

Open tyyiyi opened 2 years ago

tyyiyi commented 2 years ago

Hi,

I am using the convert_SV_callers_vcf.pl for Delly and Lumpy(smoove) output files (BCF and VCF). It does not work for Delly and Lumpy output files. the files are empty or damaged files.

I am not sure what went wrong. Any help would be appreciated.

delly returned log file image

I use smoove to run lumpy(It is recommended on https://github.com/arq5x/lumpy-sv),returned log file: image …… image

Thanks!

stat-lab commented 2 years ago

For Delly, you have to set the path of bcftools in your $PATH environmental variable to use the convert_DELLY_vcf.pl script with your Delly output bcf files. Alternatively, you can use vcf files converted from Delly output bcf files with bcftools. For Lumpy, you have to use a Lumpy output vcf file not a bedpe file as an input file of the convert_Lumpy_vcf.pl script. Do not use the -bedpe option for the Lumpy command.

tyyiyi commented 2 years ago

Thank you for your suggestion. I used a tool called smoove to run lumpy. Its command is: $smoove call --outdir $smooveout --exclude $bed --name $fname --fasta $ref -p 1 --genotype $duplicate/$i image so, I did not use the -bedpe option. the result I got is: image I used the file called simachr17_f10_m{300}-smoove.genotyped.vcf.gz as the input file.

stat-lab commented 2 years ago

Your Lumpy output vcf is gzip compressed. I edited the convert_Lumpy_vcf.pl script to be able to use gzip vcf files. Try to use the new script.

tyyiyi commented 2 years ago

Thank you for your help, I will try again and give you feedback.

tyyiyi commented 2 years ago

Hi, I used vcf files converted from Delly output bcf files with bcftools and then used convert_SV_callers_vcf.pl.However, the obtained file is still empty. The input file are: image

The log file shows: image

I also got the same empty result in lumpy after I used the new script. The log file shows: image

stat-lab commented 2 years ago

Could you show the first 5 lines, except for headers, of the vcf-converted files of Delly and Lumpy outputs, respectively?

tyyiyi commented 2 years ago

delly vcf file: image

lumpy output: image

stat-lab commented 2 years ago

Are your SV calls only from decoy references? Our evaluation scripts delete SV calls from non-general chromosomes, except for the human chr1-chr22, X, and Y. If you need include all variants, you have to edit downloaded scripts, convert_Delly_vcf.pl and convert_Lumpy_vcf.pl by commenting out (add # at the line head) the line, 'next if ($chr !~ /^chr/) and ($chr !~ /^[\dXY]+$/);'.

tyyiyi commented 2 years ago

I am using GRCh37 as a reference genome. His chromosome names are as follows. I only need 22 autosomes and X and Y chromosomes. GCF_000001405.25_GRCh37.p13_genomic.fna.fai: image

stat-lab commented 2 years ago

I do not understand. The 1st column of the reference index (*.fa.fai) should be chromosome name. Why are not your chromosome names, 1-22, X or Y if using GRCh37?

tyyiyi commented 2 years ago

my GRCh37 is from NCBI as follow. image image

can I use hg19 as the reference genome when I align and call SV?

stat-lab commented 2 years ago

Please see https://www.ncbi.nlm.nih.gov/genome/?term=GRCh37 I recommend using 1000Genomes hs37d5.fa (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz).

tyyiyi commented 2 years ago

If I only need SVs on chromosomes 1-22 and X and Y chromosomes, can I use hg19 as a reference genome? In addition, if I select a chromosome (such as chr17) in the simulated genome (Sim-A) as a template, and then use the ART simulator to simulate the next-generation sequencing data of chr17. So can I only use chr17 of hg19/hs37d5 as the reference genome when aligning(BWA) and calling structural mutations?

stat-lab commented 2 years ago

Of course, you can use hg19. The advantage of using hs37d5.fa, which contains several decoy contig sequences that could not incorporate into the GRCh37 reference, is to increase short read mapping accuracy by mapping reads to decoy sequences. For the simulated experiments with chr17, you can use only chr17 as reference.

tyyiyi commented 2 years ago

Okay, thank you very much for your help, I will try to use hs37d5 to call sv again.

tyyiyi commented 2 years ago

Hi, I have successfully run evaluate_SV_callers.pl and got the results, but I have some doubts about the data. image such as DEL: The sum of Call(A)*Precise(A) in each column should be equal to the total number of DEL (REF), but this is not true in my data. Is my understanding wrong? call(A) is not the SV I called?

stat-lab commented 2 years ago

The total DEL call, Call (A) is the sum of Call (S), Call (M), and Call (L). The indicated precision and recall values are percentages, so the total true positive DELs is 64 * 0.92.

tyyiyi commented 2 years ago

I mean when supporting reads=2, TP=64 0.92, when supporting reads=3, TP=59 0.915...... Then each supporting reads corresponds to a TP value, and the sum of these values should not be less than the total Ref-DEL(104)? Or does 2,3,4...refer to supporting reads >=2,3,4?

stat-lab commented 2 years ago

The number of supporting reads (RSS) indicates the number of reads supporting the called SV (not a TP value). So higher RSSs give higher precision but lower recall.

tyyiyi commented 2 years ago

Okay, I understand. In addition, I tested the delly vcf file converted by convert_SV_callers_vcf.pl and excluded DEL SVs with READS=1 and READS=0. The result shows that there are 58 DELs, but evaluate_SV_callers.pl counts 64 DELs (above). What is the reason for this?

tyyiyi commented 2 years ago

Hi, I saw in the supplementary method that DELLY removed the SV without PASS when converting the file, but I looked at the delly vcf file converted by convert_SV_callers_vcf.pl, and it showed that it contained LowQual SV. image image Can you explain the filtering conditions in detail during the conversion process? Thanks in advance.

stat-lab commented 2 years ago

In our paper, we filtered out LowQual SVs of Delly calls, but the provide conversion script does not filter out the LowQual SVs because the retain or discard of such variants depends on the purpose of research. If you need to filter out such variants, you can easily remove them using awk or grep command.

tyyiyi commented 2 years ago

Thank you for your reply, but I found that there are 498 SVs called by delly, but only 257 after conversion, and only 165 after evaluating_SV_callers.pl.

delly call:498 image

convert:257 image

evaluate:64(DEL)+87(DUP)+14(INV)+32(READS=0)+34(READS=1)=231 image image

And, similar to delly, there are 370 SVs called by lumpy, but only 157 after conversion, and 223(85(DEL)+65(DUP)+4(INV)+69(READS=1)+0(READS=1)=223) after evaluating_SV_callers.pl. Why the sum of the data counted by evaluate_SV_callers.pl plus the data of READS=1 and READS=0 is greater than the number of SVs after conversion. can you give me a solution? Thank you!

tyyiyi commented 2 years ago

Hello, happy new year. When I ran evaluated_SV_callers.pl, I had some questions. I saw some variations as follows. The INV I detected is as follows: image The Ref-SV is as follows: image I found that some of the INV I detected and the INV gave in Ref-SV are different by a few bases, so when they compare with each other, what is the threshold for judging that it is a true positive SV? (The number of bases that differ between Ref-SV and the SV detected by the software)