stat-lab / EvalSVcallers

Evaluate the performances (precision and recall) of structural variation (SV) callers
32 stars 13 forks source link

EvalSVcallers

EvalSVcallers provides scripts for evaluating SV calling results obtained with NA12878 whole genome sequencing data or simulated WGS data. In addition, it also provides scripts for generating simulated mobile element insertion (MEI), nuclear mitochondrial insertion (NUMT), and viral element insertion (VEI) datasets.

The convert_SV_callers_vcf.pl script converts final output files from 69 existing tools to vcf files compatible to this evaluation system. Some variants with low quality or some SV types (BND, TRA) not treated in this evaluation will be filtered out at this step. The details of the filtering condition for each tool have been described in the Supplementary Methods of the paper (Genome Biology 20:117).

The evaluate_SV_callers.pl script in the script folder outputs precision and recall values of SV calls for each SV type and size range. By default, it evaluates SV calls obtained with the NA12878 WGS data. By specifying A, MEI, VEI, or MT with an option -r, it evaluates SV calls obtained with simulated data, including Sim-A, Sim-MEI, Sim-VEI, or Sim-NUMT. The reference SV vcf file for the NA12878 real data is NA12878_DGV-2016_LR-assembly.vcf in the Ref_SV folder. The other reference vcf files for the simulated datasets used in our study are Sim-A.SV.vcf, Sim-MEI.chr17.vcf, Sim-VEI.chr17.vcf, and Sim-NUMT.chr17.vcf in the Simulated_data_in_our_study folder. By specifying a reference SV vcf file with the -r option, users can evaluate SV calls obtained with specific WGS data corresponding to the specified reference SV data. The evaluation with this script can integrate SV calls with trio data by specifying with -p1 and -p2 options. In this case, when the calls that match the reference SVs exhibit Mendelian inheritance errors (calls found only in child but in neither parent), the corresponding calls are not regarded as true positive calls to calculate precision and recall.

The evaluate_SV_overlap_calls.pl script in the script folder outputs precision and recall values of overlap calls between a pair of tools for each SV type and size range It outputs the results for all the combinations of pairs of tools specified with the t option. This can also integrate SV calls with trio data by specifying -pv or -pd option. By specifying the -vcf option, it also outputs vcf file(s) for overlap SV calls for every pairs of tools.

The evaluation system has been developed with the GRCh37 reference, but the hg19-based SV calling results (the chromosome names with 'chr' prefix) can be used for the evaluation (2020.8.3).

Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biology 20(1):117 (2019). doi: 10.1186/s13059-019-1720-5.

[1] Evaluate SV calls

(a) Converting the output files for SV callers to a vcf format compatible to this analysis

scripts/convert_SV_callers_vcf.pl

[Example]

scripts/convert_SV_callers_vcf.pl -t Pindel pindel.DEL.out pindel.DUP.out > Pindel.vcf

[Options]

--tool or -t a tool name (69 tools used in our study) [mandatory] --class or -c specify when tool is Mobster or PBHoney (MEI|NUMT|VEI for Mobster, NGM for PBHoney-NGM) --len or -l minimum size (bp) of SV to be kept [default: 30] --xlen or -xl maximum size (bp) of SV to be kept [default: 20000000] --help or -h output help message

(b) Reporting precision and recall of SV calls from a single tool

scripts/evaluate_SV_callers.pl

[Example]

For NA12878 SVs

scripts/evaluate_SV_callers.pl -r N Pindel.NA78.vcf

For Sim-A SVs

scripts/evaluate_SV_callers.pl -r A Pindel.simA.vcf

For NA12878 SVs, integrating trio SV data

scripts/evaluate_SV_callers.pl -r N Pindel.NA78.vcf -p1 Pindel.parent1.vcf -p2 Pindel.parent2.vcf

[Options]

   --ref or -r <STR>        reference SV type (A|N), A: Sim-A, N: NA12878 [default: N]
   --sv_type or -st <STR>   SV type (ALL|DEL|DUP|INS|INV|TRA) if specifying multi type, e.g., DEL,INS [default: ALL]
   --parent1 or -p1 <STR>   SV vcf file of parent1 (optional)
   --parent2 or -p2 <STR>   SV vcf file of parent2 (optional)
   --ref2 or -r2 <STR>      reference SV vcf file (optional)
   --ref_dir or -rd <STR>   directory containing reference SV vcf files (optional)
   --chr or -c <STR>        target chromosome to be analyzed [all or chr name(s), e.g., 4,5,6,X; default: all]
   --region_bed or -rb <STR> bed file indicating the regions to be analyzed (optional)
   --len or -l <INT>        minimum size (bp) of SV [default: 50]
   --xlen or -xl <INT>      maximum size of SV [default: 2000000]
   --ref_len or -rl <INT>   minimum size of reference SV [default: 30]
   --ref_xlen or -rxl <INT> maximum size of reference SV [default: 2000000]
   --min_read or -mr <INT>  minimum number of reads supporting an SV [default: 0]
   --read_set or -rs <INT>  read set of minimum number of reads (1: 2,3,4,5,6,7,8,9,10,12; 2: 2,3,5,7,9,11,13,15,17,19,30,40) [default: 1]
   --min_ovl or -mo <FLOAT> minimum rate of reciprocal overlap between called non-INS-SVs and reference non-INS-SVs [default: 0.5 for NA12878, 0.8 for Sim-A > 1Kb SVs, 0.6 for Sim-A <= 1 Kb SVs]
   --min_ins or -mins <INT> maximum allowable length between the breakpoints of called INSs and reference INSs [default: 200]
   --eval_gt or -eg         determine genotype accuracy for each SV type with the Sim-A data [default: false]
   --eval_bp or -eb         determine breakpoint and SV size accuracy for each SV type with the Sim-A data [default: false]
   --ins or -i              evaluate accuracy for INSs with called insertion size [default: false]
   --in_y or -y             include chrY [default: false]
   --help or -h             output help message

(c) Reporting precision and recall of overlap SV calls from pair(s) of tools

scripts/evaluate_SV_overlap_calls.pl -t # RSS: minimum number of reads supporting an SV

The input directory (the working directory by default) must contain vcf files to be tested. The vcf file names must start with tool name (e.g., Delly.*.vcf).

[Example]

For overlap calls of NA12878 from BreakDancer-DELLY, BreakDancer-Manta, DELLY-Manta pairs

scripts/evaluate_SV_overlap_calls.pl -r N -t BreakDancer:4 DELLY:3 Manta:3 -p BD-DELLY-Manta-overlap (the worling directory mus contain BreakDancer..vcf, Delly..vcf and Manta.*.vcf)

For overlap calls of NA12878 from BreakDancer-DELLY, BreakDancer-Manta, DELLY-Manta pairs, integrating trio data

scripts/evaluate_SV_overlap_calls.pl -r N -t BreakDancer:4 DELLY:3 Manta:3 -pv BreakDancer.parent1.vcf BreakDancer.parent2.vcf DELLY.parent1.vcf DELLY.parent2.vcf Manta.parent1.vcf Manta.prent2.vcf -p BD-DELLY-Manta-overlap-trio

For outputting vcf file(s) for overlap calls of NA12878 for each combination of tool pairs

scripts/evaluate_SV_overlap_calls.pl -r N -t BreakDancer:4 DELLY:3 Manta:3 -p BD-DELLY-Manta-overlap -vcf

(In this case, three vcf files showing overlapped calls between each combination of the three algorithms. In each vcf file, true positive calls that matched the reference SV data and false positive calls are labelled with 'TP' and 'FP', respectively, at the strat of each line.)

[Options]

   --ref or -r <STR>        reference SV type (A|N), A: Sim-A, N: NA12878 [default: N]
   --tools or -t <STR>      list of tool:read (e.g., Pindel:3 Lumpy:5)
   --sv_type or -st <STR>   SV type (ALL|DEL|DUP|INS|INV|TRA) [default: ALL]
   --sv_size or -ss <STR>   SV size, SS, S, M, L, Lo, or ALL, SS: 30-100 bp, S: 50-1000 bp, M: 1000-100,000 bp, Lo: 500-2,000,000 bp, L: 100,000-2,000,000 bp [default: ALL]
   --min_size or -s <INT>   minimum size (bp) of SV [default: 50]
   --chr or -c <STR>        target chromosome to be analyzed [all or chr name(s), e.g., 4,5,6,X; default: all]
   --read_len or -rl <INT>  read length [default: 125]
   --min_ovl or -mo <FLOAT> minimum rate of reciprocal overlap between called non-INS-SVs and reference non-INS-SVs [default: 0.5 for NA12878, 0.8 for Sim-A > 1Kb SVs, 0.6 for Sim-A <= 1 Kb SVs]
   --min_ovl2 or -mo2 <FLOAT> minimum rate of reciprocal overlap between non-INS-SVs from 2 datasets [default: 0.6]
   --min_ins or -mins <INT> maximum allowable length between overlap INS breakpoints [default: 200]
   --dir or -d <STR>        directory of input vcf files [default: ./]
   --parent_vcf or -pv <STR> list of parent vcf files (e.g., Pindel.mother.vcf Pindel.father.vcf Lumpy.mother.vcf Lumpy.father.vcf) [either specify -pd or -pv when using trio data]
   --parent_dir or -p <STR> directory containing parent vcf files (should not include child vcf files in this directory)
   --in_y or -y             include chrY [default: false]
   --prefix or -p <STR>     prefix of output
   --vcf_out or -vcf        output vcf file for overlapping calls between tool pairs [default: false]
   --help or -h             output help message

[2] Simulate MEI/NUMT/VEI diploid genomes (optimized for chr17 by default)

All the scripts introduce user-defined rates of artificial SNPs and 1- to 6-bp indels, in addition to MEIs, NUMTs, or VEIs, randomly into a reference genome.

(a) MEI simulator

The generate_sim_genome_MEI.pl script introduce user-defined numbers of AluY, LINE1, SVA, and HERVK mobile elements randomly into a reference genome. The MEI sequences to be introduced by default are derived from AluY, LINE1, SVA, and HERVK sequences from the hg19 chr1 (Alu_chr1_mi90mc10.fa, L1_chr1_mi90mc10.fa, SVA_chr1_mi90mc10.fa, and HERVK_chr1_mi90mc10.fa in the Simulated_data_resource folder), which had been obtained by searching using BLAST with min_ident: 90%, min_qcov: 10% for L1 and HERVK, 50% for SVA, and 70% for AluY. The MEI site to be introduced can be excluded from predefined ME regions (hs37_chr17_rmsk_MEI.bed for chr17, by default).

scripts/generate_sim_genome_MEI.pl [options] -r -p

[Example]

scripts/generate_sim_genome_MEI.pl -r chr17.fa -p Sim-MEI -s 0.01 -i 0.002 -an 200 -ln 100 -sn 20 -hn 10 -ht 2

[Options]

--ref or -r reference fasta file --pref or -p prefix of outputfiles [default: out] --al_fa or -af fasta file of blast-searched Alu library to be introduced [default: Alu_chr1_mi90mc10.fa] --l1_fa or -lf fasta file of blast-searched L1 library to be introduced [default: L1_chr1_mi90mc10.fa] --sv_fa or -sf fasta file of blast-searched SVA library to be introduced [default: SVA_chr1_mi90mc10.fa] --he_fa or -hf fasta file of blast-searched HERVK library to be introduced [default: HERVK_chr1_mi90mc10.fa] --me_bed or -mb bed file of mobile elements [i.e., Alu, L1, SVA, and HERVK] defined for the hs37 reference (optional) [default: hs37_chr17_rmsk_MEI.bed] --snp or -s genomic mutation rate for SNPs [default: 0.01] --indel or -i genomic mutation rate for indels [default: 0.002] --alu_num or -an number of introducing Alu [default: 300] --l1_num or -ln number of introducing LINE1 [default: 70] --sva_num or -sn number of introducing SVA [default: 20] --her_num or -hn number of introducing HERVK [default: 20] --het_rate or -ht hetero to homo ratio of introducing variants [default: 2] --help or -h output help message

(b) NUMT simulator

The NUMT sequences to be introduced by default are from libraries of defined Numts (BB_NUMT.fasta in the Simulated_data_resource folder) from UCSC Genome Browser (BMC Bioinformatics 13, S15 2012). The NUMT site to be introduced can be excluded from predefined ME regions (BB_NUMT.vcf by default).

scripts/generate_sim_genome_NUMT.pl [options] -r -p

[Example]

scripts/generate_sim_genome_NUMT.pl -r chr17.fa -p Sim-NUMT -s 0.01 -i 0.002 -nn 100 -ms 150 -ht 2

[Options]

--ref or -r reference fasta file --pref or -p prefix of outputfiles [default: out] --nm_fa or -nf fasta file of defined NUMT sequences [default: BB_NUMT.fasta] --nm_vcf or -nv vcf file of NUMTs defined for the hs37 reference [default: BB_NUMT.vcf] --snp or -s genomic mutation rate for SNPs [default: 0.01] --indel or -i genomic mutation rate for indels [default: 0.002] --numt_num or -nn number of introducing NUMTs [default: 200] --min_size or -ms minimum size of introducing NUMTs [default: 100]; --het_rate or -ht hetero to homo ratio of introducing variants [default: 2] --help or -h output help message

(c) VEI simulator

The viral sequences to be introduce by default are from libraries (human_virus_ncbi.rn.fa in the Simulated_data_resource folder) obtained from NCBI (http://www.ncbi.nih.gov/genomes/viruses/).

scripts/generate_sim_genome_VEI.pl [options] -r -p

[Example]

scripts/generate_sim_genome_VEI.pl -r chr18.fa -p Sim-VEI -s 0.01 -i 0.002 -vn 100 -mr 0.1 -ht 2

[Options]

--ref or -r reference fasta file --pref or -p prefix of outputfiles [default: out] --lib or -l fasta file of virus sequence library [default: human_virus_ncbi.rn.fa] --virus_list or -vl list of virus name to be introduced into reference (e.g., papilloma herpes); all virus sequences in the library are candidates of insertions if not specified --virus_num or -vn number of virus segmanets to be introduced [default: 100] --min_len or -ml minimum length of virus fragment [default: 500] --max_len or -xl maximum length of virus fragment [default: 10000] --mut_rate or -mr maximum mutation rate of virus sequence [default: 0.05, which means that 1~5% mutations are introduced into a virus fragment] --snp or -s genomic mutation rate for SNPs [default: 0.01] --indel or -i genomic mutation rate for indels [default: 0.002] --het_rate or -ht hetero to homo ratio of introducing variants [default: 2] --help or -h output help message

[3] Simulated data used in our study (Kosugi et al., Comprehensive evaluation of structural variation callers for whole genome sequencing.)

(a) Sim-A

Simulated diploid genome: Simulated_data_in_our_study/Sim-A/Sim-A.diploid.fa.gz Simulated SVs in the genomes: Simulated_data_in_our_study/Sim-A/Sim-A.SV.vcf (3,530 DELs, 1,656 DUPs, 2,819 INSs, and 309 INVs (4,853 heterozygous SVs) had been introduced into the Sim-A diploid genome.)

(b) Sim-MEI

Simulated diploid genome: Simulated_data_in_our_study/Sim-MEI/Sim-MEI.chr17.diploid.fa.gz Simulated SVs in the genomes: Simulated_data_in_our_study/Sim-MEI/Sim-MEI.chr17.vcf (651 MEIs (429 heterozygous MEIs) had been introduced into the Sim-MEI diploid chr17 genome.)

(c) Sim-NUMT

Simulated diploid genome: Simulated_data_in_our_study/Sim-NUMT/Sim-NUMT.chr17.diploid.fa.gz Simulated SVs in the genomes: Simulated_data_in_our_study/Sim-NUMT/Sim-NUMT.chr17.vcf (132 NUMTs (132 heterozygous NUMTs) had been introduced into the Sim-NUMT diploid chr17 genome.)

(d) Sim-VEI

Simulated diploid genome: Simulated_data_in_our_study/Sim-VEI/Sim-VEI.chr17.diploid.fa.gz Simulated SVs in the genomes: Simulated_data_in_our_study/Sim-VEI/Sim-VEI.chr17.vcf (100 VEIs (64 heterozygous VEIs) had been introduced into the Sim-VEI diploid chr17 genome.)

Simulate paired-end reads using the ART simulator

The ART simulator is available at .

(For generating paired-end reads with the Sim-A diploid genome, Sim-A.diploid.fa)

The paired-end reads have 125 bp in length, 500 bp of insert size, 100 bp of insert size SD, and 30x coverage.

$ART_DIR: a directory containing ART executables.

$ART_DIR/art_illumina -sam -i Sim-A.diploid.fa -l 125 -f 15 -m 500 -s 100 -o Sim-A_30x -p -1 $ART_DIR/Illumina_profiles/HiSeq2500L125R1.txt -2 $ART_DIR/Illumina_profiles/HiSeq2500L125R2.txt