stat-lab / EvalSVcallers

Evaluate the performances (precision and recall) of structural variation (SV) callers

INS Evaluation using MELT #10

Open Mkddb opened 3 years ago

Mkddb commented 3 years ago

Dear team, thanks a lot for this wonderful, comprehensive evaluation.

I am currently trying to evaluate MELT for MEI analysis. Unfortunately, the preferred MELT version was not available, so I had to use the recent MELT version 2.2.0. It worked nicely for the simulated genome, but I am not able to get the final evaluation data for the individual subtypes; instead, results are shown only for "INS". The output is as follows:

Ref-ALU: 480
Ref-L1: 94
Ref-SVA: 37
Ref-HERVK: 40

<< Sim-MEI >>

INS

        <Number of supporting reads>
        2   3   4   5   6   7   8   9   10  12

Call (A) 477 476 472 463 458 448 436 422 401 346
Recall (A) 69.4 69.2 69.2 68.6 68.2 67.2 65.8 63.9 60.6 52.3
Precis (A) 94.7 94.7 95.5 96.5 96.9 97.7 98.3 98.5 98.5 98.5

Can you suggest a way to break the output down by individual MEI type, as in the supplementary table?

Second problem: I am unable to calculate precision and recall for real genome dataset 4 across supporting-read thresholds from 2 to 12. The provisional number of reference MEIs is given as 1350, but at what number of supporting reads? How does this number apply at other read thresholds? It would also be helpful if you could share how you arrived at 1350 from the 1000 Genomes dataset. Can you suggest a process or method to reproduce this dataset? Our data doesn't look clean; see below for the final MEI evaluation using real dataset 4:

Ref-DEL: 9176
  tinny (<= 100 bp): 1192
  short (<= 1.0 kp): 5558
  middle (<= 100 kb): 2330
  large (> 100 kb): 96
Ref-INS: 13669
  short (<=1.0 kp): 12660
  middle (<= 100 kb): 1008
  large (> 100 kb): 1
Ref-DUP: 2604
  short (<=1.0 kp): 873
  middle (<= 100 kb): 1585
  large (> 100 kb): 146
Ref-INV: 274
  short (<=1.0 kp): 26
  middle (<= 100 kb): 200
  large (> 100 kb): 48

<< NA12878_DGV-2016_LR-assembly >>

INS

        <Number of supporting reads>
        2   3   4   5   6   7   8   9   10  12

Call (A) 2048 1913 1644 1501 1411 1333 1253 1164 1048 807
Recall (A) 8.9 8.9 8.8 8.7 8.5 8.3 7.9 7.3 6.6 5
Precis (A) 59.8 63.9 73.7 79.2 82.8 85.2 86.2 86.6 86.8 85.1

stat-lab commented 3 years ago

I uploaded a revised evaluate_SV_callers.pl script. Try running it with the -r ME option. Please confirm that your test VCF file has the MEI subtype (i.e., ALU, LINE1, SVA, or HERVK) in the third column.
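One quick way to check the third column before running the evaluation is to tally its values. The following is a minimal sketch, not part of EvalSVcallers: the function name `count_id_column` and the example records are hypothetical, and it assumes a plain-text, tab-delimited VCF.

```python
from collections import Counter

def count_id_column(vcf_lines):
    """Tally the values found in the third (ID) column of VCF records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

# Hypothetical records, for illustration only (not real MELT output)
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1000\tALU\tN\t<INS>\t.\tPASS\tSVTYPE=INS\n",
    "chr1\t5000\tLINE1\tN\t<INS>\t.\tPASS\tSVTYPE=INS\n",
    "chr2\t7000\tALU\tN\t<INS>\t.\tPASS\tSVTYPE=INS\n",
]
print(count_id_column(example))
```

To run it on a real file, pass an open file handle instead of the list. If the tally shows values other than ALU, LINE1, SVA, or HERVK, the subtype breakdown would presumably not be produced for those records.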

We took 2 reads as the optimal number of supporting reads (RSS) for MEIs, judging from the results with the simulated data. This value may change depending on whether the researcher prioritizes precision or recall. The provisional MEI number per genome, 1350, was derived from the 1000G result (Nature 526, 2015), where 1,218 MEIs per individual were identified with an overall sensitivity of 83-96%.
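The reply does not spell out the calculation behind 1350. One plausible reading, shown here only as a sketch under that assumption, is that the observed count was divided by a sensitivity within the quoted range. The value 0.90 below is an assumption chosen for illustration, not a figure taken from the thread.

```python
def corrected_count(observed, sensitivity):
    """Estimate the true count given an observed count and a detection sensitivity."""
    return observed / sensitivity

observed_per_individual = 1218      # MEIs per individual, from the reply
low_sens, high_sens = 0.83, 0.96    # sensitivity range, from the reply
assumed_sens = 0.90                 # ASSUMPTION: a value inside the range

upper = corrected_count(observed_per_individual, low_sens)
lower = corrected_count(observed_per_individual, high_sens)
assumed = corrected_count(observed_per_individual, assumed_sens)

print(f"range: {lower:.0f} to {upper:.0f}; at {assumed_sens:.2f}: {assumed:.0f}")
```

Under this assumption the corrected count falls between roughly 1269 and 1467, and a sensitivity near 0.90 gives about 1353, which is close to the provisional 1350.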

Mkddb commented 3 years ago

Great, it works fine with the revised evaluate_SV_callers.pl script. I am able to get the MEI subtype data now.

< Parameter: Min SV length: 50, Allowed BP diff: 125, Ref-SV: MEI >
Ref-ALU: 480
Ref-L1: 94
Ref-SVA: 37
Ref-HERVK: 40

<< Sim-MEI >>

INS

        <Number of supporting reads>
        2   3   4   5   6   7   8   9   10  12

Call (A) 477 476 472 463 458 448 436 422 401 346
Recall (A) 69.4 69.2 69.2 68.6 68.2 67.2 65.8 63.9 60.6 52.3
Precis (A) 94.7 94.7 95.5 96.5 96.9 97.7 98.3 98.5 98.5 98.5

ALU

Call (A) 375 374 370 361 358 351 341 328 312 262
Recall (A) 74.1 73.9 73.9 73.1 72.9 71.8 70.4 67.7 64.3 53.9
Precis (A) 94.9 94.9 95.9 97.2 97.7 98.2 99.1 99 99 98.8

LINE1

Call (A) 66 66 66 66 64 61 59 58 56 53
Recall (A) 63.8 63.8 63.8 63.8 61.7 60.6 58.5 58.5 56.3 54.2
Precis (A) 90.9 90.9 90.9 90.9 90.6 93.4 93.2 94.8 94.6 96.2

SVA

Call (A) 36 36 36 36 36 36 36 36 33 31
Recall (A) 97.2 97.2 97.2 97.2 97.2 97.2 97.2 97.2 89.1 83.7
Precis (A) 100 100 100 100 100 100 100 100 100 100

HERVK

Call (A) 0 0 0 0 0 0 0 0 0 0
Recall (A) 0 0 0 0 0 0 0 0 0 0
Precis (A) 0 0 0 0 0 0 0 0 0 0

There are, however, minor differences in the statistical values. Do you see any specific reason for that? Also, is this normal, and may I go ahead and treat these values as a reproduction of your results?
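When comparing tables like these with published values, it can help to check that the three rows are mutually consistent. The sketch below assumes the standard definitions (true positives are roughly calls × precision, and recall is true positives divided by the reference count). It is not taken from evaluate_SV_callers.pl, whose exact matching rules may differ, and `approx_recall` is a hypothetical helper name.

```python
def approx_recall(calls, precision_pct, n_ref):
    """Approximate recall (%) from call count, precision (%), and reference count."""
    true_positives = round(calls * precision_pct / 100)
    return 100 * true_positives / n_ref

# (calls, precision %, reference count, reported recall %) at 2 supporting reads,
# copied from the Sim-MEI tables in this thread
rows = {
    "ALU":   (375, 94.9, 480, 74.1),
    "LINE1": (66, 90.9, 94, 63.8),
    "SVA":   (36, 100.0, 37, 97.2),
}

for name, (calls, precision, n_ref, reported) in rows.items():
    estimate = approx_recall(calls, precision, n_ref)
    print(f"{name}: estimated {estimate:.2f}, reported {reported}")
```

For these three rows the estimates agree with the reported recall to within about 0.1 percentage points, so differences of that size may simply reflect rounding.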