stat-lab / EvalSVcallers

Evaluate the performances (precision and recall) of structural variation (SV) callers
32 stars 13 forks source link

Filtering_Long Read Variants From NA12878_DGV-2016_LR-assembly.vcf #12

Closed Mkddb closed 3 years ago

Mkddb commented 3 years ago

Dear Team,

Is it feasible to remove the Long Read variants from the NA12878_DGV-2016_LR-assembly.vcf file and get the evaluation results only for DGV variants. Since, that's an obvious fact that long read willl detect more variants than the short read data. I am trying to have a fair comparison of SV calling tool from short read data (say Manta, Lumpy) and compare with the "becnhmarked Non-long read data only"

but, after removing long read variants from NA12878_DGV-2016_LR-assembly.vcf and running the evaluation command, i am getting the following error.

Command : perl evaluate_SV_callers_revisedscript23092020.pl -r N SRR1910373RGnewconverttrial1.vcf

Error : (For all the lines till last line 25603) . . . . . . Use of uninitialized value $line[7] in pattern match (m//) at evaluate_SV_callers.pl line 299, line 25601. Use of uninitialized value $type in string ne at evaluate_SV_callers.pl line 300, line 25601. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 293, line 25602. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 295, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 297, line 25602. Use of uninitialized value $line[7] in pattern match (m//) at evaluate_SV_callers.pl line 299, line 25602. Use of uninitialized value $type in string ne at evaluate_SV_callers.pl line 300, line 25602. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 293, line 25603. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 295, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 297, line 25603. Use of uninitialized value $line[7] in pattern match (m//) at evaluate_SV_callers.pl line 299, line 25603. Use of uninitialized value $type in string ne at evaluate_SV_callers.pl line 300, line 25603. Ref-DEL: 0 tinny (<= 100 bp): 0 short (<= 1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INS: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-DUP: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INV: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 << NA12878_DGV-2016_LR-assembly >>

and the output file only have this entry : Ref-DEL: 0 tinny (<= 100 bp): 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INS: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-DUP: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INV: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0

<< NA12878_DGV-2016_LR-assembly >>

Can you please help me fixing this issue ?

stat-lab commented 3 years ago

The reference SV vcf file must have a type of SV at the 3rd and the 8th column. Please see the original reference format. (Although the SV types at the 3rd and the 8th column are the same for the NA12878_DGV-2016_LR-assembly.vcf, several of other reference vcf files have different terms.)

2020/10/03 22:07、Mkddb notifications@github.comのメール:

Dear Team,

Is it feasible to remove the Long Read variants from the NA12878_DGV-2016_LR-assembly.vcf file and get the evaluation results only for DGV variants. Since, that's an obvious fact that long read willl detect more variants than the short read data. I am trying to have a fair comparison of SV calling tool from short read data (say Manta, Lumpy) and compare with the "becnhmarked Non-long read data only"

but, after removing long read variants from NA12878_DGV-2016_LR-assembly.vcf and running the evaluation command, i am getting the following error.

Command : perl evaluate_SV_callers_revisedscript23092020.pl -r N SRR1910373RGnewconverttrial1.vcf

Error : (For all the lines till last line 25603) . . . . . . Use of uninitialized value $line[7] in pattern match (m//) at evaluate_SV_callers.pl line 299, line 25601. Use of uninitialized value $type in string ne at evaluate_SV_callers.pl line 300, line 25601. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 293, line 25602. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 295, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25602. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 297, line 25602. Use of uninitialized value $line[7] in pattern match (m//) at evaluate_SV_callers.pl line 299, line 25602. Use of uninitialized value $type in string ne at evaluate_SV_callers.pl line 300, line 25602. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 293, line 25603. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 295, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type2 in string eq at evaluate_SV_callers.pl line 296, line 25603. Use of uninitialized value $type in string eq at evaluate_SV_callers.pl line 297, line 25603. Use of uninitialized value $line[7] in pattern match (m//) at evaluate_SV_callers.pl line 299, line 25603. Use of uninitialized value $type in string ne at evaluate_SV_callers.pl line 300, line 25603. Ref-DEL: 0 tinny (<= 100 bp): 0 short (<= 1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INS: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-DUP: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INV: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 << NA12878_DGV-2016_LR-assembly >>

and the output file only have this entry : Ref-DEL: 0 tinny (<= 100 bp): 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INS: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-DUP: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0 Ref-INV: 0 short (<=1.0 kp): 0 middle (<= 100 kb): 0 large (> 100 kb): 0

<< NA12878_DGV-2016_LR-assembly >>

Can you please help me fixing this issue ?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/12, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3ELADL5GDSVD2JJMPLSI4OX3ANCNFSM4SCUWGAQ.

Mkddb commented 3 years ago

Thanks for your kind reply.

I am using the same NA12878_DGV-2016_LR-assembly.vcf file provided in the Github package. it is working fine in its original format. I am simply ommiting or say deleting a few entries related to Source=Long Read. but, even deleting a single entry (any of it) is throwing me the above error. despite having Type of SV information at 3rd and 8th column both.

I am using the same scripts and same files. ommiting any entries in NA12878_DGV-2016_LR-assembly.vcf is throwing the error.

Can you please try to see this at your end.

stat-lab commented 3 years ago

Is the name of your edited reference vcf file the same as NA12878_DGV-2016_LR-assembly.vcf file and placed on the same directory as the original one? If not, specify the edited file with the -r2 option.

2020/10/04 14:07、Mkddb notifications@github.comのメール:

Thanks for your kind reply.

I am using the same NA12878_DGV-2016_LR-assembly.vcf file provided in the Github package. it is working fine in its original format. I am simply ommiting or say deleting a few entries related to Source=Long Read. but, even deleting a single entry (any of it) is throwing me the above error. despite having Type of SV information at 3rd and 8th column both.

I am using the same scripts and same files. ommiting any entries in NA12878_DGV-2016_LR-assembly.vcf is throwing the error.

Can you please try to see this at your end.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/12#issuecomment-703203480, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3DRW67UGWPINU2PZ7TSI77HZANCNFSM4SCUWGAQ.

Mkddb commented 3 years ago

Yes,

I have both the files saved in the same folder..
NA12878_DGV-2016_LR-assembly.vcf NA12878_DGV-2016_LR-assembly_DGV.vcf (having none Long Read variants or any omitted entry)

I changed the path of NA12878_DGV-2016_LR-assembly.vcf file in the evaluate_SV_callers.pl script my $ref_sv_NA12878 = "/DGV_Trial1/Ref_SV/NA12878_DGV-2016_LR-assembly.vcf";

I tried the command.. perl evaluate_SV_callers.pl -r N diploidSV_converted.vcf it's giving me usual output as >> Ref-DEL: 9176 tinny (<= 100 bp): 1192 short (<= 1.0 kp): 5558 middle (<= 100 kb): 2330 large (> 100 kb): 96 Ref-INS: 13669 short (<=1.0 kp): 12660 middle (<= 100 kb): 1008 large (> 100 kb): 1 Ref-DUP: 2604 short (<=1.0 kp): 873 middle (<= 100 kb): 1585 large (> 100 kb): 146 Ref-INV: 274 short (<=1.0 kp): 26 middle (<= 100 kb): 200 large (> 100 kb): 48 << NA12878_DGV-2016_LR-assembly >>

DEL

        <Number of supporting reads>
        2   3   4   5   6   7   8   9   10  12

Call (A) 5991 5945 5890 5823 5732 5628 5498 5368 5216 4902
Recall (A) 41.1 41 41 40.9 40.6 40.3 39.9 39.5 38.8 37.3
Precis (A) 63 63.4 63.8 64.4 65.1 65.8 66.7 67.6 68.3 69.9
Call (SS) 2291 2268 2241 2210 2171 2118 2054 1983 1912 1765
Recall (SS) 60.4 60.4 60.4 60.2 59.9 59.4 59.1 58.1 57.1 55.1
Precis (SS) 40.3 40.7 41.2 41.6 42.1 42.8 43.7 44.5 45 46.6
Call (S) 2887 2866 2842 2806 2756 2707 2648 2598 2529 2401
Recall (S) 42.1 42 41.9 41.8 41.5 41.1 40.6 40.3 39.5 38
Precis (S) 74.1 74.5 75 75.8 76.7 77.4 78.3 79.2 79.9 81.3
Call (M) 771 769 766 766 765 763 756 748 736 697 Recall (M) 30.1 30.1 30 30 30 29.9 29.7 29.5 29.1 27.7
Precis (M) 90.6 90.8 90.8 90.8 90.8 90.8 91 91.1 91.5 91.9
Call (L) 42 42 41 41 40 40 40 39 39 39
Recall (L) 12.5 12.5 11.4 11.4 10.4 10.4 10.4 10.4 10.4 10.4
Precis (L) 26.1 26.1 24.3 24.3 22.5 22.5 22.5 23 23 23 and so on for DUP, INS, INV

but doing either of these commands is giving same error..

perl evaluate_SV_callers.pl -r N -r2 NA12878_DGV-2016_LR-assembly_DGV.vcf diploidSV_converted.vcf or perl evaluate_SV_callers.pl -r2 NA12878_DGV-2016_LR-assembly_DGV.vcf diploidSV_converted.vcf

Then i tried changing the default NA12878_DGV-2016_LR-assembly.vcf to NA12878_DGV-2016_LR-assembly_DGV.vcf and keeping only NA12878_DGV-2016_LR-assembly_DGV.vcf file in this folder. also, changing the default path as my $ref_sv_NA12878 = "/DGV_Trial1/NA12878_DGV-2016_LR-assembly_DGV.vcf";

Still getting the same error. Not able to understand if somehing is trivial in the script itself.

Shall i mail you these raw files to check the issue ??

stat-lab commented 3 years ago

Try to specify the absolute path of your NA12878_DGV-2016_LR-assembly_DGV.vcf file with the -r2 option. (i.e., evaluate_SV_callers.pl -r2 [your SV vcf to be evaluated])

2020/10/04 17:08、Mkddb notifications@github.comのメール:

Yes,

I have both the files saved in the same folder.. NA12878_DGV-2016_LR-assembly.vcf NA12878_DGV-2016_LR-assembly_DGV.vcf (having none Long Read variants or any omitted entry)

I changed the path of NA12878_DGV-2016_LR-assembly.vcf file in the evaluate_SV_callers.pl script my $ref_sv_NA12878 = "/DGV_Trial1/Ref_SV/NA12878_DGV-2016_LR-assembly.vcf";

I tried the command.. perl evaluate_SV_callers.pl -r N diploidSV_converted.vcf it's giving me usual output as >> Ref-DEL: 9176 tinny (<= 100 bp): 1192 short (<= 1.0 kp): 5558 middle (<= 100 kb): 2330 large (> 100 kb): 96 Ref-INS: 13669 short (<=1.0 kp): 12660 middle (<= 100 kb): 1008 large (> 100 kb): 1 Ref-DUP: 2604 short (<=1.0 kp): 873 middle (<= 100 kb): 1585 large (> 100 kb): 146 Ref-INV: 274 short (<=1.0 kp): 26 middle (<= 100 kb): 200 large (> 100 kb): 48 << NA12878_DGV-2016_LR-assembly >>

DEL

  <Number of supporting reads>
  2   3   4   5   6   7   8   9   10  12

Call (A) 5991 5945 5890 5823 5732 5628 5498 5368 5216 4902 Recall (A) 41.1 41 41 40.9 40.6 40.3 39.9 39.5 38.8 37.3 Precis (A) 63 63.4 63.8 64.4 65.1 65.8 66.7 67.6 68.3 69.9 Call (SS) 2291 2268 2241 2210 2171 2118 2054 1983 1912 1765 Recall (SS) 60.4 60.4 60.4 60.2 59.9 59.4 59.1 58.1 57.1 55.1 Precis (SS) 40.3 40.7 41.2 41.6 42.1 42.8 43.7 44.5 45 46.6 Call (S) 2887 2866 2842 2806 2756 2707 2648 2598 2529 2401 Recall (S) 42.1 42 41.9 41.8 41.5 41.1 40.6 40.3 39.5 38 Precis (S) 74.1 74.5 75 75.8 76.7 77.4 78.3 79.2 79.9 81.3 Call (M) 771 769 766 766 765 763 756 748 736 697 Recall (M) 30.1 30.1 30 30 30 29.9 29.7 29.5 29.1 27.7 Precis (M) 90.6 90.8 90.8 90.8 90.8 90.8 91 91.1 91.5 91.9 Call (L) 42 42 41 41 40 40 40 39 39 39 Recall (L) 12.5 12.5 11.4 11.4 10.4 10.4 10.4 10.4 10.4 10.4 Precis (L) 26.1 26.1 24.3 24.3 22.5 22.5 22.5 23 23 23 and so on for DUP, INS, INV

but doing either of these commands is giving same error..

perl evaluate_SV_callers.pl -r N -r2 NA12878_DGV-2016_LR-assembly_DGV.vcf diploidSV_converted.vcf or perl evaluate_SV_callers.pl -r2 NA12878_DGV-2016_LR-assembly_DGV.vcf diploidSV_converted.vcf

Then i tried changing the default NA12878_DGV-2016_LR-assembly.vcf to NA12878_DGV-2016_LR-assembly_DGV.vcf and keeping only NA12878_DGV-2016_LR-assembly_DGV.vcf file in this folder. also, changing the default path as my $ref_sv_NA12878 = "/DGV_Trial1/NA12878_DGV-2016_LR-assembly_DGV.vcf";

Still getting the same error. Not able to understand if somehing is trivial in the script itself.

Shall i mail you these raw files to check the issue ??

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/12#issuecomment-703219463, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3ENH4OCYP3ZMWNS4UDSJAURPANCNFSM4SCUWGAQ.

Mkddb commented 3 years ago

Yes, i tried that also. Still getting the same error.

Main reason i am unable to find the root cause is "even deleting a single entry from the Reference SV vcf file is giving this error".

May i send you those raw files on mail and can you check, if it requires some modifications in the *VCF file or the script itself ?

stat-lab commented 3 years ago

I did the same thing with a modified ref SV vcf, where a single line had been deleted, but the run completed without any errors. The reference vcf files in the package is the same ones that I have in my server. Could you try to do with the original reference vcf file (NA12878_DGV-2016_LR-assembly.vcf) in the scripts folder as follows?

“./evaluate_sv_callers.pl -r2 [your vcf to be evaluated]

If this run do well, the similar run even with a modified ref vcf should complete.

Alternatively, does your edited vcf file contain ‘CR’ at the line end? All the files used in this system should contain only ‘LF’ at every line.

2020/10/05 13:48、Mkddb notifications@github.comのメール:

Yes, i tried that also. Still getting the same error.

Main reason i am unable to find the root cause is "even deleting a single entry from the Reference SV vcf file is giving this error".

May i send you those raw files on mail and can you check, if it requires some modifications in the *VCF file or the script itself ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/12#issuecomment-703393703, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3EWSAGOIFKVVANXI6TSJFF25ANCNFSM4SCUWGAQ.

Mkddb commented 3 years ago

Thanks a lot for your kind replies.

Finally, i figured out the issue. Yes, You are right. everything seems to work right with the script and ref SV files.

Actually, i was opening the "NA12878_DGV-2016_LR-assembly.vcf" file in LibreOffice with columns seperator options as "Tab, Comma and Semi-colon" and then filtering the "SOURCE=LONG_READ" or deleting an entry and saving the file in same format. Everything appeared normal because Unix only uses LF control characters. but, running evaluation script on this modified file with -r2 was giving error.

Screenshot from 2020-10-05 19-08-24

Then, i opened the same file in same LibreOffice, but without a Seperator options. Removed a few entries, saved it in default format and provided this file as -r2 option and it worked nicely. also, removed entries didn't show up in final numbers. Though, it appears that i might need to manually remove all the "SOURCE=LONG_READ" but that's okay as long as the analysis is working. this would make things easier for me, if you can suggest or share a smart script to filter our Long_read variants from Ref_SV file. Screenshot from 2020-10-05 19-08-33

Thanks a lot for your valuable inputs and comments.

stat-lab commented 3 years ago

You can do it simply with a command ‘grep SOURCE=DGV NA12878_DGV-2016_LR-assembly.vcf > NA12878_DGV-2016_LR-assembly.onlyGDV.vcf'.

2020/10/05 23:26、Mkddb notifications@github.comのメール:

Thanks a lot for your kind replies.

Finally, i figured out the issue. Yes, You are right. everything seems to work right with the script and ref SV files.

Actually, i was opening the "NA12878_DGV-2016_LR-assembly.vcf" file in LibreOffice with columns seperator options as "Tab, Comma and Semi-colon" and then filtering the "SOURCE=LONG_READ" or deleting an entry and saving the file in same format. Everything appeared normal because Unix only uses LF control characters. but, running evaluation script on this modified file with -r2 was giving error.

https://user-images.githubusercontent.com/61584787/95091874-9db53080-0744-11eb-993f-10ab26c36541.png Then, i opened the same file in same LibreOffice, but without a Seperator options. Removed a few entries, saved it in default format and provided this file as -r2 option and it worked nicely. also, removed entries didn't show up in final numbers. Though, it appears that i might need to manually remove all the "SOURCE=LONG_READ" but that's okay as long as the analysis is working. this would make things easier for me, if you can suggest or share a smart script to filter our Long_read variants from Ref_SV file. https://user-images.githubusercontent.com/61584787/95091893-a3ab1180-0744-11eb-93f0-4bf3a48b7499.png Thanks a lot for your valuable inputs and comments.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/12#issuecomment-703667946, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3EYTU2KEPA7S7FIM73SJHJTDANCNFSM4SCUWGAQ.

Mkddb commented 3 years ago

Great, Thanks for the help. Marking it as closed now.