stat-lab / MOPline

Detection and genotyping of structural variants
MIT License

the real SV datasets of NA12878 and NA19240 #7

Open jingydz opened 9 months ago

jingydz commented 9 months ago

Hi, which files are the final reference SV datasets for NA12878 and NA19240? And which one is the GRCh38 version?

In your paper, you wrote: "The reference SVs for NA19240 (38,562) contained 1.5-fold more SVs than the reference SVs for NA12878 (25,736), probably due to the remaining redundant SVs in the HGSV data."

I downloaded the HGSVC data "variants_freeze4_sv_insdel_alt.vcf.gz" and ran:

$ bcftools view -s NA19240 --exclude-uncalled --threads 10 variants_freeze4_sv_insdel_alt.vcf.gz -Oz -o hgsvc.NA19240.vcf.gz
$ zcat hgsvc.NA19240.vcf.gz |grep -v "^#" |grep -v "0|0" |wc -l
30630

I found that NA19240 had only 20394 SVs, and the HGSVC dataset has only INS and DEL, no other SV types.

$ bcftools view -s NA12878 --exclude-uncalled --threads 10 variants_freeze4_sv_insdel_alt.vcf.gz -Oz -o hgsvc.NA12878.vcf.gz
$ zcat hgsvc.NA12878.vcf.gz |grep -v "^#" |grep -v "0|0" |wc -l
27538
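The `grep -v "0|0"` step above is a text match on the whole line rather than a test of the genotype field. As a cross-check, the sketch below counts records in which a sample carries at least one non-reference allele by parsing the GT entry directly. It assumes a standard tab-separated VCF with a `#CHROM` header line and a `GT` key in the FORMAT column; the function names are illustrative and are not part of bcftools or of the authors' scripts.

```python
import gzip


def has_alt_allele(gt):
    """Return True if a GT string such as '0|1' or '1/1' carries a non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def count_carried_variants(lines, sample):
    """Count records in which `sample` has at least one non-reference allele.

    Assumption: `lines` is an iterable of VCF text lines that includes the
    '#CHROM' header line naming the samples.
    """
    sample_col = None
    n = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_col = fields.index(sample)  # raises ValueError if the sample is absent
            continue
        if sample_col is None:
            raise ValueError("no #CHROM header line seen before the first record")
        fmt_keys = fields[8].split(":")
        gt = fields[sample_col].split(":")[fmt_keys.index("GT")]
        if has_alt_allele(gt):
            n += 1
    return n


def count_carried_variants_in_file(path, sample):
    """Convenience wrapper for a gzip-compressed VCF."""
    with gzip.open(path, "rt") as fh:
        return count_carried_variants(fh, sample)
```

For example, `count_carried_variants_in_file("hgsvc.NA19240.vcf.gz", "NA19240")` would give a count that can be compared with the `wc -l` result above; any difference would point to lines that the text filter treated differently from the genotype test.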

In addition, I downloaded https://github.com/stat-lab/EvalSVcallers/blob/master/Ref_SV/NA12878_DGV-2015_LR-assembly.vcf and https://github.com/stat-lab/EvalSVcallers/blob/master/Ref_SV/NA12878_DGV-2016_LR-assembly.vcf

$ zcat NA12878_DGV-2015_LR-assembly-maybeHG19.vcf.gz |wc -l
24736

$ zcat NA12878_DGV-2016_LR-assembly-maybeHG38.vcf.gz |wc -l
25812

I noticed that these numbers don't match the ones you provided earlier. Can you please be more specific about how you processed them?
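One thing to keep in mind when comparing such counts: `zcat file.vcf.gz | wc -l` counts every line, so any `#` header lines in a file are included in the total. The sketch below separates header lines from data records. It assumes only that header lines start with `#`, as in standard VCF; whether these particular files carry header lines is not stated in this thread, and the function names are illustrative.

```python
import gzip


def count_records(lines):
    """Return (header_lines, data_records) for an iterable of VCF text lines."""
    header = 0
    records = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            header += 1
        else:
            records += 1
    return header, records


def count_records_in_file(path):
    """Open a plain or gzip-compressed VCF and count header lines and records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return count_records(fh)
```

The second element of the returned tuple is the number to compare against SV counts quoted elsewhere, since it excludes the header.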

stat-lab commented 9 months ago

The reference SV sets of NA12878 and NA19240 used for the evaluation of SV detection tools (Fig. 2 and Figs. S1-S7) are GRCh37-based. As described in the Methods section of the paper, the NA12878 reference SV set was generated by combining the DGV variant data (2016-05-15 version) with long read-based SVs (Pendleton et al., Nat Methods 2015) without redundancy. For the NA19240 reference SV set, the DGV variants (2016-05-15 version) were combined with nstd152.GRCh37.variant_call.vcf.gz obtained from the NCBI dbVar site, which contained variants of >= 30 bp.

jingydz commented 9 months ago

Could you please share the final reference SV sets of NA12878 and NA19240? Thank you in advance.

stat-lab commented 9 months ago

The NA19240 reference SV VCF file has been uploaded to https://github.com/stat-lab/EvalSVcallers/tree/master/Ref_SV. The reference VCF for NA12878 is NA12878_DGV-2016_LR-assembly.vcf in the same directory.

jingydz commented 9 months ago

Are both of these files (NA19240.nstd152.DGV.GRCh37.sv.min30.vcf, NA12878_DGV-2016_LR-assembly.vcf) based on GRCh37? If I need GRCh38, do I need to convert them to GRCh38?

jingydz commented 9 months ago

In your paper, you wrote: "The reference SVs for NA19240 (38,562) contained 1.5-fold more SVs than the reference SVs for NA12878 (25,736), probably due to the remaining redundant SVs in the HGSV data."

I checked the VCF files:

$ zcat NA19240.nstd152.DGV.GRCh37.sv.min30.vcf.gz |wc -l
40961
$ zcat NA12878_DGV-2016_LR-assembly.vcf.gz |wc -l
25812

Could you tell me what processing you did afterwards to obtain the number mentioned in the paper?

stat-lab commented 9 months ago

Sorry for the confusion. The numbers in the paper are the SV counts after removing DELs/DUPs/INVs larger than 2 Mb and DELs/DUPs overlapping the GRCh37 gap regions, because our evaluation script removes these variants from the reference SVs during evaluation (see https://github.com/stat-lab/EvalSVcallers/blob/master/scripts/evaluate_SV_callers.pl).
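The two filters described above can be sketched roughly as follows. This is not the logic of evaluate_SV_callers.pl, only an illustration under stated assumptions: each record's INFO field carries `SVTYPE` plus either `SVLEN` or `END`, gap regions are supplied as `(chrom, start, end)` tuples in the same coordinate system as the VCF, and any overlap with a gap counts. The actual overlap criterion used by the script is not given in this thread, and all function names here are hypothetical.

```python
from collections import Counter

MAX_LEN = 2_000_000  # the 2 Mb threshold quoted in the reply above


def parse_info(info):
    """Turn 'SVTYPE=DEL;SVLEN=-500' into {'SVTYPE': 'DEL', 'SVLEN': '-500'}."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def sv_span(pos, info):
    """Return (svtype, start, end, length); assumes SVTYPE plus SVLEN or END in INFO."""
    svtype = info.get("SVTYPE", "NA")
    if "SVLEN" in info:
        length = abs(int(info["SVLEN"].split(",")[0]))
    elif "END" in info:
        length = int(info["END"]) - pos
    else:
        length = 0
    # An insertion occupies a single reference position.
    end = pos if svtype == "INS" else pos + length
    return svtype, pos, end, length


def overlaps_gap(chrom, start, end, gaps):
    """True if [start, end] shares any position with a gap on the same chromosome."""
    return any(c == chrom and start <= g_end and end >= g_start
               for c, g_start, g_end in gaps)


def keep_record(chrom, pos, info_str, gaps):
    """Apply the size filter (DEL/DUP/INV) and the gap filter (DEL/DUP)."""
    svtype, start, end, length = sv_span(pos, parse_info(info_str))
    if svtype in ("DEL", "DUP", "INV") and length > MAX_LEN:
        return False
    if svtype in ("DEL", "DUP") and overlaps_gap(chrom, start, end, gaps):
        return False
    return True


def count_by_type(lines, gaps):
    """Count retained records per SVTYPE, skipping header and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if keep_record(f[0], int(f[1]), f[7], gaps):
            counts[parse_info(f[7]).get("SVTYPE", "NA")] += 1
    return counts
```

Summing the per-type counts gives a total that can be compared with the figures quoted from the paper. A mismatch may simply reflect a different overlap rule or gap table from the one assumed here.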

jingydz commented 9 months ago

Hi, I filtered the two files you provided according to your instructions, but I am still not getting the same numbers (except for INS). Why is that?