Closed wbsimey closed 10 months ago
I am analyzing whole genome data for 60 mammalian samples (2.3Gb ea). We used bcf to call variants. My vcf.gz is 32G. Is there a way to multithread the package? PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3526967 bsimison 20 0 303.4g 146.9g 64.0g S 226.8 7.3 48:17.37 R
Hi Brian, I'm glad you were able to install the developmental version successfully, which actually has more of the latest changes to the package. Regarding your question about multithreading the analysis, currently, we're working on different methods to implement parallelization of the package. Still, unfortunately, it is not yet included in either of the versions available online.
32 GB does sound like a big vcf file. However, we've managed to work on vcf files of up to 60 gb on a server with RAM of 252 gb. So, my first suggestion is to try the analysis on a computer cluster/server with more processing power. In this case, you might want to change the memory allocation for R if it is insufficient (R --max-mem-size=Inf).
My second suggestion is to split your vcf file into smaller chunks (of SNPs), generate the allele depth and genotype tables normally using hetTgen(), save them separately, and then merge them in a separate session with rbind(), after which you can continue the analysis following the workflow. Most functions accept a data frame with the correct columns, including relatedness and h.zygosity functions (please check the help pages). And as you can imagine, if you can import only the filtered version of the vcfs, it will be even faster.
My third suggestion is to import the genotype and allele depth tables to R instead of the vcf file. If you look at the output of hetTgen() function, the data frame format looks like this "CHROM | POS | ID | ALT | sample1 | sample2 etc.". Therefore, you can use a program with faster languages like vcftools or gatK to extract the depth and genotype tables and import them using generic functions like read.table() or data.table::fread(), after which you can continue with the analysis.
As I mentioned earlier, we're still working on parallelizing the analysis for using a multithreaded approach and we will announce it once it's been fully tested. Apologies for the inconvenience and thank you for using our package.
thank you for the suggestions. Much appreciated. I got stuck at the relatedness step. I fear it is format issue. The following is what I have done so far:
> vcf.file.path <- "/Vcfs/Nf_ref/Nfusc_ref_60samp_bisnps_no_rpt.chr-only.vcf.gz"
> vcf <- readVCF(vcf.file.path,verbose = FALSE)
|--------------------------------------------------|
|==================================================|
> vcf$vcf[1:10,1:10]
#CHROM POS ID REF ALT QUAL FILTER
1: Chr1_nf 223 . A G 12.14790 PASS
2: Chr1_nf 227 . G A 3.52438 PASS
3: Chr1_nf 253 . T C 718.17200 PASS
4: Chr1_nf 1133 . A C 72.98750 PASS
5: Chr1_nf 1213 . T A 11.86950 PASS
6: Chr1_nf 1261 . C T 1997.20000 PASS
7: Chr1_nf 1262 . A G 5517.70000 PASS
8: Chr1_nf 1272 . T A 61.08510 PASS
9: Chr1_nf 1281 . G A 1693.36000 PASS
10: Chr1_nf 1291 . A C 4326.44000 PASS
INFO
1: DP=243;SGB=-0.62851;RPBZ=-1.04955;MQBZ=0.149616;MQSBZ=1.68224;BQBZ=0.561525;SCBZ=-0.194582;MQ0F=0;AC=1;AN=88;DP4=70,66,1,0;MQ=59;NS=44;AF=0.0113636;MAF=0.0113636;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
2: DP=236;SGB=-0.62851;RPBZ=-0.560139;MQBZ=0;MQSBZ=0;BQBZ=0.539117;SCBZ=-0.197596;MQ0F=0;AC=1;AN=88;DP4=69,63,1,0;MQ=60;NS=44;AF=0.0113636;MAF=0.0113636;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
3: DP=199;VDB=0.618714;SGB=22.5396;RPBZ=-1.88304;MQBZ=0.497183;MQSBZ=-1.00905;BQBZ=-0.604236;SCBZ=0.297817;MQ0F=0;AC=8;AN=84;DP4=48,41,8,14;MQ=59;NS=42;AF=0.0952381;MAF=0.0952381;AC_Het=4;AC_Hom=4;AC_Hemi=0;HWE=0.0296622;ExcHet=0.999217
4: DP=406;VDB=0.00701511;SGB=11.3066;RPBZ=-2.58897;MQBZ=0;MQSBZ=0;BQBZ=-3.80939;SCBZ=-0.269644;MQ0F=0;AC=1;AN=100;DP4=110,113,2,2;MQ=60;NS=50;AF=0.01;MAF=0.01;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
5: DP=317;SGB=-0.62851;RPBZ=1.18978;MQBZ=0.0743294;MQSBZ=-0.895533;BQBZ=0.650265;SCBZ=-0.105408;MQ0F=0;AC=1;AN=90;DP4=80,101,1,0;MQ=59;NS=45;AF=0.0111111;MAF=0.0111111;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
6: DP=350;VDB=0.994909;SGB=31.8732;RPBZ=0.641388;MQBZ=0;MQSBZ=0;BQBZ=0.0658554;SCBZ=-0.209147;MQ0F=0;AC=25;AN=96;DP4=77,74,27,20;MQ=60;NS=48;AF=0.260417;MAF=0.260417;AC_Het=5;AC_Hom=20;AC_Hemi=0;HWE=1.4163e-06;ExcHet=1
7: DP=352;VDB=0.999629;SGB=38.809;RPBZ=1.2117;MQBZ=0;MQSBZ=0;BQBZ=0.132143;SCBZ=1.46634;MQ0F=0;AC=74;AN=96;DP4=28,31,77,63;MQ=60;NS=48;AF=0.770833;MAF=0.229167;AC_Het=6;AC_Hom=68;AC_Hemi=0;HWE=3.55894e-05;ExcHet=0.999999
8: DP=379;VDB=0.96;SGB=3.23492;RPBZ=-0.0347515;MQBZ=0;MQSBZ=0;BQBZ=-0.388981;SCBZ=2.72709;MQ0F=0;AC=1;AN=96;DP4=104,106,1,1;MQ=60;NS=48;AF=0.0104167;MAF=0.0104167;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
9: DP=369;VDB=0.315555;SGB=36.4566;RPBZ=-2.71283;MQBZ=0;MQSBZ=0;BQBZ=1.2974;SCBZ=2.67418;MQ0F=0;AC=20;AN=96;DP4=81,82,23,18;MQ=60;NS=48;AF=0.208333;MAF=0.208333;AC_Het=4;AC_Hom=16;AC_Hemi=0;HWE=2.57538e-06;ExcHet=1
10: DP=369;VDB=0.915685;SGB=53.5568;RPBZ=-1.02069;MQBZ=0;MQSBZ=0;BQBZ=0.718963;SCBZ=1.07016;MQ0F=0;AC=52;AN=96;DP4=45,51,58,50;MQ=60;NS=48;AF=0.541667;MAF=0.458333;AC_Het=2;AC_Hom=50;AC_Hemi=0;HWE=6.77963e-12;ExcHet=1
FORMAT GG15_Nb
1: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,3,60:1:1,0:19:0:0
2: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,3,60:1:1,0:23:0:0
3: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,3,60:1:1,0:8:0:0
4: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,6,110:2:2,0:23:0:0
5: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,3,60:1:1,0:19:0:0
6: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,3,48:1:1,0:5:0:0
7: GT:PL:DP:AD:GQ:VAF:VAF1 0/1:0,6,110:2:2,0:3:0:0
8: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,12,184:4:4,0:27:0:0
9: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,12,177:4:4,0:13:0:0
10: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,12,212:4:4,0:9:0:0
> fl<-"/Vcfs/Nf_ref/Nfusc_ref_60samp_bisnps_no_rpt.chr-only.vcf.gz"
> neotoma<-readVCF(fl,verbose=F)
|--------------------------------------------------|
|==================================================|
> mss<-get.miss(neotoma,verbose = F, plot=F)
> head(mss$perSample)
indiv n_miss f_miss
1 GG15_Nb 21008501 0.16180989
2 GG18_Nb 18612236 0.14335358
3 GG21_Nb 26996285 0.20792849
4 GG22_Nb 26212519 0.20189183
5 GG48_Nb 10693752 0.08236451
6 GG49_Nb 20498896 0.15788485
> mss<-get.miss(neotoma,verbose = TRUE, plot=TRUE)
generating table
|==================================================| 100%
> dev.copy(png, "missing_samples_plot_rCNV.png")
png
3
> dev.off()
null device
1
> dim(neotoma$vcf)
[1] 129834471 69
> sam<-which(mss$perSample$f_miss>0.5)+9
> neotoma<-data.frame(neotoma$vcf)[,-sam]
> dim(neotoma)
[1] 129834471 0
> rels<-relatedness(neotoma)
Error in 1:ncol(gg) : argument of length 0
> neotoma
data frame with 0 columns and 129834471 rows
My vcf seems to have same columns as the example:
MINE:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GG15_Nb GG18_Nb GG21_Nb GG22_Nb GG48_Nb
EXAMPLE:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DP_1 DP_10 DP_11 DP_12 DP_13 DP_14 DP_15
Can you see what I am doing wrong?
Looking at the error you got, the problem is with the number of samples/snps with missing data.
Before, you do this step neotoma<-data.frame(neotoma$vcf)[,-sam]
check how many of your samples have missing data and according to the percentage you want to filter them below (e.g., 0.5). And then check if you get any missing samples at this step sam<-which(mss$perSample$f_miss>0.5)+9
. If that doesn't get you any output, you don't need to remove any samples. Repeat the same for SNPs.
I'm closing this issue because the issue is not with the package itself but with how you handle the data. Please open this in the discussions and I can help you there.
Hello, I am trying your package for the first time, but got errors on install and it failed the very first example line. I am on Ubuntu 20, R version 4.3.1 (2023-06-16). I started with:
install.packages("rCNV")
the error and warnings were:
and
What have I done wrong? thank you,