piyalkarum / rCNV

An R package for detecting copy number variants from SNPs data
https://piyalkarum.github.io/rCNV/
GNU Affero General Public License v3.0
5 stars 1 forks source link

first attempt #1

Closed wbsimey closed 10 months ago

wbsimey commented 11 months ago

Hello, I am trying your package for the first time, but got errors on install and it failed the very first example line. I am on Ubuntu 20, R version 4.3.1 (2023-06-16). I started with:

install.packages("rCNV")

the error and warnings were:

ERROR: dependency ‘reshape2’ is not available for package ‘qgraph’
* removing ‘/home/bsimison/R/x86_64-pc-linux-gnu-library/4.3/qgraph’
ERROR: dependency ‘qgraph’ is not available for package ‘rCNV’
* removing ‘/home/bsimison/R/x86_64-pc-linux-gnu-library/4.3/rCNV’

and

Warning messages:
1: In install.packages("rCNV") :
  installation of package ‘reshape2’ had non-zero exit status
2: In install.packages("rCNV") :
  installation of package ‘qgraph’ had non-zero exit status
3: In install.packages("rCNV") :
  installation of package ‘rCNV’ had non-zero exit status
> library(rCNV)
Error in library(rCNV) : there is no package called ‘rCNV’
> vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")
Error in path.package("rCNV") : none of the packages are loaded

What have I done wrong? thank you,

wbsimey commented 11 months ago

UPDATE: I installed the development version and it is working on example data.

wbsimey commented 11 months ago

I am analyzing whole genome data for 60 mammalian samples (2.3Gb ea). We used bcf to call variants. My vcf.gz is 32G. Is there a way to multithread the package? PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3526967 bsimison 20 0 303.4g 146.9g 64.0g S 226.8 7.3 48:17.37 R

piyalkarum commented 11 months ago

Hi Brian, I'm glad you were able to install the developmental version successfully, which actually has more of the latest changes to the package. Regarding your question about multithreading the analysis, currently, we're working on different methods to implement parallelization of the package. Still, unfortunately, it is not yet included in either of the versions available online.

32 GB does sound like a big vcf file. However, we've managed to work on vcf files of up to 60 gb on a server with RAM of 252 gb. So, my first suggestion is to try the analysis on a computer cluster/server with more processing power. In this case, you might want to change the memory allocation for R if it is insufficient (R --max-mem-size=Inf).

My second suggestion is to split your vcf file into smaller chunks (of SNPs), generate the allele depth and genotype tables normally using hetTgen(), save them separately, and then merge them in a separate session with rbind(), after which you can continue the analysis following the workflow. Most functions accept a data frame with the correct columns, including relatedness and h.zygosity functions (please check the help pages). And as you can imagine, if you can import only the filtered version of the vcfs, it will be even faster.

My third suggestion is to import the genotype and allele depth tables to R instead of the vcf file. If you look at the output of hetTgen() function, the data frame format looks like this "CHROM | POS | ID | ALT | sample1 | sample2 etc.". Therefore, you can use a program with faster languages like vcftools or gatK to extract the depth and genotype tables and import them using generic functions like read.table() or data.table::fread(), after which you can continue with the analysis.

As I mentioned earlier, we're still working on parallelizing the analysis for using a multithreaded approach and we will announce it once it's been fully tested. Apologies for the inconvenience and thank you for using our package.

wbsimey commented 11 months ago

thank you for the suggestions. Much appreciated. I got stuck at the relatedness step. I fear it is format issue. The following is what I have done so far:

> vcf.file.path <-  "/Vcfs/Nf_ref/Nfusc_ref_60samp_bisnps_no_rpt.chr-only.vcf.gz"
> vcf <- readVCF(vcf.file.path,verbose = FALSE)
|--------------------------------------------------|
|==================================================|
> vcf$vcf[1:10,1:10]
     #CHROM  POS ID REF ALT       QUAL FILTER
 1: Chr1_nf  223  .   A   G   12.14790   PASS
 2: Chr1_nf  227  .   G   A    3.52438   PASS
 3: Chr1_nf  253  .   T   C  718.17200   PASS
 4: Chr1_nf 1133  .   A   C   72.98750   PASS
 5: Chr1_nf 1213  .   T   A   11.86950   PASS
 6: Chr1_nf 1261  .   C   T 1997.20000   PASS
 7: Chr1_nf 1262  .   A   G 5517.70000   PASS
 8: Chr1_nf 1272  .   T   A   61.08510   PASS
 9: Chr1_nf 1281  .   G   A 1693.36000   PASS
10: Chr1_nf 1291  .   A   C 4326.44000   PASS
                                                                                                                                                                                                                                         INFO
 1:                              DP=243;SGB=-0.62851;RPBZ=-1.04955;MQBZ=0.149616;MQSBZ=1.68224;BQBZ=0.561525;SCBZ=-0.194582;MQ0F=0;AC=1;AN=88;DP4=70,66,1,0;MQ=59;NS=44;AF=0.0113636;MAF=0.0113636;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
 2:                                          DP=236;SGB=-0.62851;RPBZ=-0.560139;MQBZ=0;MQSBZ=0;BQBZ=0.539117;SCBZ=-0.197596;MQ0F=0;AC=1;AN=88;DP4=69,63,1,0;MQ=60;NS=44;AF=0.0113636;MAF=0.0113636;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
 3: DP=199;VDB=0.618714;SGB=22.5396;RPBZ=-1.88304;MQBZ=0.497183;MQSBZ=-1.00905;BQBZ=-0.604236;SCBZ=0.297817;MQ0F=0;AC=8;AN=84;DP4=48,41,8,14;MQ=59;NS=42;AF=0.0952381;MAF=0.0952381;AC_Het=4;AC_Hom=4;AC_Hemi=0;HWE=0.0296622;ExcHet=0.999217
 4:                                    DP=406;VDB=0.00701511;SGB=11.3066;RPBZ=-2.58897;MQBZ=0;MQSBZ=0;BQBZ=-3.80939;SCBZ=-0.269644;MQ0F=0;AC=1;AN=100;DP4=110,113,2,2;MQ=60;NS=50;AF=0.01;MAF=0.01;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
 5:                           DP=317;SGB=-0.62851;RPBZ=1.18978;MQBZ=0.0743294;MQSBZ=-0.895533;BQBZ=0.650265;SCBZ=-0.105408;MQ0F=0;AC=1;AN=90;DP4=80,101,1,0;MQ=59;NS=45;AF=0.0111111;MAF=0.0111111;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
 6:                   DP=350;VDB=0.994909;SGB=31.8732;RPBZ=0.641388;MQBZ=0;MQSBZ=0;BQBZ=0.0658554;SCBZ=-0.209147;MQ0F=0;AC=25;AN=96;DP4=77,74,27,20;MQ=60;NS=48;AF=0.260417;MAF=0.260417;AC_Het=5;AC_Hom=20;AC_Hemi=0;HWE=1.4163e-06;ExcHet=1
 7:                 DP=352;VDB=0.999629;SGB=38.809;RPBZ=1.2117;MQBZ=0;MQSBZ=0;BQBZ=0.132143;SCBZ=1.46634;MQ0F=0;AC=74;AN=96;DP4=28,31,77,63;MQ=60;NS=48;AF=0.770833;MAF=0.229167;AC_Het=6;AC_Hom=68;AC_Hemi=0;HWE=3.55894e-05;ExcHet=0.999999
 8:                                DP=379;VDB=0.96;SGB=3.23492;RPBZ=-0.0347515;MQBZ=0;MQSBZ=0;BQBZ=-0.388981;SCBZ=2.72709;MQ0F=0;AC=1;AN=96;DP4=104,106,1,1;MQ=60;NS=48;AF=0.0104167;MAF=0.0104167;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1
 9:                       DP=369;VDB=0.315555;SGB=36.4566;RPBZ=-2.71283;MQBZ=0;MQSBZ=0;BQBZ=1.2974;SCBZ=2.67418;MQ0F=0;AC=20;AN=96;DP4=81,82,23,18;MQ=60;NS=48;AF=0.208333;MAF=0.208333;AC_Het=4;AC_Hom=16;AC_Hemi=0;HWE=2.57538e-06;ExcHet=1
10:                     DP=369;VDB=0.915685;SGB=53.5568;RPBZ=-1.02069;MQBZ=0;MQSBZ=0;BQBZ=0.718963;SCBZ=1.07016;MQ0F=0;AC=52;AN=96;DP4=45,51,58,50;MQ=60;NS=48;AF=0.541667;MAF=0.458333;AC_Het=2;AC_Hom=50;AC_Hemi=0;HWE=6.77963e-12;ExcHet=1
                     FORMAT                   GG15_Nb
 1: GT:PL:DP:AD:GQ:VAF:VAF1   0/0:0,3,60:1:1,0:19:0:0
 2: GT:PL:DP:AD:GQ:VAF:VAF1   0/0:0,3,60:1:1,0:23:0:0
 3: GT:PL:DP:AD:GQ:VAF:VAF1    0/0:0,3,60:1:1,0:8:0:0
 4: GT:PL:DP:AD:GQ:VAF:VAF1  0/0:0,6,110:2:2,0:23:0:0
 5: GT:PL:DP:AD:GQ:VAF:VAF1   0/0:0,3,60:1:1,0:19:0:0
 6: GT:PL:DP:AD:GQ:VAF:VAF1    0/0:0,3,48:1:1,0:5:0:0
 7: GT:PL:DP:AD:GQ:VAF:VAF1   0/1:0,6,110:2:2,0:3:0:0
 8: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,12,184:4:4,0:27:0:0
 9: GT:PL:DP:AD:GQ:VAF:VAF1 0/0:0,12,177:4:4,0:13:0:0
10: GT:PL:DP:AD:GQ:VAF:VAF1  0/0:0,12,212:4:4,0:9:0:0

> fl<-"/Vcfs/Nf_ref/Nfusc_ref_60samp_bisnps_no_rpt.chr-only.vcf.gz"
> neotoma<-readVCF(fl,verbose=F)
|--------------------------------------------------|
|==================================================|
> mss<-get.miss(neotoma,verbose = F, plot=F)

> head(mss$perSample)
    indiv   n_miss     f_miss
1 GG15_Nb 21008501 0.16180989
2 GG18_Nb 18612236 0.14335358
3 GG21_Nb 26996285 0.20792849
4 GG22_Nb 26212519 0.20189183
5 GG48_Nb 10693752 0.08236451
6 GG49_Nb 20498896 0.15788485
> mss<-get.miss(neotoma,verbose = TRUE, plot=TRUE)
generating table
  |==================================================| 100%
> dev.copy(png, "missing_samples_plot_rCNV.png")
png
  3
> dev.off()
null device
          1
> dim(neotoma$vcf)
[1] 129834471        69

> sam<-which(mss$perSample$f_miss>0.5)+9 

> neotoma<-data.frame(neotoma$vcf)[,-sam]

> dim(neotoma)
[1] 129834471         0

> rels<-relatedness(neotoma)
Error in 1:ncol(gg) : argument of length 0

> neotoma
data frame with 0 columns and 129834471 rows

My vcf seems to have same columns as the example: MINE: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GG15_Nb GG18_Nb GG21_Nb GG22_Nb GG48_Nb EXAMPLE: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DP_1 DP_10 DP_11 DP_12 DP_13 DP_14 DP_15

Can you see what I am doing wrong?

piyalkarum commented 10 months ago

Looking at the error you got, the problem is with the number of samples/snps with missing data. Before, you do this step neotoma<-data.frame(neotoma$vcf)[,-sam] check how many of your samples have missing data and according to the percentage you want to filter them below (e.g., 0.5). And then check if you get any missing samples at this step sam<-which(mss$perSample$f_miss>0.5)+9. If that doesn't get you any output, you don't need to remove any samples. Repeat the same for SNPs.

I'm closing this issue because the issue is not with the package itself but with how you handle the data. Please open this in the discussions and I can help you there.