Error in subset.default(infos, pNA > 0.001): object 'pNA' not found

moldach commented 6 years ago

I'm having some trouble with the "preprocessing" script and not able to generate the plot at the end.

I'm using my own data, loading in .ped and .map files and converting them with snp_plinkQC() and snp_plinkIBDQC().

plink <- download_plink() NCORES <- nb_cores() technoQC.bed <- snp_plinkQC(prefix.in = "data/technoTeens", plink.path = plink, file.type = "--file", # ped/map geno = 0.05, mind = 0.05, maf = 0.05, hwe = 1e-10, autosome.only = TRUE )

PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/ (C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to data/technoTeens_QC.log. Options in effect: --autosome --file data/technoTeens --geno 0.05 --hwe 1e-10 --maf 0.05 --make-bed --mind 0.05 --out data/technoTeens_QC

257658 MB RAM detected; reserving 128829 MB for main workspace. Allocated 96621 MB successfully, after larger attempt(s) failed. .ped scan complete (for binary autoconversion). Performing single-pass .bed write (669934 variants, 423 people). --file: data/technoTeens_QC-temporary.bed + data/technoTeens_QC-temporary.bim + data/technoTeens_QC-temporary.fam written. 669934 variants loaded from .bim file. 423 people (0 males, 0 females, 423 ambiguous) loaded from .fam. Ambiguous sex IDs written to data/technoTeens_QC.nosex . 1 person removed due to missing genotype data (--mind). ID written to data/technoTeens_QC.irem . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 422 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.992935. 8604 variants removed due to missing genotype data (--geno). --hwe: 2294 variants removed due to Hardy-Weinberg exact test. 338458 variants removed due to minor allele threshold(s) (--maf/--max-maf/--mac/--max-mac). 320578 variants and 422 people pass filters and QC. Note: No phenotypes present. --make-bed to data/technoTeens_QC.bed + data/technoTeens_QC.bim + data/technoTeens_QC.fam ... done.

technoQC2.bed <- snp_plinkIBDQC( bedfile.in = technoQC.bed, plink.path = plink, ncores = NCORES ) PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/ (C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to /tmp/Rtmpg7a6zA/file6adc6e7d579f.log. Options in effect: --bfile data/technoTeens_QC --indep-pairwise 100 1 0.2 --out /tmp/Rtmpg7a6zA/file6adc6e7d579f

257658 MB RAM detected; reserving 128829 MB for main workspace. Allocated 96621 MB successfully, after larger attempt(s) failed. 320578 variants loaded from .bim file. 422 people (0 males, 0 females, 422 ambiguous) loaded from .fam. Ambiguous sex IDs written to /tmp/Rtmpg7a6zA/file6adc6e7d579f.nosex . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 422 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.997882. 320578 variants and 422 people pass filters and QC. Note: No phenotypes present. Pruned 13777 variants from chromosome 1, leaving 11023. Pruned 15056 variants from chromosome 2, leaving 10855. Pruned 12506 variants from chromosome 3, leaving 9649. Pruned 11249 variants from chromosome 4, leaving 8904. Pruned 10497 variants from chromosome 5, leaving 8289. Pruned 16064 variants from chromosome 6, leaving 8472. Pruned 10190 variants from chromosome 7, leaving 7938. Pruned 9629 variants from chromosome 8, leaving 7051. Pruned 7762 variants from chromosome 9, leaving 6492. Pruned 8934 variants from chromosome 10, leaving 7148. Pruned 8982 variants from chromosome 11, leaving 6730. Pruned 8438 variants from chromosome 12, leaving 6958. Pruned 6310 variants from chromosome 13, leaving 5307. Pruned 5699 variants from chromosome 14, leaving 4841. Pruned 5450 variants from chromosome 15, leaving 4709. Pruned 5894 variants from chromosome 16, leaving 5071. Pruned 5003 variants from chromosome 17, leaving 4729. Pruned 5036 variants from chromosome 18, leaving 4680. Pruned 3750 variants from chromosome 19, leaving 3755. Pruned 4288 variants from chromosome 20, leaving 3808. Pruned 2469 variants from chromosome 21, leaving 2258. Pruned 2496 variants from chromosome 22, leaving 2432. Pruning complete. 179479 of 320578 variants removed. Marker lists written to /tmp/Rtmpg7a6zA/file6adc6e7d579f.prune.in and /tmp/Rtmpg7a6zA/file6adc6e7d579f.prune.out . PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/ (C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to /tmp/Rtmpg7a6zA/file6adc6e7d579f.log. Options in effect: --bfile data/technoTeens_QC --extract /tmp/Rtmpg7a6zA/file6adc6e7d579f.prune.in --genome --min 0.08 --out /tmp/Rtmpg7a6zA/file6adc6e7d579f --threads 23

257658 MB RAM detected; reserving 128829 MB for main workspace. Allocated 96621 MB successfully, after larger attempt(s) failed. 320578 variants loaded from .bim file. 422 people (0 males, 0 females, 422 ambiguous) loaded from .fam. Ambiguous sex IDs written to /tmp/Rtmpg7a6zA/file6adc6e7d579f.nosex . --extract: 141099 variants remaining. Using up to 23 threads (change this with --threads). Before main variant filters, 422 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.997893. 141099 variants and 422 people pass filters and QC. Note: No phenotypes present. IBD calculations complete.
Finished writing /tmp/Rtmpg7a6zA/file6adc6e7d579f.genome . PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/ (C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to data/technoTeens_QC_norel.log. Options in effect: --bfile data/technoTeens_QC --make-bed --out data/technoTeens_QC_norel --remove /tmp/Rtmpg7a6zA/file6adc4e65cf65

257658 MB RAM detected; reserving 128829 MB for main workspace. Allocated 96621 MB successfully, after larger attempt(s) failed. 320578 variants loaded from .bim file. 422 people (0 males, 0 females, 422 ambiguous) loaded from .fam. Ambiguous sex IDs written to data/technoTeens_QC_norel.nosex . --remove: 195 people remaining. Warning: At least 17736 duplicate IDs in --remove file. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 195 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.997835. 320578 variants and 195 people pass filters and QC. Note: No phenotypes present. --make-bed to data/technoTeens_QC_norel.bed + data/technoTeens_QC_norel.bim + data/technoTeens_QC_norel.fam ... done.

system.time( technoQC.rds <- snp_readBed(technoQC2.bed, "technoQC") ) user system elapsed 1.109 0.054 1.161

str(technoTeens, max.level = 2) List of 3 $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 10 fields ..and 22 methods, of which 8 are possibly relevant: .. add_columns, as.FBM, copy#envRefClass, initialize, .. initialize#FBM, save, show#envRefClass, show#FBM $ fam :'data.frame': 195 obs. of 6 variables: ..$ family.ID : int [1:195] 1 3 4 6 8 11 12 13 17 21 ... ..$ sample.ID : chr [1:195] "202528140036_R01C01" "202528140036_R02C01" "202528140036_R02C02" "202528140036_R03C02" ... ..$ paternal.ID: int [1:195] 0 0 0 0 0 0 0 0 0 0 ... ..$ maternal.ID: int [1:195] 0 0 0 0 0 0 0 0 0 0 ... ..$ sex : int [1:195] 0 0 0 0 0 0 0 0 0 0 ... ..$ affection : int [1:195] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ... $ map :'data.frame': 320578 obs. of 6 variables: ..$ chromosome : int [1:320578] 1 1 1 1 1 1 1 1 1 1 ... ..$ marker.ID : chr [1:320578] "GSA-rs114420996" "rs3131972" "rs12127425" "rs4970383" ... ..$ genetic.dist: num [1:320578] 0 0 0 0 0 0 0 0 0 0 ... ..$ physical.pos: int [1:320578] 58814 752721 794332 838555 840753 846808 853954 854250 861808 866893 ... ..$ allele1 : chr [1:320578] "A" "G" "A" "A" ... ..$ allele2 : chr [1:320578] "G" "A" "G" "C" ...

attr(*, "class")= chr "bigSNP"

G <- technoTeens$genotypes big_counts(G, ind.col = 1:10) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] 0 0 31 2 23 38 27 26 14 40 8 1 55 91 31 87 99 89 89 87 97 51 2 133 70 162 83 58 79 80 94 58 136
7 3 0 2 0 0 0 0 0 0

system.time( infos <- snp_fastImpute(G, technoTeens$map$chromosome, ncores = NCORES) ) user system elapsed 0.791 1.128 185.524

plot(subset(infos, pNA > 0.001), pch = 19, cex = 0.5) Error in subset.default(infos, pNA > 0.001) : object 'pNA' not found``

privefl commented 6 years ago

Can you please put the output of str(infos)?

I think you're using an old example. I slightly modified the output of this function so that it can start wherever it stopped, if you need to stop it or your computer has a problem during imputation.

Please refer to the new documentation. You need one more line of code I think.

moldach commented 6 years ago

Do you mean this for the new documentation?

I think something is going wrong even before the imputation step, but here is str(infos) anyways:


str(infos)
Reference class 'FBM' [package "bigstatsr"] with 9 fields
 $ extptr     :<externalptr> 
 $ nrow       : num 2
 $ ncol       : int 320578
 $ type       : Named int 8
  ..- attr(*, "names")= chr "double"
 $ backingfile: chr "/exports/meaney.lab/moldach/technoTeens_bigsnpr/data/technoTeens_QC-infos-impute.bk"
 $ is_saved   : logi TRUE
 $ address    :<externalptr> 
 $ bk         : chr "/exports/meaney.lab/moldach/technoTeens_bigsnpr/data/technoTeens_QC-infos-impute.bk"
 $ rds        : chr "/exports/meaney.lab/moldach/technoTeens_bigsnpr/data/technoTeens_QC-infos-impute.rds"
 and 18 methods, of which 4 are  possibly relevant:
   add_columns, initialize, save, show#envRefClass

If I look at that documentation compared to what I have above the stuff that was missing we the following:

CHR <- technoTeens$map$chromosome
POS <- technoTeens$map$physical.pos

I get this error at snp_clumping() part of the PCA step:

# Half of the cores you have on your computer
NCORES <- nb_cores()
# Exclude Long-Range Linkage Disequilibrium Regions of the human genome
# based on an online table. 
ind.excl <- snp_indLRLDR(infos.chr = CHR, infos.pos = POS)
# Use clumping (on the MAF) to keep SNPs weakly correlated with each other.
# See https://privefl.github.io/bigsnpr/articles/pruning-vs-clumping.html
# to know why I prefer using clumping than standard pruning.
ind.keep <- snp_clumping(G, infos.chr = CHR,
                         exclude = ind.excl,
                         ncores = NCORES)
`Error: You can't have missing values in 'G'`

Looking back at G:

# Check some counts for the 10 first SNPs
big_counts(G, ind.col = 1:10)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
0       0   76    4   35   71   34  110   22   62    93
1      83  164   67  175  210  165  197  164  122   154
2     326  177  351  210  141  223  115  236  238   175
<NA>   13    5    0    2    0    0    0    0    0     0

privefl commented 6 years ago

Nop, I was talking about the documentation of the function.

You can get it using ?snp_fastImpute. Look at the Examples section.

moldach commented 6 years ago

So I followed the additional steps from ?snp_fastImpute, but I'm worried about how my plot looks compared to the one you gave in the Vignette.

plink <- download_plink()
NCORES <- nb_cores()  # We have 23 cores

technoQC.bed <- snp_plinkQC(
prefix.in = "data/technoTeens",
plink.path = plink,
file.type = "--file", # ped/map
geno = 0.05,
mind = 0.05,
maf = 0.05,
hwe = 1e-10,
autosome.only = TRUE
)

technoQC2.bed <- snp_plinkIBDQC(
bedfile.in = technoQC.bed,
plink.path = plink,
ncores = NCORES
)

technoQC.rds <- snp_readBed(technoQC.bed)

technoTeens <- snp_attach(technoQC.rds)
str(technoTeens, max.level = 2, strict.width = "cut")

G <- technoTeens$genotypes
CHR <- technoTeens$map$chromosome

G2 <- snp_fastImpute(G, CHR)
G2[, 1:5]

big_counts(G, ind.col = 1:10)
fake$genotypes$code256 <- bigsnpr:::CODE_IMPUTE_PRED
fake <- snp_save(fake)
big_counts(fake$genotypes)[4, ]

technoTeens$genotypes$code256 <- bigsnpr:::CODE_IMPUTE_PRED
technoTeens <- snp_save(technoTeens)
big_counts(technoTeens$genotypes)[4, ]

pvals <- c(0.01, 0.005, 0.002, 0.001); colvals <- 2:5
df <- data.frame(pNA = G2[1, ], pError = G2[2, ])

library(ggplot2)
Reduce(function(p, i) {
p + stat_function(fun = function(x) pvals[i] / x, color = colvals[i])
}, x = seq_along(pvals), init = ggplot(df, aes(pNA, pError))) +
geom_point() +
coord_cartesian(ylim = range(df$pError, na.rm = TRUE)) +
theme_bw(15)

This is the image I get:

bigsnpr_plot

Any idea why my plot would look like this?

str(G2)
Reference class 'FBM' [package "bigstatsr"] with 9 fields
 $ extptr     :<externalptr> 
 $ nrow       : num 2
 $ ncol       : int 320578
 $ type       : Named int 8
  ..- attr(*, "names")= chr "double"
 $ backingfile: chr "/exports/meaney.lab/moldach/technoTeens_bigsnpr/data/technoTeens_QC-infos-impute.bk"
 $ is_saved   : logi TRUE
 $ address    :<externalptr> 
 $ bk         : chr "/exports/meaney.lab/moldach/technoTeens_bigsnpr/data/technoTeens_QC-infos-impute.bk"
 $ rds        : chr "/exports/meaney.lab/moldach/technoTeens_bigsnpr/data/technoTeens_QC-infos-impute.rds"
 and 18 methods, of which 4 are  possibly relevant:
   add_columns, initialize, save, show#envRefClass

privefl commented 6 years ago

Seems fine to me. The only problem here is that your sample size is too small.

This is why the plot is so regular.

Also, because of this small sample size, I'm not sure if you can be confident in the estimated pError which is based on 20% of your non-missing data by default (which is only a few samples in your case).

privefl / bigsnpr

Error in subset.default(infos, pNA > 0.001): object 'pNA' not found #10