Closed apfuentes closed 6 years ago
Hi, Some info that could be of interest:
# Read genpop file, drop first column (with pop names).
mydata <- fread("./GENPOP_files/11pop.minDP20.genpop", header=TRUE,sep="\t",drop=c(1),stringsAsFactors=FALSE)
totalColSums <- colSums(mydata)
sum(totalColSums == 0) [1] 0
2. The Genpop object is created successfully:
/// GENPOP OBJECT /////////
// 11 populations; 7,654,945 loci; 14,949,257 alleles; size: 4.6 Gb
// Basic content @tab: 11 x 14949257 matrix of allele counts @loc.n.all: number of alleles per locus (range: 1-2) @loc.fac: locus factor for the 14949257 columns of @tab @all.names: list of allele names for each locus @ploidy: ploidy of each individual (range: 2-2) @type: codom @call: .local(.Object = .Object, tab = ..1)
// Optional content
In the first post I said I have 14 million SNPs, it is actually 7.5 million biallelic SNPs (that in genpop format it equals to 7.5 million x 2 alleles ~ 14 million columns).
In this case, I am not sure a genpop object is useful, especially for a correspondence analysis. I would recommend running the analysis directly on mydata
. It may help alleviate the issue, even though the dataset remains massive.. let us know if it works.
Do you have frequencies or counts? As of adegenet 2.0, gene pop objects take count data. Given the error message:
In sum(c(0L, 0L, 0L, 5L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 44L, 43L, 47L, :
integer overflow - use sum(as.numeric(.))
Execution halted
I would suggest that you convert the matrix to numeric (instead of integer) beforehand:
mydata[] <- as.numeric(mydata) # the empty square brackets are important
dudi.coa(mydata, scannf = FALSE, nf = 3)
Please do let us know which solution works :)
Zhian N. Kamvar, Ph. D. Postdoctoral Researcher (Everhart Lab) Department of Plant Pathology University of Nebraska-Lincoln ORCID: 0000-0003-1458-7108
On Feb 6, 2018, at 10:08 , Thibaut Jombart notifications@github.com wrote:
In this case, I am not sure a genpop object is useful, especially for a correspondence analysis. I would recommend running the analysis directly on mydata. It may help alleviate the issue, even though the dataset remains massive.. let us know if it works.
ā You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/thibautjombart/adegenet/issues/222#issuecomment-363471586, or mute the thread https://github.com/notifications/unsubscribe-auth/ADeIllJ2GUk7qhC77vSCEZDA1BWiSgEqks5tSHjhgaJpZM4R5nss.
Thanks @thibautjombart and @zkamvar for your prompt response!
Yes, I do have allele counts per SNP (that are equivalent to allele frequencies as I have Pool-seq data). The structure of the dataset is (following recommendations given in the basic tutorial, section 4.7, pg. 29):
marker1.allele1 marker1.allele2 marker2.allele1 marker2.allele2
Pop1 0 20 12 40
Pop2 2 30 7 40
Pop3 0 10 0 33
Pop4 0 50 0 20
.
.
.
Pop11
I tried both suggestions, however the problem persists:
># Import dataset.
>mydata <- fread("./path/11pops.minDP20.genpop", header=TRUE,sep="\t",drop=c(1),stringsAsFactors=FALSE)
># Conduct CoA.
>ca1 <- dudi.coa(mydata,scannf=FALSE,nf=3)
Error in if ((N <- sum(df)) == 0) stop("all frequencies are zero") : missing value where TRUE/FALSE needed In addition: Warning message: In sum(c(0L, 0L, 0L, 5L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 44L, 43L, 47L, : integer overflow - use sum(as.numeric(.))
- *Option 2: Transform the imported dataframe into numeric before calling dudi.coa().*
I got this error message:
Import dataset.
mydata <- fread("./path/11pops.minDP20.genpop", header=TRUE,sep="\t",drop=c(1),stringsAsFactors=FALSE)
Transform dataframe to numeric.
mydata[] <- as.numeric(mydata) Error: (list) object cannot be coerced to type 'double'
Any other ideas on how to solve this issue? Thanks!! Angela
Ah, I forgot that you have a data frame going into there. A data frame is basically a rectangular list, which is why you get Error: (list) object cannot be coerced to type 'double'
. You should convert it to a matrix beforehand:
mydata[] <- as.numeric(as.matrix(mydata))
Hi @zkamvar, no worries, thanks for taking the time to answer my question.
I tried what you suggested, it runs for a long time (~1h) then it crashes. I noticed 100% of the RAM was being used for this task, and my computer has 125 GB of RAM...
Is there any other way I could transform the dataframe to numeric?
I probably should have expected that. R likes to copy things before it manipulates them and numeric values are twice the memory size of integer values.
Given that you're using data.table::fread()
, I see that there's an option to convert the data to numeric already: https://www.rdocumentation.org/packages/data.table/versions/1.10.4-2/topics/fread
you could try adding integer64 = "numeric"
to your data.table call (though be aware that I haven't estimated how much memory you'll need for that). Hey, it could work :)
Hey @zkamvar,
Using integer64 = "numeric"
unfortunately did not fix the problem, as I obtain the very same error message:
> library(data.table)
> mydata <- fread("./path/11pops.minDP20.genpop", header = TRUE, sep = "\t", drop = c(1), stringsAsFactors = FALSE, integer64 = "numeric")
Read 11 rows and 14949257 (of 14949258) columns from 0.781 GB file in 00:01:10
> ca1 <- dudi.coa(mydata, scannf = FALSE, nf = 3)
Error in if ((N <- sum(df)) == 0) stop("all frequencies are zero") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In sum(c(0L, 0L, 0L, 5L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 44L, 43L, 47L, :
integer overflow - use sum(as.numeric(.))
The memory usage was normal. I wonder if the integer64 = "numeric"
thing actually works because the error message complains again about integer overflow...but not sure how to test it... I explored the file using:
> typeof(mydata)
[1] "list"
> mode(mydata)
[1] "list"
> class(mydata)
[1] "data.table" "data.frame"
> ncol(mydata)
[1] 14949257
> nrow(mydata)
[1] 11
> typeof(mydata[,1])
[1] "list"
Also, I wonder, wouldn't be the problem related to memory or Unix ulimit
? When I first run fread
in the whole dataset I got an error message related to memory overflow. After many hours searching for solutions I found that increasing the stack size of Linux could fix it, and it did. These are the current ulimit
s in my machine:
$ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) unlimited
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes unlimited
-n: file descriptors 80000
-l: locked-in-memory size (kbytes) 64
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 1030562
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
Thanks for your kind help!
Hi, The error persist. How could it be fixed?
Hello @apfuentes,
Okay, so modifying the argument in data.table didn't work. The way you can check if the columns of your data frame are numeric is by running sapply(mydata, is.integer)
, but I get the feeling it will tell you that you have all integers anyways. So, I have two options for you:
mydata <- data.frame(lapply(mydata, as.numeric))
This will take up more memory on your machine, but it should work.
read_tsv()
or read_table()
from the readr
package: http://readr.tidyverse.org/reference/read_table.htmlread_tsv("./path/11pops.minDP20.genpop", header = TRUE, skip = 1, col_types = cols(.default = "double"))
Since it appears that you only have numeric data anyways, this will force all columns in your data to be represented as numeric.
Let me know if this helps, Zhian
Hi @zkamvar, Thanks very much for your recommendations. Here an update. I tried both approaches, this is the code used in each case:
Option 1:
library(data.table)
path_to_file <- "path/to/dataset"
mydata <- fread(path_to_file, header=TRUE, sep="\t", drop=c(1), stringsAsFactors=FALSE) # Load data
mydata <- data.frame(lapply(mydata, as.numeric)) # Transform data.table into data.frame and integers to numeric
ca1 <- dudi.coa(mydata,scannf=FALSE,nf=3) # CoA
Option 2:
library(readr)
path_to_file <- "path/to/dataset"
mydata <- read_tsv(path_to_file, col_names = TRUE, col_types = cols(.default = "double", X1 = col_skip()), progress = show_progress()) # Load data
ca1 <- dudi.coa(mydata,scannf=FALSE,nf=3) # CoA
Both scripts have been running for 24 h. Option 1 is using 27 GB of RAM and it is now at the ca1
step, this suggests the data transformation worked and the CoA is running. In contrast, Option 2 is using 7 GB and it is stuck loading the data...Is the script for Option 2 correct? 24h to load the data is too much...considering that loading the data with fread()
takes 2 min...
I have been exploring the possibility of using an alternative R package to dudi.coa()
to do a CoA. So far, the only package that has worked with the whole dataset of 7M SNPs is CA()
implemented in FactoMineR, website. It requires 25GB of RAM and it takes 8h to finish. I guess both CoA would be equivalent, but I am not sure...thus I wanted to ask, is there any way to confirm the equivalence? Or this could just be assumed?
Pd. I used exactly the same dataset and format for both packages:
marker1.allele1 marker1.allele2 marker2.allele1 marker2.allele2 ...
Pop1 0 20 12 40
Pop2 2 30 7 40
Pop3 0 10 0 33
Pop4 0 50 0 20
.
.
.
Pop11
Thanks for all your help!
Hi @zkamvar, I just wanted to confirm that Option 1 worked and it took 39h to finish the CoA using dudi.coa(). I compared the eigenvalues obtained with both packages and they are the same.
Thanks for all your help!
Hi @apfuentes,
This is great to hear! Sorry for not responding earlier, there are a lot of things up in the air right now.
Thank you for not only posting the details of your code, but also for posting your solutions. They really helped diagnose the issue š
Hi, Awhile ago I reported the issue I was having with dudi.coa() to Stephane Dray, one of the developers of the ade4 package. He has fixed the bug and the fix is now in github. I thought should let you know it and maybe adegenet could incorporate the fix.
Best,
Hi @apfuentes, thank you for letting us know! Luckily, if anyone has the latest version of ade4 installed, the functionality will be automatically incorporated.
Dear adegenet developers,
I created a genpop file based on an allele counts matrix (allele frequencies per population) to conduct a Correspondence Analysis (CoA) in adegenet (following the basic tutorial). However, I am obtaining this error message when the CoA is being conducted with the function dudi.coa():
I found in the internet that this could be caused when summarising large integers and it depends on how the package deals with integer vectors. To expand larger integers some people recommend using the
Rmpfr package
, but I don't know how to make adegenet use that package when dudi.coa() function is running. Thus I was wondering if you could please indicate me how this could be modified so the problem is hopefully fixed.My dataset consists on allele frequencies of 14 million SNPs for 11 populations. This is the R code I used:
Thanks for your kind help!