apfuentes commented 6 years ago

Dear adegenet developers,

I created a genpop file based on an allele counts matrix (allele frequencies per population) to conduct a Correspondence Analysis (CoA) in adegenet (following the basic tutorial). However, I am obtaining this error message when the CoA is being conducted with the function dudi.coa():

> ca1 <- dudi.coa(toto,scannf=FALSE,nf=3)
Error in if ((N <- sum(df)) == 0) stop("all frequencies are zero") :
  missing value where TRUE/FALSE needed
Calls: dudi.coa
In addition: Warning message:
In sum(c(0L, 0L, 0L, 5L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 44L, 43L, 47L,  :
  integer overflow - use sum(as.numeric(.))
Execution halted

I found in the internet that this could be caused when summarising large integers and it depends on how the package deals with integer vectors. To expand larger integers some people recommend using the Rmpfr package, but I don't know how to make adegenet use that package when dudi.coa() function is running. Thus I was wondering if you could please indicate me how this could be modified so the problem is hopefully fixed.

My dataset consists on allele frequencies of 14 million SNPs for 11 populations. This is the R code I used:

# ----------------Correspondence Analysis with adegenet --------------------- #
# Clean environment space.
rm(list=ls())

# Load required libraries.
library(adegenet)
library(data.table)

# Set working directory.
setwd("~/path/SNP_data/")

# Read genpop file, drop first column
mydata <- fread("./GENPOP_files/11pops.minDP20.genpop",
              header=TRUE,sep="\t",drop=c(1),stringsAsFactors=FALSE)

# Create a genpop object based on the input data.
toto <- new("genpop", mydata)

# Performing a Correspondance Analysis on genpop objects
# Being contingency tables, the @tab slot in genpop objects can be submitted to a
# Correspondance Analysis (CoA) to seek a typology of populations.

ca1 <- dudi.coa(toto,scannf=FALSE,nf=3)

pdf("CA_eigenvalues_minDP20.pdf")

barplot(ca1$eig,main="Correspondance Analysis eigenvalues (minDP20)",
        col=heat.colors(length(ca1$eig)))

dev.off()

# Now we display the resulting typology using a basic scatterplot:
pdf("CA_scatterplot_minDP20_Axes_1-2.pdf")

s.label(ca1$li, sub="CA 1-2",csub=2)
add.scatter.eig(ca1$eig,nf=3,xax=1,yax=2,posi="bottomright")

dev.off()

# The same graph is derived for the axes 2-3:
pdf("CA_scatterplot_minDP20_Axes_2-3.pdf")
s.label(ca1$li,xax=2,yax=3,lab=popNames(toto),sub="CA 1-3",csub=2)
add.scatter.eig(ca1$eig,nf=3,xax=2,yax=3,posi="topleft")

# Note that as an alternative, wordcloud can be used to avoid overlaps in labels:
library(wordcloud)
set.seed(1)
s.label(ca1$li*1.2, sub="CA 1-2",csub=2, clab=0, cpoint="")
textplot(ca1$li[,1], ca1$li[,2], words=popNames(toto),
         cex=1.4, new=FALSE, xpd=TRUE)
add.scatter.eig(ca1$eig,nf=3,xax=1,yax=2,posi="bottomright")

dev.off()

Thanks for your kind help!

apfuentes commented 6 years ago

Hi, Some info that could be of interest:

The total allele frequency (total sum of read counts per column) of all populations is not zero. I verified this by using:


# Read genpop file, drop first column (with pop names).
mydata <- fread("./GENPOP_files/11pop.minDP20.genpop", header=TRUE,sep="\t",drop=c(1),stringsAsFactors=FALSE)

Obtain total sum of allele counts per column and store it in a vector.

totalColSums <- colSums(mydata)

Verify if any total sum of a column is equal to zero.

sum(totalColSums == 0) [1] 0


2. The Genpop object is created successfully:

/// GENPOP OBJECT /////////

// 11 populations; 7,654,945 loci; 14,949,257 alleles; size: 4.6 Gb

// Basic content @tab: 11 x 14949257 matrix of allele counts @loc.n.all: number of alleles per locus (range: 1-2) @loc.fac: locus factor for the 14949257 columns of @tab @all.names: list of allele names for each locus @ploidy: ploidy of each individual (range: 2-2) @type: codom @call: .local(.Object = .Object, tab = ..1)

// Optional content

empty -


In the first post I said I have 14 million SNPs, it is actually 7.5 million biallelic SNPs (that in genpop format it equals to 7.5 million x 2 alleles ~ 14 million columns).

thibautjombart commented 6 years ago

In this case, I am not sure a genpop object is useful, especially for a correspondence analysis. I would recommend running the analysis directly on mydata. It may help alleviate the issue, even though the dataset remains massive.. let us know if it works.

zkamvar commented 6 years ago

Do you have frequencies or counts? As of adegenet 2.0, gene pop objects take count data. Given the error message:

In sum(c(0L, 0L, 0L, 5L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 44L, 43L, 47L,  :
  integer overflow - use sum(as.numeric(.))
Execution halted

I would suggest that you convert the matrix to numeric (instead of integer) beforehand:

mydata[] <- as.numeric(mydata) # the empty square brackets are important
dudi.coa(mydata, scannf = FALSE, nf = 3)

Please do let us know which solution works :)

Zhian N. Kamvar, Ph. D. Postdoctoral Researcher (Everhart Lab) Department of Plant Pathology University of Nebraska-Lincoln ORCID: 0000-0003-1458-7108

On Feb 6, 2018, at 10:08 , Thibaut Jombart notifications@github.com wrote:

In this case, I am not sure a genpop object is useful, especially for a correspondence analysis. I would recommend running the analysis directly on mydata. It may help alleviate the issue, even though the dataset remains massive.. let us know if it works.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/thibautjombart/adegenet/issues/222#issuecomment-363471586, or mute the thread https://github.com/notifications/unsubscribe-auth/ADeIllJ2GUk7qhC77vSCEZDA1BWiSgEqks5tSHjhgaJpZM4R5nss.

apfuentes commented 6 years ago

Thanks @thibautjombart and @zkamvar for your prompt response!

Yes, I do have allele counts per SNP (that are equivalent to allele frequencies as I have Pool-seq data). The structure of the dataset is (following recommendations given in the basic tutorial, section 4.7, pg. 29):

    marker1.allele1 marker1.allele2 marker2.allele1 marker2.allele2
Pop1    0   20  12  40
Pop2    2   30  7   40
Pop3    0   10  0   33
Pop4    0   50  0   20
.
.
.
Pop11

I tried both suggestions, however the problem persists:

Option 1: Skip creation of genpop file and use the imported dataframe directly into dudi.coa(). I obtained pretty much the same error message:


># Import dataset.
>mydata <- fread("./path/11pops.minDP20.genpop", header=TRUE,sep="\t",drop=c(1),stringsAsFactors=FALSE)       
># Conduct CoA.
>ca1 <- dudi.coa(mydata,scannf=FALSE,nf=3)

Error in if ((N <- sum(df)) == 0) stop("all frequencies are zero") : missing value where TRUE/FALSE needed In addition: Warning message: In sum(c(0L, 0L, 0L, 5L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 44L, 43L, 47L, : integer overflow - use sum(as.numeric(.))

- *Option 2: Transform the imported dataframe into numeric before calling dudi.coa().*
I got this error message:

Import dataset.

mydata <- fread("./path/11pops.minDP20.genpop", header=TRUE,sep="\t",drop=c(1),stringsAsFactors=FALSE)

Transform dataframe to numeric.

mydata[] <- as.numeric(mydata) Error: (list) object cannot be coerced to type 'double'
Any other ideas on how to solve this issue?
Thanks!!
Angela

zkamvar commented 6 years ago

Ah, I forgot that you have a data frame going into there. A data frame is basically a rectangular list, which is why you get Error: (list) object cannot be coerced to type 'double'. You should convert it to a matrix beforehand:

mydata[] <- as.numeric(as.matrix(mydata))

apfuentes commented 6 years ago

Hi @zkamvar, no worries, thanks for taking the time to answer my question.

I tried what you suggested, it runs for a long time (~1h) then it crashes. I noticed 100% of the RAM was being used for this task, and my computer has 125 GB of RAM...

Is there any other way I could transform the dataframe to numeric?

zkamvar commented 6 years ago

I probably should have expected that. R likes to copy things before it manipulates them and numeric values are twice the memory size of integer values.

Given that you're using data.table::fread(), I see that there's an option to convert the data to numeric already: https://www.rdocumentation.org/packages/data.table/versions/1.10.4-2/topics/fread

you could try adding integer64 = "numeric" to your data.table call (though be aware that I haven't estimated how much memory you'll need for that). Hey, it could work :)

apfuentes commented 6 years ago

Hey @zkamvar, Using integer64 = "numeric" unfortunately did not fix the problem, as I obtain the very same error message:

> library(data.table)
> mydata <- fread("./path/11pops.minDP20.genpop", header = TRUE, sep = "\t", drop = c(1), stringsAsFactors = FALSE, integer64 = "numeric")
Read 11 rows and 14949257 (of 14949258) columns from 0.781 GB file in 00:01:10
> ca1 <- dudi.coa(mydata, scannf = FALSE, nf = 3)
Error in if ((N <- sum(df)) == 0) stop("all frequencies are zero") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In sum(c(0L, 0L, 0L, 5L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 44L, 43L, 47L,  :
  integer overflow - use sum(as.numeric(.))

The memory usage was normal. I wonder if the integer64 = "numeric" thing actually works because the error message complains again about integer overflow...but not sure how to test it... I explored the file using:

> typeof(mydata)
[1] "list"
> mode(mydata)
[1] "list"
> class(mydata)
[1] "data.table" "data.frame"
> ncol(mydata)
[1] 14949257
> nrow(mydata)
[1] 11
> typeof(mydata[,1])
[1] "list"

Also, I wonder, wouldn't be the problem related to memory or Unix ulimit? When I first run fread in the whole dataset I got an error message related to memory overflow. After many hours searching for solutions I found that increasing the stack size of Linux could fix it, and it did. These are the current ulimits in my machine:

$ ulimit -a                                     
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             unlimited
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       unlimited
-n: file descriptors                80000
-l: locked-in-memory size (kbytes)  64
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 1030562
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

Thanks for your kind help!

apfuentes commented 6 years ago

Hi, The error persist. How could it be fixed?

zkamvar commented 6 years ago

Hello @apfuentes,

Okay, so modifying the argument in data.table didn't work. The way you can check if the columns of your data frame are numeric is by running sapply(mydata, is.integer), but I get the feeling it will tell you that you have all integers anyways. So, I have two options for you:

With the data you have, convert all the columns to numeric values:

mydata <- data.frame(lapply(mydata, as.numeric))

This will take up more memory on your machine, but it should work.

Use read_tsv() or read_table() from the readr package: http://readr.tidyverse.org/reference/read_table.html

read_tsv("./path/11pops.minDP20.genpop", header = TRUE, skip = 1, col_types = cols(.default = "double"))

Since it appears that you only have numeric data anyways, this will force all columns in your data to be represented as numeric.

Let me know if this helps, Zhian

apfuentes commented 6 years ago

Hi @zkamvar, Thanks very much for your recommendations. Here an update. I tried both approaches, this is the code used in each case:

Option 1:

library(data.table)
path_to_file <- "path/to/dataset"
mydata <- fread(path_to_file, header=TRUE, sep="\t", drop=c(1), stringsAsFactors=FALSE)  # Load data
mydata <- data.frame(lapply(mydata, as.numeric))  # Transform data.table into data.frame and integers to numeric
ca1 <- dudi.coa(mydata,scannf=FALSE,nf=3)  # CoA

Option 2:
```
library(readr)
path_to_file <- "path/to/dataset"
mydata <- read_tsv(path_to_file, col_names = TRUE, col_types = cols(.default = "double", X1 = col_skip()), progress = show_progress())  # Load data
ca1 <- dudi.coa(mydata,scannf=FALSE,nf=3)  # CoA
```
Both scripts have been running for 24 h. Option 1 is using 27 GB of RAM and it is now at the ca1 step, this suggests the data transformation worked and the CoA is running. In contrast, Option 2 is using 7 GB and it is stuck loading the data...Is the script for Option 2 correct? 24h to load the data is too much...considering that loading the data with fread() takes 2 min...
I have been exploring the possibility of using an alternative R package to dudi.coa() to do a CoA. So far, the only package that has worked with the whole dataset of 7M SNPs is CA() implemented in FactoMineR, website. It requires 25GB of RAM and it takes 8h to finish. I guess both CoA would be equivalent, but I am not sure...thus I wanted to ask, is there any way to confirm the equivalence? Or this could just be assumed?

Pd. I used exactly the same dataset and format for both packages:

    marker1.allele1 marker1.allele2 marker2.allele1 marker2.allele2 ...
Pop1    0   20  12  40
Pop2    2   30  7   40
Pop3    0   10  0   33
Pop4    0   50  0   20
.
.
.
Pop11

Thanks for all your help!

apfuentes commented 6 years ago

Hi @zkamvar, I just wanted to confirm that Option 1 worked and it took 39h to finish the CoA using dudi.coa(). I compared the eigenvalues obtained with both packages and they are the same.

Thanks for all your help!

zkamvar commented 6 years ago

Hi @apfuentes,

This is great to hear! Sorry for not responding earlier, there are a lot of things up in the air right now.

Thank you for not only posting the details of your code, but also for posting your solutions. They really helped diagnose the issue 😃

apfuentes commented 6 years ago

Hi, Awhile ago I reported the issue I was having with dudi.coa() to Stephane Dray, one of the developers of the ade4 package. He has fixed the bug and the fix is now in github. I thought should let you know it and maybe adegenet could incorporate the fix.

Best,

zkamvar commented 6 years ago

Hi @apfuentes, thank you for letting us know! Luckily, if anyone has the latest version of ade4 installed, the functionality will be automatically incorporated.

thibautjombart / adegenet

ERROR when using adegenet to conduct a CoA based on allele counts #222

Obtain total sum of allele counts per column and store it in a vector.

Verify if any total sum of a column is equal to zero.

Import dataset.

Transform dataframe to numeric.