reading large file using bigmemory

bioinfonext commented 2 years ago

Hi,

I am trying to read a large file using bigmemory, but I am getting errors because the file's first two columns are non-numeric. I have deleted the second column, but I want to use the first column as row names.

Is there an option in bigmemory to use the first column as row names, and how can I avoid the warning messages below?

> library("bigmemory")
> library("biganalytics")
> data.matrix - read.big.matrix("methylation.txt",header=T,sep='\t')
Error in data.matrix - read.big.matrix("methylation.txt", header = T,  :
  non-numeric argument to binary operator
In addition: Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("methylation.txt", header = T, sep = "\t") :

Many thanks,

privefl commented 2 years ago

I don't think you can have rownames for a big.matrix. You should probably just store these somewhere else.

bioinfonext commented 2 years ago

Hi @privefl,

I need to correlate this matrix data with phenotypic data, which is why I want to use SampleID as row names. Can I assign row names after reading the matrix, using a separate list?

Someone suggested some solutions here, but I am not sure what they mean.

https://stackoverflow.com/questions/12576735/bigmemory-and-rownames-dimnames-of-matrix

Many thanks

privefl commented 2 years ago

Just use match() to get the row indices that correspond to the external SampleID.

(or the opposite, i.e. reorder the phenotypic data instead)
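
As a minimal sketch of the match() approach, assuming the sample IDs are kept in a separate character vector in the same order as the matrix rows. All names and values here (sample_ids, meth, pheno, the ages) are made up for illustration, and a plain matrix stands in for the big.matrix:

```r
# Hypothetical sample IDs, stored outside the numeric matrix,
# in the same order as its rows.
sample_ids <- c("S3", "S1", "S4", "S2")

# Stand-in for the big.matrix (a plain matrix here, for illustration only).
meth <- matrix(seq_len(8) / 10, nrow = 4,
               dimnames = list(NULL, c("cg02115394", "cg12480843")))

# Hypothetical phenotypic data, in a different sample order.
pheno <- data.frame(SampleID = c("S1", "S2", "S3", "S4"),
                    age = c(50, 61, 47, 58))

# For each phenotype row, find the matching row index in the matrix.
idx <- match(pheno$SampleID, sample_ids)
meth_aligned <- meth[idx, , drop = FALSE]

# Or the opposite: reorder the phenotypic data to follow the matrix rows.
pheno_aligned <- pheno[match(sample_ids, pheno$SampleID), ]
```

Either way, row i of the matrix and row i of the phenotypic data then refer to the same sample, without the matrix itself needing row names.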

bioinfonext commented 2 years ago

Hi, Thanks @privefl

We just have 2000 rows so we need these for further analysis. Many thanks,

bioinfonext commented 2 years ago

I have removed the first two non-numeric columns, but it still shows the same error:

> data.matrix - read.big.matrix("phylo.txt",header=T,sep='\t')
Error in data.matrix - read.big.matrix("phylo.txt", header = T, sep = "\t") :
  non-numeric argument to binary operator
In addition: Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("phylo.txt", header = T, sep = "\t") :
  Because type was not specified, we chose double based on the first line of data.

The file looks like this now, after removing the first two character columns; it has around 80000 columns and 2000 rows:
cg02115394 cg12480843
0.974035    0.718462
0.967383    0.765799
0.961012    0.84822
0.960447    0.722946
0.963181    0.939808
0.940292    0.878546

privefl commented 2 years ago

It would be a good idea to read only the first few rows (e.g. 5) with data.table::fread() to get an idea of the number and types of columns.

bioinfonext commented 2 years ago

I have removed all non-numeric columns and I am able to read 5 rows using fread, but bigmemory doesn't work here.

> mydt10 <- fread("phylo.num.txt", nrows = 5)
> dim(mydt10)
[1]      5 844488
> str(mydt10)
Classes ‘data.table’ and 'data.frame':  5 obs. of  844488 variables:
 $ cg14361672      : num  0.974 0.967 0.961 0.96 0.963
 $ cg12950382      : num  0.718 0.766 0.848 0.723 0.94
 $ cg02115394      : num  0.0337 0.0258 0.025 0.0317 0.0357
 $ cg12480843      : num  0.0182 0.0189 0.0137 0.0167 0.0151
 $ cg26724186      : num  0.98 0.977 0.982 0.982 0.978
 $ cg00617867      : num  0.96 0.979 0.98 0.977 0.977
 $ cg13773083      : num  0.313 0.246 0.253 0.234 0.372
 $ cg17236668      : num  0.974 0.975 0.975 0.979 0.978
 $ cg19607165      : num  0.0866 0.0966 0.0804 0.1162 0.0792
 $ cg08770523      : num  0.0243 0.0213 0.0203 0.0194 0.0197

privefl commented 2 years ago

table(sapply(mydt10, typeof))?

bioinfonext commented 2 years ago

> table(sapply(mydt10, typeof))

double
844488

privefl commented 2 years ago

Hum.. Maybe worth trying bigstatsr::big_read() (https://privefl.github.io/bigstatsr/articles/read-FBM-from-file.html).

bioinfonext commented 2 years ago

I am still getting errors, even with bigreadr:

> data2 <- big_fread2("phylo.num.txt", nb_parts = NULL, .transform = identity,.combine = cbind_df, skip = 0, select = NULL, progress = FALSE, part_size = 500 * 1024^2)
 *** caught segfault ***
address 0x7f5e51c63df7, cause 'memory not mapped'

Traceback:
 1: data.table::fread(input, ..., data.table = data.table, nThread = nThread)
 2: fread2(file, skip = skip, select = cols, ..., showProgress = FALSE)
 3: .transform(fread2(file, skip = skip, select = cols, ..., showProgress = FALSE))
 4: FUN(X[[i]], ...)
 5: lapply(split_cols, function(cols) {    part <- .transform(fread2(file, skip = skip, select = cols,         ..., showProgress = FALSE))    already_read <<- already_read + length(cols)    if (progress)         utils::setTxtProgressBar(pb, already_read)    part})
 6: big_fread2("phylo.num.txt", nb_parts = NULL, .transform = identity,     .combine = cbind_df, skip = 0, select = NULL, progress = FALSE,     part_size = 500 * 1024^2)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Many thanks,

bioinfonext commented 2 years ago

Is it possible to run a loop to read this file using fread in R?

Many thanks,

glm729 commented 1 year ago

@bioinfonext:

> data.matrix - read.big.matrix("phylo.txt",header=T,sep='\t')
Error in data.matrix - read.big.matrix("phylo.txt", header = T, sep = "\t") :
  non-numeric argument to binary operator

Is this meant to be:

data.matrix <- read.big.matrix("phylo.txt", header = TRUE, sep = "\t")
#           ^^

It looks like you had a typo, given your original error -- the assignment operator was missing the <.
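
To illustrate why the typo gives that particular message: data.matrix is also the name of a base R function, so `data.matrix - x` tries to subtract something from a function instead of assigning to a variable. A small sketch using a plain numeric matrix as a stand-in for the result of read.big.matrix() (the variable names `msg` and `m` are made up for this example):

```r
# With `-` instead of `<-`, R evaluates a subtraction whose left operand
# is the base function data.matrix(), which fails.
msg <- tryCatch(
  data.matrix - matrix(1:4, nrow = 2),
  error = function(e) conditionMessage(e)
)
print(msg)

# With `<-`, the value is assigned as intended. A name that does not
# shadow a base function (here `m`) avoids this kind of confusion.
m <- matrix(1:4, nrow = 2)
```

In an English locale, `msg` should contain the same "non-numeric argument to binary operator" text as in the error above.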

kaneplusplus / bigmemory

reading large file using bigmemory #110