Princomp fails on real-world complexity dataset

ciropom commented 2 years ago

Hello, when the number of expression profiles grows, we have trouble running remote PCA. For instance, take the liver.toxicity dataset from mixOmics (you can export it to CSV with the following code and then import it into Opal):

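library(mixOmics)
library(data.table)  # provides fwrite()

data(liver.toxicity)
X <- liver.toxicity$gene
dim(X)
# prepend a Patient id column, then write everything to CSV
fwrite(cbind(data.frame(Patient = seq_len(nrow(X))), X), file = "mixOmics.liver.toxicity.csv", sep = ",")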

and then run the remote principal component analysis:

library(dsSwissKnifeClient) 
library(dsBaseClient) 

opals <- datashield.login(logins=data.frame(server='DEMO',url='http://127.0.0.1:8080',user='administrator',password='password',table='mixOmics.mixOmics.liver.toxicity'))

datashield.assign(opals, 'D', 'mixOmics.mixOmics.liver.toxicity')

remote_pca <- dssPrincomp('D', async=FALSE, datasources=opals)

datashield.errors()

After a long time, we always get an error like the one below:

  Aggregated (partColMeans(D, FALSE)) [==================================================] 100% / 2s
  Aggregated (partCov(D, "Wy0wLjA0NDksLTAuMDA3OSwtMC4wMDk3LC0wLjAwMTIsLTAuMDMyOSwtMC4wMTc0LDAuMDA...
$DEMO
[1] "[Client error: (400) Bad Request]"

IulianD commented 2 years ago

Hello, you should be able to find more information about the error in your rserver log (I'm not sure about your specific installation, but it could be here: /var/lib/rserver/logs/Rserve.log). Could you paste it here, please?

Best, Iulian

ciropom commented 2 years ago

There is nothing related in Rserve.log:

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(Rserve) ; Rserve(args='--vanilla --RS-workdir /opt/rock-home/work/R --RS-conf /opt/rock-home/conf/Rserv.conf')
Starting Rserve:
 /opt/R/4.1.2/lib/R/bin/R CMD /storage/R/x86_64-pc-linux-gnu-library/4.1/Rserve/libs//Rserve --vanilla --RS-workdir /opt/rock-home/work/R --RS-conf /opt/rock-home/conf/Rserv.conf 

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Rserv started in daemon mode.
> 
> 
Loading required package: parallel
Loading required package: parallel
Loading required package: unixtools
Loading required package: resourcer
Loading required package: R6
Loading required package: httr
Registering LocalFileResourceGetter...
Registering HttpFileResourceGetter...
Registering ScpFileResourceGetter...
Registering GridFsFileResourceGetter...
Registering OpalFileResourceGetter...
Registering MariaDBResourceConnector...
Registering PostgresResourceConnector...
Registering SparkResourceConnector...
Registering PrestoResourceConnector...
Registering TidyFileResourceResolver...
Registering ShellResourceResolver...
Registering SshResourceResolver...
Registering RDataFileResourceResolver...
Registering RDSFileResourceResolver...
Registering SQLResourceResolver...
Registering NoSQLResourceResolver...
Loading required package: parallel
Loading required package: unixtools
Loading required package: unixtools
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: parallel
Loading required package: parallel
Loading required package: unixtools
Loading required package: resourcer
Loading required package: R6
Loading required package: httr
Registering LocalFileResourceGetter...
Registering HttpFileResourceGetter...
Registering ScpFileResourceGetter...
Registering GridFsFileResourceGetter...
Registering OpalFileResourceGetter...
Registering MariaDBResourceConnector...
Registering PostgresResourceConnector...
Registering SparkResourceConnector...
Registering PrestoResourceConnector...
Registering TidyFileResourceResolver...
Registering ShellResourceResolver...
Registering SshResourceResolver...
Registering RDataFileResourceResolver...
Registering RDSFileResourceResolver...
Registering SQLResourceResolver...
Registering NoSQLResourceResolver...
Loading required package: parallel
Loading required package: unixtools
Loading required package: unixtools
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: parallel
Loading required package: parallel
Loading required package: unixtools
Loading required package: resourcer
Loading required package: R6
Loading required package: httr
Registering LocalFileResourceGetter...
Registering HttpFileResourceGetter...
Registering ScpFileResourceGetter...
Registering GridFsFileResourceGetter...
Registering OpalFileResourceGetter...
Registering MariaDBResourceConnector...
Registering PostgresResourceConnector...
Registering SparkResourceConnector...
Registering PrestoResourceConnector...
Registering TidyFileResourceResolver...
Registering ShellResourceResolver...
Registering SshResourceResolver...
Registering RDataFileResourceResolver...
Registering RDSFileResourceResolver...
Registering SQLResourceResolver...
Registering NoSQLResourceResolver...
Loading required package: parallel
Loading required package: unixtools
Loading required package: unixtools
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: parallel
Loading required package: unixtools
Loading required package: readr
Loading required package: labelled
Loading required package: parallel
Loading required package: unixtools
Loading required package: parallel
Loading required package: unixtools
Loading required package: readr
Loading required package: labelled
Loading required package: parallel
Loading required package: unixtools
Loading required package: readr
Loading required package: labelled
Loading required package: parallel
Loading required package: unixtools

Today the error is different, though:

  Aggregated (partCov(Variables, "Wy0wLjA0NDksLTAuMDA3OSwtMC4wMDk3LC0wLjAwMTIsLTAuMDMyOSwtMC4wMTc...
  Assigned expr. (Variables.scores <- pcaScores(Variables, "W1siLTAuMDEwMjQrMGkiLCIwLjAxNjExKzBpI...
Warning: Error in value[[3L]]: C stack usage  28231434 is too close to the limit

IulianD commented 2 years ago

Hello,

This looks like an R memory limitation. I have a few suggestions, in increasing order of probability of success:

- I committed a small code change in dsSwissKnife that might (or might not) help. Please pull and install the latest version and give it a try (I'd actually be quite surprised if this worked, but ...).
- I think the problem comes from the t() function (transpose). Maybe it's implemented recursively in C. If this is the case, you could try to increase the stack limit on the server. Run this before running your code: dssSetOption(list(expressions = 500000))
- If everything else fails, you can try a workaround: connect multiple times to your server, load a chunk of your data in each session, and rerun your code (with type = 'combine').

I hope at least some of the above will help. Please let me know.
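For reference, base R exposes two separate limits that are easy to mix up here: the `expressions` option (nesting depth of R-level expressions) and the C stack reported by `Cstack_info()`. The snippet below is only a sketch for inspecting both in a plain local R session. It does not go through DataSHIELD, and whether raising `expressions` has any effect on a "C stack usage" error is an assumption that this check does not settle.

```r
# Nesting limit for R-level expressions; the documented default is 5000.
expr_limit <- getOption("expressions")
print(expr_limit)

# C stack information for this R process: total size and current usage
# (in bytes), growth direction, and current evaluation depth.
stack_info <- Cstack_info()
print(stack_info)
```

On most Unix-like systems the C stack size is inherited from the shell that started R (for example via `ulimit -s`), so it would normally have to be raised where the server process is launched rather than from inside the session.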

ciropom commented 2 years ago

Hello, and thank you for your help. The workaround is not going to work, because the problem is not the number of rows (pretty limited, 64) but the number of columns (above 3000). Essentially, the princomp function (and, I believe, the other functions too) slows down a lot as the number of columns in the table grows.

I'll try the other things and let you know. Danilo

ciropom commented 2 years ago

After upgrading and adding the option, I get the same error:

opals <- datashield.login(logins=data.frame(server='DEMO',url='http://127.0.0.1:8080',user='administrator',password='password',table='mixOmics.mixOmics.liver.toxicity'))
dssSetOption(list(expressions = 500000), datasources=opals)

datashield.assign(opals, 'D', 'mixOmics.mixOmics.liver.toxicity')

remote_pca <- dssPrincomp('D', async=FALSE, datasources=opals)
  Aggregated (partColMeans(D, FALSE)) [==================================================] 100% / 0s
  Aggregated (partCov(D, "Wy0wLjA0NDksLTAuMDA3OSwtMC4wMDk3LC0wLjAwMTIsLTAuMDMyOSwtMC4wMTc0LDAuMDA...
  Assigned expr. (D_scores <- pcaScores(D, "W1siMC4wMDU5OTgrMGkiLCIwLjAwMDIyMDIrMGkiLCIwLjAwNTM4K...
Error: There are some DataSHIELD errors, list them with datashield.errors()
> datashield.errors()
$DEMO
[1] "[Client error: (400) Bad Request]"

ciropom commented 2 years ago

If you can try to reproduce the issue locally with the same dataset, it will help to understand whether it is a general problem or an issue with my setup. You will find the instructions in the first post. Thank you, Danilo

IulianD commented 2 years ago

I am going to try in the coming days, but it seems reasonable to me that we are hitting some R limitations. Out of curiosity, does plain princomp() work on the same dataset?

IulianD commented 2 years ago

And perhaps a better question: are you able to calculate the covariance matrix on the dataset?
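To get a feel for the size involved, the covariance step can be tried locally on synthetic data of roughly the shape mentioned in this thread. This is a sketch only: the 64 x 3000 random matrix is an assumption standing in for "64 rows, above 3000 columns", not the liver.toxicity data, and it runs in plain R rather than through dssPrincomp.

```r
set.seed(42)

# Synthetic stand-in: 64 rows (samples), 3000 columns (variables).
n_rows <- 64
n_cols <- 3000
X_sim <- matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols)

# The covariance matrix is n_cols x n_cols, regardless of the row count.
S <- cov(X_sim)
print(dim(S))

# Memory footprint: n_cols * n_cols doubles at 8 bytes each, plus a header.
print(format(object.size(S), units = "Mb"))
```

The point of the sketch is that the covariance matrix grows with the square of the number of columns, so a table with few rows can still produce a large object to compute and pass around.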

IulianD commented 2 years ago

Actually, I can see that princomp() doesn't work with more variables than rows in any case, and I use pretty much the same method. Granted, this is not the problem you're hitting. I mean, I think you are hitting an R limitation, but even if you weren't, the function would still fail. You could probably test this by taking a small subset of your data but keeping more columns than rows.
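The suggested test can be sketched locally with base R. The 10 x 20 random matrix below is a hypothetical stand-in for "a small subset with more columns than rows"; it is not taken from the dataset in this thread.

```r
set.seed(1)

# Small synthetic matrix with more columns (20) than rows (10).
X_small <- matrix(rnorm(10 * 20), nrow = 10, ncol = 20)

# princomp() stops with an error for this shape; capture the message
# instead of aborting the script.
princomp_result <- tryCatch(
  princomp(X_small),
  error = function(e) conditionMessage(e)
)
print(princomp_result)

# For comparison, base R's prcomp() is SVD-based and accepts this shape.
pca_svd <- prcomp(X_small)
print(length(pca_svd$sdev))
```

The prcomp() call is included only for comparison in plain R. Whether an SVD-based approach would be applicable to the remote dssPrincomp implementation is not something this thread addresses.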

ciropom commented 2 years ago

You are right, this dataset is not suitable for PCA. It looks like, for PCA to work, the number of instances should be significantly larger than the number of dimensions. This is not a dsSwissKnife issue.

sib-swiss / dsSwissKnifeClient

Princomp fails on real-world complexity dataset #3