maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

--useMtx is ignored if too.big == TRUE in runSeurat.R #262

Open RoganGrant opened 1 year ago

RoganGrant commented 1 year ago

First of all, thank you for this incredibly useful package. We get a lot of use out of it.

For large matrices (where too.big == TRUE), I've run into an issue where you can't turn off .mtx output, regardless of whether --useMtx is set. This is because the condition on the first line of this chunk in runSeurat.R always evaluates to TRUE:

if (use.mtx || too.big) {
        # we have to write the matrix to an mtx file
        matrixPath <- file.path(dir, paste(prefix, "matrix.mtx", sep=""))
        genesPath <- file.path(dir, paste(prefix, "features.tsv", sep=""))
        barcodesPath <- file.path(dir, paste(prefix, "barcodes.tsv", sep=""))
        message("Writing expression matrix to ", matrixPath)
        writeMM(counts, matrixPath)
        # easier to load if the genes file has at least two columns. Even though seurat objects
        # don't have yet explicit geneIds/geneSyms data, we just duplicate whatever the matrix has now
        write.table(as.data.frame(cbind(rownames(counts), rownames(counts))), file=genesPath, sep="\t", row.names=F, col.names=F, quote=F)
        write(colnames(counts), file = barcodesPath)
        message("Gzipping expression matrix")
        gzip(matrixPath)
        gzip(genesPath)
        gzip(barcodesPath)
  } else {
      # we can write the matrix as a tsv file
      gzPath <- file.path(dir, paste(prefix, "exprMatrix.tsv.gz", sep=""))
      if (too.big) {
          if (.Platform$OS.type=="windows")
              error("Cannot write very big matrices to a text file on Windows. Please use the --useMtx (R: use.mtx) option")
          writeSparseTsvChunks(counts, gzPath);
      } else {
          mat = as.matrix(counts)

Would it be possible to allow the user to force a tsv instead, for example by changing (use.mtx || too.big) to (use.mtx || (.Platform$OS.type=="windows" && too.big))? I ask largely because cbBuild consistently fails for me with .mtx files, and I can't figure out precisely how to configure the cellbrowser.conf to fix this issue.
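The difference between the current condition and the proposed one can be spelled out as a small truth table. The sketch below is in Python rather than R, and the function names current_format and proposed_format are hypothetical. They only mirror the boolean logic quoted above, not the rest of runSeurat.R.

```python
from itertools import product


def current_format(use_mtx: bool, too_big: bool, is_windows: bool) -> str:
    """Mirror of `if (use.mtx || too.big)` from the quoted chunk."""
    if use_mtx or too_big:
        return "mtx"
    # too_big is always False here, so the chunked tsv writer in the
    # else branch of the original code can never be reached.
    return "tsv"


def proposed_format(use_mtx: bool, too_big: bool, is_windows: bool) -> str:
    """Mirror of the proposed `use.mtx || (windows && too.big)`."""
    if use_mtx or (is_windows and too_big):
        return "mtx"
    return "tsv-chunks" if too_big else "tsv"


# Enumerate every combination of the three flags.
rows = [
    (flags, current_format(*flags), proposed_format(*flags))
    for flags in product([False, True], repeat=3)
]

print("use_mtx too_big windows  current  proposed")
for (use_mtx, too_big, is_windows), cur, prop in rows:
    print(f"{use_mtx!s:7} {too_big!s:7} {is_windows!s:7}  {cur:7}  {prop}")
```

The only rows where the two columns differ are the ones with use_mtx off, too_big on, and a non-Windows platform, which is the case described in this issue.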

Thank you!

RoganGrant commented 1 year ago

Realizing now that the --forceMtx flag does not take a text argument, and rather is true if specified, false if not. In any case, it would be great to have an equivalent --forceTSV flag

maximilianh commented 1 year ago

Hi Rogan, thanks for your feedback. This is intentional, but I'm interested in your opinion: my knowledge of R (as you can see from my R code) is somewhat limited.

Apparently you have a very big matrix, right?

The problem with big matrices in R is that if the number of elements exceeds the maximum of 2^31-1, R stops when the sparse matrix is converted to a normal (dense) matrix. Most of the elements are zero, so big matrices work as long as the number of non-zero values is low enough and the matrix is kept sparse. For that, writing as MTX is required (as far as I know; please correct me here, see below). I wrote writeSparseTsvChunks() to be able to write very big matrices as text, but then R can't read them back. That was the idea behind the move to .mtx.gz files everywhere, both for h5ad and Seurat objects: to make sure that R can always read the matrices.
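To make the size limit concrete, here is a short arithmetic sketch in Python. It takes the 2^31-1 element limit quoted above at face value. The gene and cell counts are made-up illustrative numbers, not taken from any dataset in this thread.

```python
# Maximum number of elements of a dense matrix, as stated above: 2^31 - 1.
R_MAX_ELEMENTS = 2**31 - 1


def fits_dense(n_genes: int, n_cells: int) -> bool:
    """True if a dense genes x cells matrix stays within the element limit."""
    return n_genes * n_cells <= R_MAX_ELEMENTS


# Hypothetical dataset sizes, for illustration only.
examples = [(30_000, 50_000), (30_000, 100_000)]
for n_genes, n_cells in examples:
    total = n_genes * n_cells
    verdict = "fits" if fits_dense(n_genes, n_cells) else "too big"
    print(f"{n_genes} genes x {n_cells} cells = {total} elements: {verdict}")
```

With 30,000 genes, the limit is crossed somewhere between 50,000 and 100,000 cells, even though a sparse representation of the same matrix stores only the non-zero values.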

The MTX format seemed clean enough and easy to read. (BTW: what don't you like about .mtx.gz ?)

cbImportSeurat produces an .Rscript file that you can edit and run manually. You could edit it to force the .tsv.gz file and then try to read the result in R. Does that work for you? Or do you not care whether others can read the resulting .tsv.gz files with R?

Or maybe I'm missing something and it's not too hard to read gigantic .tsv.gz files and there is some trick in R to convert them into sparse matrices in pieces when reading them?
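The idea of reading a big .tsv.gz in pieces and keeping only the non-zero values can be sketched as follows. This is Python, not R, and it is only an illustration of the approach: the function read_tsv_gz_sparse is hypothetical, it assumes a header line of cell names and one gene per row, and it has not been tried on a gigantic matrix.

```python
import gzip
import os
import tempfile


def read_tsv_gz_sparse(path):
    """Read a gene x cell tsv.gz one row at a time.

    Returns (genes, cells, triplets), where triplets holds
    (row_index, col_index, value) for non-zero entries only,
    so the dense matrix is never held in memory.
    """
    genes, triplets = [], []
    with gzip.open(path, "rt") as fh:
        # First line: a label for the gene column, then the cell names.
        cells = fh.readline().rstrip("\n").split("\t")[1:]
        for row_idx, line in enumerate(fh):
            fields = line.rstrip("\n").split("\t")
            genes.append(fields[0])
            for col_idx, text in enumerate(fields[1:]):
                value = float(text)
                if value != 0:
                    triplets.append((row_idx, col_idx, value))
    return genes, cells, triplets


# Tiny made-up matrix to show the round trip.
demo = "gene\tc1\tc2\tc3\nA\t0\t2\t0\nB\t0\t0\t0\nC\t1.5\t0\t3\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "exprMatrix.tsv.gz")
    with gzip.open(demo_path, "wt") as fh:
        fh.write(demo)
    genes, cells, triplets = read_tsv_gz_sparse(demo_path)

print(genes, cells, triplets)
```

The triplets are the same information a coordinate-format sparse matrix stores, so they could be handed to a sparse matrix constructor afterwards.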

maximilianh commented 1 year ago

And, just in case: I'm not opposed to changing this, I just would like to understand if it makes sense in some context for R to write matrices that it can write, but not read.

RoganGrant commented 1 year ago

Thank you for the quick response! I have personally converted this matrix to non-sparse in R in the course of certain function calls without issue, but the documentation agrees with you. I honestly don't know how much of a risk this poses in terms of the function failing for others.

In any case I have no issue with .mtx files, but I can't get them to work at all with cbBuild. The cellbrowser.conf file still points to a single tsv file that does not exist, and manually supplying each individual file does not seem to work (next it asks for a barcodes.tsv, which is ignored if I specify directly for each assay). My ultimate solution (which worked very well) was to run mtx2tsv on each assay before deployment.
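For readers unfamiliar with the conversion step, the following Python sketch shows what turning a coordinate-format .mtx.gz into a dense .tsv.gz involves. It is not the mtx2tsv implementation from cellbrowser. The function mtx_to_tsv is hypothetical, it keeps all non-zero values in memory, and it assumes the MatrixMarket coordinate layout (comment lines starting with %, a size line, then 1-based "row column value" entries).

```python
import gzip
import os
import tempfile
from collections import defaultdict


def mtx_to_tsv(mtx_path, genes, cells, tsv_path):
    """Convert a gzipped coordinate-format .mtx to a gzipped dense tsv."""
    rows = defaultdict(dict)  # row index -> {column index: value text}
    with gzip.open(mtx_path, "rt") as fh:
        # Skip the %%MatrixMarket header, comments and blank lines.
        lines = (ln for ln in fh if ln.strip() and not ln.startswith("%"))
        n_rows, n_cols, _nnz = map(int, next(lines).split())
        if (n_rows, n_cols) != (len(genes), len(cells)):
            raise ValueError("matrix size does not match genes/cells")
        for ln in lines:
            i, j, value = ln.split()
            rows[int(i) - 1][int(j) - 1] = value  # .mtx indices are 1-based
    with gzip.open(tsv_path, "wt") as out:
        out.write("gene\t" + "\t".join(cells) + "\n")
        for r, gene in enumerate(genes):
            entries = rows.get(r, {})
            values = [entries.get(c, "0") for c in range(n_cols)]
            out.write(gene + "\t" + "\t".join(values) + "\n")


# Tiny made-up example: 2 genes x 2 cells with 2 non-zero entries.
demo_mtx = "%%MatrixMarket matrix coordinate real general\n2 2 2\n1 2 2\n2 1 1.5\n"
with tempfile.TemporaryDirectory() as tmp:
    mtx_path = os.path.join(tmp, "counts_matrix.mtx.gz")
    tsv_path = os.path.join(tmp, "counts_exprMatrix.tsv.gz")
    with gzip.open(mtx_path, "wt") as fh:
        fh.write(demo_mtx)
    mtx_to_tsv(mtx_path, ["A", "B"], ["c1", "c2"], tsv_path)
    with gzip.open(tsv_path, "rt") as fh:
        tsv_text = fh.read()

print(tsv_text)
```

Note that the dense output grows with genes x cells, so this direction trades disk space and write time for a format that is simpler to configure.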

maximilianh commented 1 year ago

Hi Rogan,

hmm... I have a few questions sorry:

"The cellbrowser.conf file still points to a single tsv file that does not exist,"
Sorry, I don't understand: do you mean that the auto-generated cellbrowser.conf file does not point to the .mtx.gz file? That's probably a bug. How about changing that filename manually in cellbrowser.conf, doesn't that work?

"and manually supplying each individual file does not seem to work"
Sorry I don't know what you mean... it's sufficient to provide the matrix.mtx.gz file, cbBuild will find the other files.

"(next it asks for a barcodes.tsv, which is ignored if I specify directly for each assay)"
Sorry, I don't understand this sentence.

RoganGrant commented 1 year ago

Sorry, I should have waited to give more concrete examples. My object has three assays (counts, data, and scale). As far as I can tell cbBuild does not handle this correctly if a .mtx file is used. If I run cbBuild without any conversion, I initially get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '[path]/counts_exprMatrix.tsv.gz'

Full trace:

INFO:root:dataRoot is not set in ~/.cellbrowser.conf or via $CBDATAROOT. Dataset hierarchies are not supported.
INFO:root:Creating [path]
INFO:root:Determining if [path]/exprMatrix.tsv.gz needs to be created
INFO:root:[path]/exprMatrix.tsv.gz does not exist. Must build matrix now.
INFO:root:Creating [path]/metaFields
INFO:root:Checking and reordering meta data to [path]/meta.tsv
INFO:root:Reading sample names from [path]/meta.tsv
INFO:root:Reading headers from file [path]/counts_exprMatrix.tsv.gz
ERROR:root:Unexpected error: (<class 'FileNotFoundError'>, FileNotFoundError(2, 'No such file or directory'), <traceback object at 0x7f2afdea0f48>)
Traceback (most recent call last):
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 4783, in cbBuildCli
    build(confFnames, outDir, port, redo=options.redo)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 4598, in build
    convertDataset(inDir, inConf, outConf, datasetDir, redo)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 3944, in convertDataset
    sampleNames, needFilterMatrix = convertMeta(inDir, inConf, outConf, datasetDir, outMetaFname)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 3539, in convertMeta
    sampleNames, needFilterMatrix = metaReorder(matrixFname, metaFname, finalMetaFname)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 2296, in metaReorder
    matrixSampleNames = readMatrixSampleNames(matrixFname)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 2288, in readMatrixSampleNames
    return readHeaders(fname)[1:]
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 3135, in readHeaders
    ifh = openFile(fname, "rtU")
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 807, in openFile
    fh = gzip.open(fname, mode, encoding=encoding)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '[path]/counts_exprMatrix.tsv.gz'

The initial cellbrowser.conf file is as follows:

# This is a bare-bones cellbrowser config file auto-generated by the command-line tool cbImportSeurat
# or directly from R with SeuratWrappers::ExportToCellbrowser().
# Look at https://github.com/maximilianh/cellBrowser/blob/master/src/cbPyLib/cellbrowser/sampleConfig/cellbrowser.conf
# for a full file that shows all possible options
name="name"
shortLabel="name"
exprMatrix="counts_exprMatrix.tsv.gz"
matrices=[ {'label':'counts','fileName':'counts_exprMatrix.tsv.gz'},
 {'label':'data','fileName':'data_exprMatrix.tsv.gz'},
 {'label':'scale','fileName':'scale_exprMatrix.tsv.gz'}]
#tags = ["10x", "smartseq2"]
meta="meta.tsv"
# possible values: "gencode-human", "gencode-mouse", "symbol" or "auto"
geneIdType="auto"
# file with gene,description (one per line) with highlighted genes, called "Dataset Genes" in the user interface
# quickGenesFile="quickGenes.csv"
clusterField="typestate"
labelField="typestate"
enumFields=["orig.ident", "HTO_maxID", "HTO_secondID", "HTO_classification", "HTO_classification.global", "hash.ID", "MULTI_ID", "MULTI_classification"$
markers = [{"file": "markers.tsv", "shortLabel": "Seurat Cluster Markers"}]
coords=[{"file": "umap.coords.tsv", "shortLabel": "Seurat umap"},
{"file": "SCVI.coords.tsv", "shortLabel": "Seurat SCVI"}]

If I modify the matrices and exprMatrix arguments in cellbrowser.conf as follows (note that scale is a smaller matrix, so it is still written as a tsv):

exprMatrix="counts_matrix.mtx.gz"
matrices=[ {'label':'counts','fileName':'counts_matrix.mtx.gz'},
 {'label':'data','fileName':'data_matrix.mtx.gz'},
 {'label':'scale','fileName':'scale_exprMatrix.tsv.gz'}]

I run into a new error, where it seems cbBuild does not recognize the additional assays:

FileNotFoundError: [Errno 2] No such file or directory: '[path]/barcodes.tsv.gz'

Full trace:

INFO:root:dataRoot is not set in ~/.cellbrowser.conf or via $CBDATAROOT. Dataset hierarchies are not supported.
INFO:root:Determining if /var/www/apps/test/name/matrix.mtx.gz needs to be created
INFO:root:/var/www/apps/test/name/matrix.mtx.gz does not exist. Must build matrix now.
INFO:root:Checking and reordering meta data to /var/www/apps/test/name/meta.tsv
INFO:root:Reading sample names from [path]/meta.tsv
INFO:root:Reading sample names for [path] -> [path]/barcodes.tsv.gz
ERROR:root:Unexpected error: (<class 'FileNotFoundError'>, FileNotFoundError(2, 'No such file or directory'), <traceback object at 0x7fdd7ffa56c8>)
Traceback (most recent call last):
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 4783, in cbBuildCli
    build(confFnames, outDir, port, redo=options.redo)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 4598, in build
    convertDataset(inDir, inConf, outConf, datasetDir, redo)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 3944, in convertDataset
    sampleNames, needFilterMatrix = convertMeta(inDir, inConf, outConf, datasetDir, outMetaFname)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 3539, in convertMeta
    sampleNames, needFilterMatrix = metaReorder(matrixFname, metaFname, finalMetaFname)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 2296, in metaReorder
    matrixSampleNames = readMatrixSampleNames(matrixFname)
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 2281, in readMatrixSampleNames
    lines = openFile(barcodePath).read().splitlines()
  File "/home/deploy/.local/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 807, in openFile
    fh = gzip.open(fname, mode, encoding=encoding)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '[path]/barcodes.tsv.gz'
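The log above shows the barcodes file being looked up as barcodes.tsv.gz, without the counts_ prefix that runSeurat.R puts in front of matrix.mtx, features.tsv and barcodes.tsv. A prefix-aware lookup could look roughly like the Python sketch below. The helper companion_paths is hypothetical and is not part of cellbrowser. It only illustrates deriving the companion file names from the matrix file name.

```python
import os

MATRIX_SUFFIX = "matrix.mtx.gz"


def companion_paths(matrix_path):
    """Derive barcodes/features paths that share the matrix file's prefix.

    'dir/counts_matrix.mtx.gz' -> 'dir/counts_barcodes.tsv.gz' and
    'dir/counts_features.tsv.gz'. An unprefixed 'matrix.mtx.gz' maps
    to plain 'barcodes.tsv.gz' and 'features.tsv.gz'.
    """
    directory, name = os.path.split(matrix_path)
    if not name.endswith(MATRIX_SUFFIX):
        raise ValueError(f"not an {MATRIX_SUFFIX} file: {matrix_path}")
    prefix = name[: -len(MATRIX_SUFFIX)]
    return {
        "barcodes": os.path.join(directory, prefix + "barcodes.tsv.gz"),
        "features": os.path.join(directory, prefix + "features.tsv.gz"),
    }


for matrix in ("data/counts_matrix.mtx.gz", "data/data_matrix.mtx.gz"):
    print(matrix, "->", companion_paths(matrix))
```

With a rule like this, each assay in the matrices list would resolve to its own barcodes and features files, and no extra configuration keys would be needed.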

Finally, if I add additional fields to specify the naming structure, it seems to be ignored (but perhaps I am using the wrong arguments):

exprMatrix="counts_matrix.mtx.gz"
matrices=[ {'label':'counts','fileName':'counts_matrix.mtx.gz'},
 {'label':'data','fileName':'data_matrix.mtx.gz'},
 {'label':'scale','fileName':'scale_exprMatrix.tsv.gz'}]
barcodes=[ {'label':'counts','fileName':'counts_barcodes.tsv.gz'},
 {'label':'data','fileName':'data_barcodes.tsv.gz'}]
features=[ {'label':'counts','fileName':'counts_features.tsv.gz'},
 {'label':'data','fileName':'data_features.tsv.gz'}]

Same error:

FileNotFoundError: [Errno 2] No such file or directory: '[path]/barcodes.tsv.gz'

Thank you for your help with this