satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

BPCells: Error opening on-disk matrix #7373

Closed mihem closed 1 year ago

mihem commented 1 year ago

When I run FindTransferAnchors on my Seurat Object v5, the function stops with:

Error in (function (cond)  :
  error in evaluating the argument 'x' in selecting a method for function 't': Error opening file: matrix/PNP14/index_data

This happens within the function ProjectCellEmbeddings.

I followed the BPCells tutorial https://satijalab.org/seurat/articles/seurat5_bpcells_interaction_vignette.html#load-data-from-multiple-h5ad-files followed by the sketch integration tutorial https://satijalab.org/seurat/articles/parsebio_sketch_integration .

The relevant parts of the analysis are probably opening the HDF5 files

sc_raw_list <- lapply(h5_path, BPCells::open_matrix_10x_hdf5) |>
    setNames(sample_names)

and writing the matrices to the disk

map2(.x = sc_raw_list,
     .y = file.path("matrix", names(sc_raw_list)),
      .f = BPCells::write_matrix_dir)

and then opening them

sc_raw_mat <-
  map(.x = file.path("matrix", names(sc_raw_list)),
      .f = open_matrix_dir) |>
  setNames(sample_names)

I followed the tutorial and everything worked fine until this error occurred. Of note, I used .qs files to store the Seurat object instead of .rds files because they are much faster. Also, I merged the objects, joined them, and split them again because of a previously mentioned issue https://github.com/satijalab/seurat/issues/7318. I then saved it to a custom directory (a relative path) as mentioned in the BPCells tutorial via:

SeuratObject::saveRDS(
  object = sc_merge,
  file = "sc_merge.rds",
  destdir = "merge_object"
)
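
For reference, the qs-based saving I used looked roughly like this (a sketch assuming the qs package; qs::qsave() is a generic serializer and knows nothing about BPCells, so any on-disk matrix paths inside the object are stored verbatim, neither copied nor validated):

```r
# Sketch of the qs-based alternative to saveRDS (assumes the qs package).
# qsave()/qread() serialize the object as-is: BPCells on-disk matrix
# directories referenced by the object are not rewritten or checked.
sc_merge <- qs::qread("sc_merge.qs")   # load a previously saved object
qs::qsave(sc_merge, "sc_merge.qs")     # save it again
```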

When I run FindIntegrationAnchors with the sketch assay, there's no error. I tried to fix it by saving the Seurat object to a different location:

SeuratObject::saveRDS(
  object = sc_merge,
  file = "sc_merge.rds",
  destdir = "matrix_final"
)

But this just saves the .rds file (much more slowly than qs), creates no matrix folders, and gives the warning message:

## Warning messages:
## 1: The matrix provided does not exist on-disk 
## 2: The matrix provided does not exist on-disk 

@Gesmira I saw that you committed a pull request because of a similar error: https://github.com/satijalab/seurat/issues/7165. However, I actually wrote the matrix to a dir instead of relying on the h5 files. Any idea why the connection is broken and how I can restore it? Is there a problem with qs files, or rather with merge, JoinLayers, and split? Is it recommended to run JoinLayers before running FindTransferAnchors?

Deeply hidden in the counts layer of the Seurat object I can find the directory; the directory exists and the files weren't changed since they were created.

sc_merge$RNA@layers$counts@matrix@matrix_list[[1]]@matrix@matrix_list[[1]]@matrix@matrix_list[[1]]@matrix@dir
merge_object/PNP01

and hidden even more deeply, data points to the initial path, as an absolute path:

sc_merge$RNA@layers$data@matrix@matrix_list[[1]]@matrix@matrix_list[[1]]@matrix@matrix_list[[1]]@matrix@matrix@matrix@matrix@dir
~/project/matrix_PNP01
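
For anyone else hunting for these paths, a small helper of my own (hypothetical, not part of Seurat or BPCells) that recursively walks the S4 slots of a matrix object and collects every `dir` slot could save some of this guesswork:

```r
# Hypothetical debugging helper (not from Seurat/BPCells): recursively
# walk the S4 slots of an object and collect every character `dir` slot,
# however deeply it is nested in wrapper classes and matrix_list entries.
find_dirs <- function(x) {
  if (isS4(x)) {
    dirs <- character()
    for (sn in slotNames(class(x))) {
      s <- methods::slot(x, sn)
      if (sn == "dir" && is.character(s)) {
        dirs <- c(dirs, s)
      } else {
        dirs <- c(dirs, find_dirs(s))
      }
    }
    dirs
  } else if (is.list(x)) {
    unlist(lapply(x, find_dirs), use.names = FALSE)
  } else {
    character()
  }
}

# e.g. find_dirs(sc_merge[["RNA"]]$counts)
```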

Thank you!

Gesmira commented 1 year ago

Hi @mihem, based on what you showed about where your matrix data is stored, our current code will not be able to find it. We are working on implementing more specific BPCells functions to retrieve the matrix paths when they are nested within lists, which should fix this! I'll keep you updated on that here.

In the meantime, if you use .rds files instead of .qs files, does that fix your issue? You don't have to move the matrices to the same folder as your object, as long as you still have access to where the matrices are currently stored.

mihem commented 1 year ago

@Gesmira thanks for your response.

I tried .rds files as you suggested, but that didn't solve this issue.

I actually saved one object at the beginning of the analysis, also in .qs, and this worked fine. So I think the issue is not related to qs, but to merge and JoinLayers.

I think this is a rather critical bug, because it means that after conducting the entire analysis (integrating layers, UMAP, annotation, etc., which takes some time despite the great features in v5) your Seurat object is corrupt, right?

I tried to debug it. After opening the h5 matrix, writing the BP matrix to a dir, and opening the BP matrix again, I create a list of Seurat objects because of the issue mentioned here https://github.com/satijalab/seurat/issues/7318. After filtering, I merge these objects like this:

sc_merge_pre <- merge(x = sc_filter[[1]], y = sc_filter[-1], merge.data = TRUE, add.cell.ids = names(sc_list))

The resulting object has the dir at the right location I think, although there is a warning:

sc_merge_pre[["RNA"]]$counts@matrix@dir

"/home/myusername/myproject/matrix/PNP01"
Warning message:
In LayerData.Assay5(object = x, layer = i) :
  multiple layers are identified by counts.1 counts.2 counts.3 counts.4 counts.5 counts.6 counts.7 counts.8 counts.9 counts.10 counts.11 counts.12 counts.13 counts.14 counts.15 counts.16
 only the first layer is used

I can save this object with SeuratObject::saveRDS, which then informs me that Warning: Changing path in object to point to new BPCells directory location, which is the expected behavior I think.

But when I now join these layers (in order to split them again, as recommended by @yuhanH here https://github.com/satijalab/seurat/issues/7316, although in a different context), the dir is now in a deeply convoluted location

sc_merge_pre[["RNA"]]$counts@matrix@matrix_list[[1]]@matrix@matrix_list[[1]]@matrix@dir

and SeuratObject::saveRDS fails with

Error in gzfile(file, mode) : cannot open the connection
In addition: Warning messages:
1: The matrix provided does not exist on-disk 
2: In gzfile(file, mode) :
  cannot open compressed file 'test6/test5.rds', probable reason 'No such file or directory'

if the dir does not exist. If the dir exists, saveRDS works (so it saves an .rds file, but no associated matrix), but gives a warning:

Warning message:
The matrix provided does not exist on-disk 

(This is only a minor thing, but I find this behavior also confusing.)

So I think that after merging and joining layers, Seurat objects with BPCells matrices are corrupt. It would be great if you could find a way to solve that. Happy to debug further.

mihem commented 1 year ago

@Gesmira

sorry, any update on this issue? I know that there are a lot of Seurat issues, but compared to other issues, I think this is an actual bug. Also, it's easy to reproduce and relevant for many users, I think.

I hope my debugging efforts help you, otherwise I am happy to debug further.

Gesmira commented 1 year ago

Hi, I am able to reproduce the issue. This is related to the issue mentioned where joining layers changes the structure of the matrices in the Seurat object. I will respond here when there is an update to the PR, so you can see if it works for you. Thank you!


Gesmira commented 1 year ago

Hi @mihem, The matrix locating should be fixed now in a new branch. Can you let me know if clearing your session, restarting R, and installing this new version of SeuratObject fixes your issues? remotes::install_github("mojaveazure/seurat-object", "feat/saveRDS_updates")

mihem commented 1 year ago

@Gesmira wow, that was super fast.

Unfortunately, it doesn't work for me:

After joining layers, saveRDS fails with a new error:

Error in .FilePath.IterableMatrix(x = ldat) : 
  object 'all_matrix_inputs' not found

Note: this error also occurs before joining layers, so I am not able to saveRDS with matrices at all (while that was possible before this fix).

EDIT: Okay, I realized that this is just due to BPCells:::all_matrix_inputs not being present in my BPCells version, so maybe I don't have the latest GitHub version of BPCells.
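
In case it helps others: once a sufficiently recent BPCells is installed, that same unexported all_matrix_inputs() generic can be used to inspect the leaf matrices without chasing nested @matrix slots by hand (a sketch; the function is internal, so its interface may change):

```r
# Sketch using the unexported BPCells generic all_matrix_inputs():
# it returns the leaf input matrices of a (possibly deeply nested)
# IterableMatrix, so any on-disk directories can be listed directly.
leaves <- BPCells:::all_matrix_inputs(sc_merge_pre[["RNA"]]$counts)
dirs <- vapply(
  leaves,
  function(m) if (methods::.hasSlot(m, "dir")) m@dir else NA_character_,
  character(1)
)
print(dirs)
```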

mihem commented 1 year ago

@Gesmira after updating BPCells to the most recent GitHub version, BPCells:::all_matrix_inputs now works. However, saveRDS of the Seurat object still doesn't work:

Saving the object before joining (which worked before) now fails with:

Warning: Changing path in object to point to new BPCells directory location
Error in SaveSeuratBP(object, filename = basename(file), destdir = destdir,  : 
  Number of BPCells matrices to replace is not equal to number of matrices

After joining layers, saveRDS fails with:

 Error in path.expand(path) : 
  no slot of name "dir" for this object of class "Iterable_dgCMatrix_wrapper"

and the traceback is:

12: path.expand(path)
11: normalizePath(path = matrix@dir)
10: FUN(X[[i]], ...)
9: lapply(matrices, return_dir_path)
8: unlist(lapply(matrices, return_dir_path))
7: .FilePath.IterableMatrix(x = ldat)
6: .FilePath(x = ldat)
5: FUN(X[[i]], ...)
4: lapply(X = Layers(object = object[[assay]]), FUN = function(lyr) {
       ldat <- LayerData(object = object[[assay]], layer = lyr)
       path <- .FilePath(x = ldat)
       if (is.null(x = path)) {
           return(NULL)
       }
       return(data.frame(layer = rep(lyr, length(path)), path = path, 
           class = rep(paste(class(x = ldat), collapse = ","), length(path)), 
           pkg = rep(.ClassPkg(object = ldat), length(path))))
   })
3: SaveSeuratBP(object, filename = basename(file), destdir = destdir, 
       ...)
2: saveRDS.Seurat(object = sc_merge_pre, file = "test5.rds", destdir = "test7")
1: SeuratObject::saveRDS(object = sc_merge_pre, file = "test5.rds", 
       destdir = "test7")

Gesmira commented 1 year ago

Ok, thank you for reporting back! Can you share the code for creating your Seurat object, as well as the output of the object, so I can see what the layer format is?

mihem commented 1 year ago

Of course, thanks for your efforts. I cleaned and shortened the script a little to make it more reproducible and keep it close to the vignette https://satijalab.org/seurat/articles/seurat5_bpcells_interaction_vignette.html

load libraries and find file paths

library(BPCells)
library(Seurat)
library(SeuratObject)
library(SeuratDisk)
library(Azimuth)
library(tidyverse)

options(Seurat.object.assay.version = "v5")

h5_path <- list.files(pattern = ".h5", recursive = TRUE) |>
    str_subset("cellbender")

Now assign the sample names; I will skip that here.

Now open the matrices, write them, load them, and create Seurat objects. Use a list of objects instead of one single Seurat object for the reasons mentioned here https://github.com/satijalab/seurat/issues/7318

sc_raw_list <-
  map(h5_path, BPCells::open_matrix_10x_hdf5) |>
    setNames(sample_names)

map2(.x = sc_raw_list,
     .y = file.path("matrix", names(sc_raw_list)),
      .f = BPCells::write_matrix_dir)

sc_raw_mat <-
  map(.x = file.path("matrix", names(sc_raw_list)),
      .f = open_matrix_dir) |>
  setNames(sample_names)

sc_raw_mat <-
  map(.x  = sc_raw_mat,
      .f = Azimuth:::ConvertEnsembleToSymbol,
      species = "human")

sc_list <- map(sc_raw_mat, CreateSeuratObject, min.cells = 3, min.features = 200)

Now compute some QC metrics and filter out low-quality cells and doublets. For the debug version of the pipeline, we will skip these steps.

sc_filter <- sc_list

Merge Seurat objects and extract patient names from barcodes

sc_merge_pre <- merge(x = sc_filter[[1]], y = sc_filter[-1], merge.data = TRUE, add.cell.ids = names(sc_list))

sc_merge_pre$patient <- str_extract(colnames(sc_merge_pre), pattern = "[^_]+")

My object now looks like this

An object of class Seurat 
23104 features across 349510 samples within 1 assay 
Active assay: RNA (23104 features, 0 variable features)
 16 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10, counts.11, counts.12, counts.13, counts.14, counts.15, counts.16

If I now want to save the object, the following error occurs (while that worked with the main branch of SeuratObject):

SeuratObject::saveRDS(
  object = sc_merge_pre,
  file = "sc_merge_pre.rds",
  destdir = "sc_merge_pre"
)

Warning: Changing path in object to point to new BPCells directory location
Error in SaveSeuratBP(object, filename = basename(file), destdir = destdir,  : 
  Number of BPCells matrices to replace is not equal to number of matrices

The object needs to be rejoined and then split again to get the correct names (better naming, and required for integration):

sc_merge_pre <- JoinLayers(sc_merge_pre)
sc_merge_pre <- split(x = sc_merge_pre, f = sc_merge_pre$patient)

The object now looks like this

An object of class Seurat 
23104 features across 349510 samples within 1 assay 
Active assay: RNA (23104 features, 0 variable features)
 16 layers present: counts.PNP01, counts.PNP02, counts.PNP03, counts.PNP04, counts.PNP05, counts.PNP06, counts.PNP07, counts.PNP08, counts.PNP09, counts.PNP10, counts.PNP11, counts.PNP12, counts.PNP13, counts.PNP14, counts.PNP15, counts.PNP16

If I now want to save the object, the following error occurs:

SeuratObject::saveRDS(
  object = sc_merge_pre,
  file = "sc_merge_pre.rds",
  destdir = "sc_merge_pre"
)

Error in path.expand(path) : 
  no slot of name "dir" for this object of class "Iterable_dgCMatrix_wrapper"

I can find the dir hidden in the following location (somewhat different from where it used to be when I used the main branch of SeuratObject):

sc_merge_pre[["RNA"]]$counts@matrix@matrix@matrix_list[[1]]@matrix@matrix_list[[1]]@matrix@matrix@dir

Gesmira commented 1 year ago

Hi @mihem, Thank you for all the details! I now see the specific way of splitting and joining layers led to a specific class I had not accounted for. Can you let me know if clearing your session, restarting R, and installing this updated branch works for you now? remotes::install_github("mojaveazure/seurat-object", "feat/saveRDS_updates")

mihem commented 1 year ago

Hi @Gesmira great, that solved the saving issue. Great work, thanks.

However, there are still two issues I have, which are related to this:

1. Object size: after joining, saving takes much longer and the resulting object is much bigger. Saving the object after merging only takes:

   user  system elapsed 
 19.396   0.532  19.928 

After joining and splitting, saving takes around 20x longer

   user  system elapsed 
389.474   1.012 390.535 

With this small object this is still okay, but if you then continue with your analysis and add layers, saving could take more than an hour.

This increase is caused both by the time between the "Warning: Changing path in object to point to new BPCells directory location" messages being much longer (which probably has to do with the code you changed, although I can see that the matrix directories are created in the first few seconds) and by the saving of the .rds object itself (i.e. after all the "Warning: Changing path ..." messages are over), which is now 297M and was previously only 23M. I find this confusing, because the only thing I did was rejoin and split, so no new information is stored there.

2. IntegrateLayers with scVI fails: after continuing with this object as described in this vignette: https://satijalab.org/seurat/articles/parsebio_sketch_integration (the only mentionable modification I made was to split before normalization, as you also do in this vignette https://satijalab.org/seurat/articles/seurat5_integration.html and in previous Seurat versions):

sc_merge_pre <- JoinLayers(sc_merge_pre)
sc_merge_pre <- split(x = sc_merge_pre, f = sc_merge_pre$patient)
sc_merge <- NormalizeData(sc_merge_pre, verbose = TRUE, normalization.method = "LogNormalize", scale.factor = 10000)
sc_merge <- FindVariableFeatures(sc_merge, selection.method = "vst", nfeatures = 2000)
sc_merge <- SketchData(object = sc_merge, ncells = 5000, method = "LeverageScore", sketched.assay = "sketch")
sc_merge <-
  sc_merge |>
  FindVariableFeatures(selection.method = "vst", nfeatures = 2000) |>
  ScaleData() |>
  RunPCA()

IntegrateLayers via scVI fails (while it worked fine for me previously with the SeuratObject main branch; harmony also works fine):

sc_merge <- IntegrateLayers(
  object = sc_merge,
  method = scVIIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.scvi",
  conda_env = "/home/mischko/miniconda3/envs/scvi",
  verbose = TRUE
)

Error in file(con, "w") : cannot open the connection
In addition: Warning message:
In file(con, "w") :
  cannot open file '/tmp/RtmpdGQ7rO/file60672e553190.sh': No such file or directory

traceback:

10: file(con, "w")
9: writeLines(commands, fi)
8: conda_run2_nix(...)
7: conda_run2("python", c("-c", shQuote("import os; print(os.environ['PATH'])")), 
       conda = conda_info$conda, envname = conda_info$root, intern = TRUE)
6: python_munge_path(python)
5: python_config(python)
4: use_python(python, required = required)
3: reticulate::use_condaenv(conda_env, required = TRUE)
2: method(object = object[[assay]], assay = assay, orig = obj.orig, 
       layers = layers, scale.layer = scale.layer, features = features, 
       groups = groups, ...)
1: IntegrateLayers(object = sc_merge, method = scVIIntegration, 
       orig.reduction = "pca", new.reduction = "integrated.scvi", 
       conda_env = "/home/mischko/miniconda3/envs/scvi", verbose = TRUE)

my object looks like this:

An object of class Seurat 
46208 features across 283779 samples within 2 assays 
Active assay: sketch (23104 features, 2000 variable features)
 33 layers present: counts.PNP01, counts.PNP02, counts.PNP03, counts.PNP04, counts.PNP05, counts.PNP06, counts.PNP07, counts.PNP08, counts.PNP09, counts.PNP10, counts.PNP11, counts.PNP12, counts.PNP13, counts.PNP14, counts.PNP15, counts.PNP16, data.PNP01, data.PNP02, data.PNP03, data.PNP04, data.PNP05, data.PNP06, data.PNP07, data.PNP08, data.PNP09, data.PNP10, data.PNP11, data.PNP12, data.PNP13, data.PNP14, data.PNP15, data.PNP16, scale.data
 1 other assay present: RNA
 1 dimensional reduction calculated: pca
Gesmira commented 1 year ago

Interesting, this was also reported here. It seems that running SketchData with certain objects sometimes blocks the ability to open further temporary files. Can you see if saving the object, clearing and restarting R, then loading it back in and running the integration step fixes the issue?

mihem commented 1 year ago

@Gesmira thank you, that worked. That's a weird error, but then this is not related to this issue, sorry.

And do you have any ideas regarding the increase in object size / time to saveRDS?

mihem commented 1 year ago

@Gesmira Sorry to keep posting issues in this thread, but I keep having problems with the on-disk matrix. Saving RDS works after joining and splitting, and IntegrateLayers also worked after restarting R.

I then followed the vignette https://satijalab.org/seurat/articles/parsebio_sketch_integration and ran UMAP (after integrating via harmony and scVI). My object looks like this:

An object of class Seurat 
46208 features across 283779 samples within 2 assays 
Active assay: sketch (23104 features, 2000 variable features)
 33 layers present: counts.PNP01, counts.PNP02, counts.PNP03, counts.PNP04, counts.PNP05, counts.PNP06, counts.PNP07, counts.PNP08, counts.PNP09, counts.PNP10, counts.PNP11, counts.PNP12, counts.PNP13, counts.PNP14, counts.PNP15, counts.PNP16, data.PNP01, data.PNP02, data.PNP03, data.PNP04, data.PNP05, data.PNP06, data.PNP07, data.PNP08, data.PNP09, data.PNP10, data.PNP11, data.PNP12, data.PNP13, data.PNP14, data.PNP15, data.PNP16, scale.data
 1 other assay present: RNA
 5 dimensional reductions calculated: pca, integrated.scvi, umap.scvi, harmony, umap.harmony

I then wanted to run ProjectIntegration, but this failed due to a missing directory error:

sc_merge <- ProjectIntegration(
  object = sc_merge,
  sketched.assay = "sketch",
  assay = "RNA",
  reduction = "integrated.scvi")

Error: Missing directory: NA

traceback

37: stop(structure(list(message = "Missing directory: NA", call = NULL, 
        cppstack = NULL), class = c("std::invalid_argument", "C++Error", 
    "error", "condition")))
36: iter_function(x@dir, x@buffer_size, x@dimnames[[1]], x@dimnames[[2]], 
        nrow(x))
35: iterate_matrix(x@matrix)
34: iterate_matrix(x@matrix)
33: iterate_matrix(x@matrix)
32: iterate_matrix(x@matrix)
31: iter_function(iterate_matrix(x@matrix), row_names, col_names, 
        is.null(rownames(x)), is.null(colnames(x)))
30: FUN(X[[i]], ...)
29: FUN(X[[i]], ...)
28: lapply(x@matrix_list, iterate_matrix)
27: iterate_matrix(x@matrix)
26: iterate_matrix(x@matrix)
25: FUN(X[[i]], ...)
24: FUN(X[[i]], ...)
23: lapply(x@matrix_list, iterate_matrix)
22: iterate_matrix(x@matrix)
21: iterate_matrix(x@matrix)
20: iterate_matrix(x@matrix)
19: iterate_matrix(x@matrix)
18: iter_function(iterate_matrix(x@matrix), row_names, col_names, 
        is.null(rownames(x)), is.null(colnames(x)))
17: iterate_matrix(x@matrix)
16: iterate_matrix(x@matrix)
15: iterate_matrix(x@matrix)
14: iterate_matrix(x@matrix)
13: iterate_matrix(x@matrix)
12: iterate_matrix(x@matrix)
11: iterate_matrix_log1psimd_cpp(iterate_matrix(x@matrix))
10: iterate_matrix(x@matrix)
9: iterate_matrix(x@matrix)
8: iter_function(iterate_matrix(x@matrix), row_names, col_names, 
       is.null(rownames(x)), is.null(colnames(x)))
7: iterate_matrix(convert_matrix_type(y, "double"))
6: iterate_matrix(convert_matrix_type(y, "double"))
5: t(x) %*% y
4: t(x) %*% y
3: matrix.prod.function(x = as.matrix(sketch.matrix %*% sketch.transform), 
       y = orig.data)
2: UnSketchEmbeddings(atom.data = LayerData(object = object[[sketched.assay]], 
       layer = layers[i], features = features), atom.cells = cells.sketch, 
       orig.data = LayerData(object = object[[assay]], layer = layers[i], 
           features = features), embeddings = Embeddings(object = object[[reduction]]), 
       sketch.matrix = sketch.matrix)
1: ProjectIntegration(object = sc_merge, sketched.assay = "sketch", 
       assay = "RNA", reduction = "integrated.scvi")

Also saveRDS with destdir now fails:

 Error in x[[jj]][iseq] <- vjj : replacement has length zero
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Actually, the matrices of all odd-numbered samples are saved (PNP01, PNP03, PNP05, etc.) but not the even ones (PNP02, PNP04, etc.), and the object is not saved. I tried restarting R and reloading the object, but that didn't help.

I also observed that saveRDS with destdir does not work even before IntegrateLayers. So joining and splitting layers and then saving worked fine. But if I then load this object and want to save it again (without any further modifications), the above-mentioned error still occurs.

Gesmira commented 1 year ago

Hi, No problem! It is helpful to see edge cases where users run into issues!

Sorry, I am unable to reproduce this issue, though. The error implies that there is a directory in your object called "NA" and it cannot be found. So perhaps something went wrong when the directories were rewritten or things were moved around.

Furthermore, to avoid confusion, you do not need to always use the destdir parameter unless you would like to make an easy-to-share folder or have all the files in the same place. If you simply run saveRDS(object, "/path/to/object"), the object will retain the information about where the BPCells matrices are stored rather than copying/rewriting them, and (as long as you still have access to those folders) it will work fine when you load the object back in.
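
In other words, the plain base-R round trip is enough as long as the matrix directories stay where they are (a minimal sketch of that workflow):

```r
# Plain base-R serialization: the object keeps references to the existing
# BPCells directories instead of copying them, so those directories must
# still be accessible at the same paths when the object is loaded again.
saveRDS(sc_merge, file = "sc_merge.rds")
sc_merge <- readRDS("sc_merge.rds")
```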

Also, the temporary file issue you mentioned previously is fixed now in the seurat5 branch of Seurat.

 remotes::install_github("satijalab/seurat", "seurat5", quiet = TRUE)

If you are still encountering issues, another reproducible example would be useful, thank you!

mihem commented 1 year ago

hi @Gesmira

thank you, that's great.

Yes, I know that I don't have to specify destdir. However, if I don't specify destdir and just save the object, there is no error, so specifying destdir is kind of my way of checking whether the object is corrupt.

To make it reproducible:

Just see my comment above https://github.com/satijalab/seurat/issues/7373#issuecomment-1573972882. Now, after JoinLayers, split, and saveRDS with destdir (which now works thanks to your latest fix in SeuratObject), close the R session and open it again. Now read the R object and save it again, specifying destdir:

So after restarting the R session (sorry, the directory is called test1, not sc_merge_pre as in my comment above):

sc_merge_pre <- readRDS(file.path("test1", "sc_merge_pre.rds"))

SeuratObject::saveRDS(
  object = sc_merge_pre,
  file = "sc_merge_pre.rds",
  destdir = "test2"
)

Error in x[[jj]][iseq] <- vjj : replacement has length zero
In addition: There were 50 or more warnings (use warnings() to see the first 50)

traceback

5: `[<-.data.frame`(`*tmp*`, i, "path", value = structure(list(NULL), names = NA_character_))
4: `[<-`(`*tmp*`, i, "path", value = structure(list(NULL), names = NA_character_))
3: SaveSeuratBP(object, filename = basename(file), destdir = destdir, 
       ...)
2: saveRDS.Seurat(object = sc_merge_pre, file = "sc_merge_pre.rds", 
       destdir = "test2")
1: SeuratObject::saveRDS(object = sc_merge_pre, file = "sc_merge_pre.rds", 
       destdir = "test2")

Also, when I continue with my analysis I get the missing directory NA error again (I think I didn't encounter that previously because I didn't restart the R session). So it's not specific to ProjectIntegration; rather, after restarting the R session most/all Seurat commands no longer work with that object.

sc_merge <- NormalizeData(sc_merge_pre)
Error in (function (cond)  : 
  error in evaluating the argument 'x' in selecting a method for function 't': Missing directory: NA

traceback

37: (function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "Missing directory: NA", 
        call = NULL, cppstack = NULL), class = c("std::invalid_argument", 
    "C++Error", "error", "condition")))
36: stop(structure(list(message = "Missing directory: NA", call = NULL, 
        cppstack = NULL), class = c("std::invalid_argument", "C++Error", 
    "error", "condition")))
35: iter_function(x@dir, x@buffer_size, x@dimnames[[1]], x@dimnames[[2]], 
        nrow(x))
34: iterate_matrix(x@matrix)
33: iterate_matrix(x@matrix)
32: iterate_matrix(x@matrix)
31: iterate_matrix(x@matrix)
30: iter_function(iterate_matrix(x@matrix), row_names, col_names, 
        is.null(rownames(x)), is.null(colnames(x)))
29: FUN(X[[i]], ...)
28: FUN(X[[i]], ...)
27: lapply(x@matrix_list, iterate_matrix)
26: iterate_matrix(x@matrix)
25: iterate_matrix(x@matrix)
24: FUN(X[[i]], ...)
23: FUN(X[[i]], ...)
22: lapply(x@matrix_list, iterate_matrix)
21: iterate_matrix(x@matrix)
20: iterate_matrix(x@matrix)
19: iterate_matrix(x@matrix)
18: iterate_matrix(x@matrix)
17: iter_function(iterate_matrix(x@matrix), row_names, col_names, 
        is.null(rownames(x)), is.null(colnames(x)))
16: iterate_matrix(x@matrix)
15: iterate_matrix(x@matrix)
14: iterate_matrix(convert_matrix_type(x, "double"))
13: iterate_matrix(convert_matrix_type(x, "double"))
12: .local(x, ...)
11: colSums(data)
10: colSums(data)
9: BPCells::t(BPCells::t(data)/colSums(data))
8: LogNormalize.IterableMatrix(data = object, scale.factor = scale.factor, 
       margin = cmargin, verbose = verbose, ...)
7: LogNormalize(data = object, scale.factor = scale.factor, margin = cmargin, 
       verbose = verbose, ...)
6: NormalizeData.default(object = LayerData(object = object, layer = l, 
       fast = NA), normalization.method = normalization.method, 
       scale.factor = scale.factor, margin = margin, verbose = verbose, 
       layer = save, ...)
5: NormalizeData(object = LayerData(object = object, layer = l, 
       fast = NA), normalization.method = normalization.method, 
       scale.factor = scale.factor, margin = margin, verbose = verbose, 
       layer = save, ...)
4: NormalizeData.StdAssay(object = object[[assay]], normalization.method = normalization.method, 
       scale.factor = scale.factor, verbose = verbose, margin = margin, 
       ...)
3: NormalizeData(object = object[[assay]], normalization.method = normalization.method, 
       scale.factor = scale.factor, verbose = verbose, margin = margin, 
       ...)
2: NormalizeData.Seurat(sc_merge_pre)
1: NormalizeData(sc_merge_pre)

Ah, and I can now see that there are indeed several dirs which are called "NA".

the first entry is as expected

sc_merge_pre[["RNA"]]$counts@matrix@matrix@matrix_list[[1]]@matrix@matrix_list[[1]]@matrix@matrix@dir

[1] "test1/PNP01"

then the second entry strangely shows PNP03 and not PNP02 (there should be PNP01, PNP02, PNP03, etc. up to PNP16; all of those directories exist)

sc_merge_pre[["RNA"]]$counts@matrix@matrix@matrix_list[[2]]@matrix@matrix_list[[1]]@matrix@matrix@dir

[1] "test1/PNP03"

and then from entry 9 on, the directory is "NA"

sc_merge_pre[["RNA"]]$counts@matrix@matrix@matrix_list[[9]]@matrix@matrix_list[[1]]@matrix@matrix@dir
[1] NA

So you were right with your assumption that there are several directories in the object called NA. And this fits my previous observation that suddenly only the matrices of odd-numbered samples are saved (PNP01, PNP03, etc.), but no PNP02 directory is created. I assume this has to do with your last fix in the SeuratObject feat/saveRDS_updates branch.

Gesmira commented 1 year ago

Hi, following your code with my own objects I am unable to reproduce this issue, while I was able to reproduce your earlier issues. Are you able to send me the data and script (perhaps to make a smaller object with just 4 of the samples), so that I could reproduce it?

mihem commented 1 year ago

hi @Gesmira

Ok, that's strange; I can reproduce this issue, also with only 4 samples. I used your contact form on gesmira.com to send you the data and the script.

There are actually two issues: 1) not critical, but annoying and surprising: the increase of the .rds file size after JoinLayers and split, as mentioned above and here https://github.com/satijalab/seurat/issues/7316; 2) the critical bug: after saving and restarting R (without restarting, everything works fine), the directory names are incorrect, all even-numbered samples are skipped, so that in this case the third and fourth entries have directory "NA". This prevents any further action, like running NormalizeData.

Thanks!

mihem commented 1 year ago

@Gesmira

Thank you. That solves the last critical error (2), and in the debug dataset with four samples everything works nicely (although the increase in size, point 1 in my previous message, remains).

However, when I used the entire dataset (16 h5 files in my case), two errors unfortunately arose:

  1. When running FindTransferAnchors, the following error occurs:

anchors <- FindTransferAnchors(
  reference = reference_list$ref,
  query = reference_list$query,
  normalization.method = "LogNormalize",
  features = features
)

Error in (function (cond)  :
  error in evaluating the argument 'x' in selecting a method for function 't': Error opening file: /home/username/seurat_debug_v2/matrix/PNP14/index_starts

Of note, this is the same error as mentioned in my very first message of this long issue thread. However, unlike at that time, I can now save the rds object, including changing the dir. Also, not running JoinLayers solves this problem; however, I would actually like to perform it on the entire dataset and not separately for each sample.

  2. When restarting the R session, reloading the object with the matrices saved in the project folder, and then running NormalizeData, the following error occurs, which I have not encountered before:
sc_merge <- NormalizeData(sc_merge)
corrupted size vs. prev_size

@Gesmira i will send you a reproducible example and the code.

Gesmira commented 1 year ago

Hi @mihem,
@bnprks from BPCells has reported that the first issue likely arises as there is a default limit of 512 open files at a time on Windows so accessing many on-disk matrices would hit this limit. I believe this comes into play in your case because each layer is made up of a mixture of the 16 on-disk matrices. He has pushed a more informative error message and is testing if this limit can be raised. In the meantime, I would recommend combining samples before writing to BPCells so you have less layers (and less on-disk matrices). Or after you split again (sc_merge_pre <- split(x = sc_merge_pre, f = sc_merge_pre$patient)) you could run something like this so that each layer has its own BPCells matrix directory.

for (i in Layers(sc_merge_pre)){
    sc_merge_pre[["RNA"]][[i]] <- as(sc_merge_pre[["RNA"]][[i]], "dgCMatrix")
    sc_merge_pre[["RNA"]][[i]] <- write_matrix_dir(sc_merge_pre[["RNA"]][[i]], dir = paste0("matrix2/", gsub("counts.", "", i)))
}
mihem commented 1 year ago

@Gesmira @bnprks

Great work and good catch. You are absolutely right: the problem is the number of open files. I updated BPCells (main branch) and now I get the more informative "too many open files" error.

## + Performing PCA on the provided reference using 2000 features as input.
## Projecting cell embeddings
## Error in (function (cond)  :
##   error in evaluating the argument 'x' in selecting a method for function 't': Error opening file: Too many open files: /home/user/seurat_debug_v2/matrix/PNP15/val_data
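(As an aside, on Linux the diagnosis can be verified from outside R by counting the file descriptors the session is holding; `pgrep -n R` below is an assumption that a single R process is running.)

```shell
# Count open file descriptors of the newest R process via /proc (Linux only).
RPID=$(pgrep -n R)
ls "/proc/$RPID/fd" | wc -l   # each open BPCells file contributes descriptors
```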

@Gesmira I did not entirely understand your first idea, but running

for (i in Layers(sc_merge_pre)){
    sc_merge_pre[["RNA"]][[i]] <- as(sc_merge_pre[["RNA"]][[i]], "dgCMatrix")
    sc_merge_pre[["RNA"]][[i]] <- write_matrix_dir(sc_merge_pre[["RNA"]][[i]], dir = paste0("matrix2/", gsub("counts.", "", i)))
}

actually solved all problems:

  1. NormalizeData runs even after reloading, and the corrupted size vs. prev_size error does not occur anymore.
  2. FindTransferAnchors now runs successfully, and there is no "too many open files" issue anymore.
  3. The object is much smaller: before running this for loop it was 290M, now it's 24M.

So there's no need to dig further into problem 1 (corrupted size); it's all related. To be honest, I don't understand why 16 matrices reach the limit of 512, and if this is related to Windows, why not check for the OS and allow more open files on Linux?
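(For what it's worth, on Linux the per-process limit is adjustable per shell session; a minimal sketch, assuming the hard limit permits the new value:)

```shell
ulimit -n        # current soft limit on open files (often 1024 on Linux)
ulimit -Hn       # hard limit: the ceiling an unprivileged user may set
ulimit -n 4096   # raise the soft limit for this shell, then start R from it
```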

Shouldn't this for loop be integrated into some function when writing a matrix to a directory with multiple layers?

Gesmira commented 1 year ago

Hi @mihem, glad to hear the approach fixed your issue! We'll think about implementing something like this; perhaps when you JoinLayers and split, instead of creating 16 layers that are a combination of the original 16 matrices, there could be an option to write out the new layers to a directory. For now, I'll close the issue, but thanks for your thorough bug reporting throughout!

rootze commented 1 year ago

@Gesmira Hello, I have a similar problem when using BPCells; I really hope you can help me out. Thank you! I was running a test on the example provided by the tutorial, but failed to save the RDS object at the end. Error message: Error: Error: Can't move this dir to new path: Error: [Unknown system error -122] Failed to copy

library(BPCells)
library(Seurat)
library(SeuratObject)
library(SeuratDisk)
library(Azimuth)

options(Seurat.object.assay.version = "v5")

file.dir <- "/dfs7/swaruplab/Projects/rawdata/test/"
files.set <- c("ahern_pbmc.h5ad", "yoshida_pbmc.h5ad")

# Loop through h5ad files and output BPCells matrices on-disk
data.list <- list()
metadata.list <- list()

for (i in seq_along(files.set)) {
  path <- paste0(file.dir, files.set[i])
  data <- open_matrix_anndata_hdf5(path)
  write_matrix_dir(
    mat = data,
    dir = paste0(gsub(".h5ad", "", path), "_BP")
  )
  # Load in BP matrices
  mat <- open_matrix_dir(dir = paste0(gsub(".h5ad", "", path), "_BP"))
  mat <- Azimuth:::ConvertEnsembleToSymbol(mat = mat, species = "human")
  # Get metadata
  metadata.list[[i]] <- LoadH5ADobs(path = path)
  data.list[[i]] <- mat
}
# Name layers
names(data.list) <- c("ahern", "yoshida")

# Add Metadata
for (i in seq_along(metadata.list)) {
  metadata.list[[i]]$publication <- names(data.list)[i]
}
metadata.list <- lapply(metadata.list, function(x) {
  x <- x[, c("publication", "sex", "cell_type", "donor_id", "disease")]
  return(x)
})
metadata <- Reduce(rbind, metadata.list)

options(Seurat.object.assay.version = "v5")
merged.object <- CreateSeuratObject(counts = data.list, meta.data = metadata)
merged.object
# > merged.object
# An object of class Seurat
# 25431 features across 1258368 samples within 1 assay
# Active assay: RNA (25431 features, 0 variable features)
#  2 layers present: counts.ahern, counts.yoshida

saveRDS(
  object = merged.object,
  file = "obj.Rds",
  destdir = "/dfs7/swaruplab/Projects/rawdata/test/merged/"
)

Error message:

> saveRDS(
+   object = merged.object,
+   file = "obj.Rds",
+   destdir = "/dfs7/swaruplab/Projects/rawdata/test/merged/"
+ )
Warning: Changing path in object to point to new BPCells directory location
Warning: Changing path in object to point to new BPCells directory location
Warning messages:
1: Error: Error: Can't move this dir to new path: Error: [Unknown system error -122] Failed to copy '/dfs7/swaruplab/Projects/rawdata/test/ahern_pbmc_BP/col_names' to '/dfs7/swaruplab/Projects/rawdata/test/merged/ahern_pbmc_BP/col_names': Unknown system error -122
2: Error: Error: Can't move this dir to new path: Error: [Unknown system error -122] Failed to copy '/dfs7/swaruplab/Projects/rawdata/test/yoshida_pbmc_BP/col_names' to '/dfs7/swaruplab/Projects/rawdata/test/merged/yoshida_pbmc_BP/col_names': Unknown system error -122

The sessionInfo is shown below:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Rocky Linux 8.8 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /dfs7/swaruplab/zechuas/tools_software/miniconda3/envs/seurat5/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.1.2             biomaRt_2.56.1          Azimuth_0.4.6.9004
[4] shinyBS_0.61.1          SeuratDisk_0.0.0.9020   SeuratObject_4.9.9.9086
[7] sp_2.0-0                BPCells_0.1.0

loaded via a namespace (and not attached):
  [1] fs_1.6.3                          ProtGenerics_1.32.0
  [3] matrixStats_1.0.0                 spatstat.sparse_3.0-2
  [5] bitops_1.0-7                      DirichletMultinomial_1.42.0
  [7] TFBSTools_1.38.0                  httr_1.4.6
  [9] RColorBrewer_1.1-3                tools_4.3.1
 [11] sctransform_0.3.5                 utf8_1.2.3
 [13] R6_2.5.1                          DT_0.28
 [15] lazyeval_0.2.2                    uwot_0.1.16
 [17] rhdf5filters_1.12.1               withr_2.5.0
 [19] prettyunits_1.1.1                 gridExtra_2.3
 [21] progressr_0.14.0                  cli_3.6.1
 [23] Biobase_2.60.0                    spatstat.explore_3.2-1
 [25] fastDummies_1.7.3                 EnsDb.Hsapiens.v86_2.99.0
 [27] shinyjs_2.1.0                     Seurat_4.9.9.9058
 [29] spatstat.data_3.0-1               readr_2.1.4
 [31] ggridges_0.5.4                    pbapply_1.7-2
 [33] Rsamtools_2.16.0                  R.utils_2.12.2
 [35] parallelly_1.36.0                 BSgenome_1.68.0
 [37] RSQLite_2.3.1                     generics_0.1.3
 [39] BiocIO_1.10.0                     gtools_3.9.4
 [41] ica_1.0-3                         spatstat.random_3.1-5
 [43] googlesheets4_1.1.1               GO.db_3.17.0
 [45] Matrix_1.6-0                      fansi_1.0.4
 [47] S4Vectors_0.38.1                  abind_1.4-5
 [49] R.methodsS3_1.8.2                 lifecycle_1.0.3
 [51] yaml_2.3.7                        SummarizedExperiment_1.30.2
 [53] rhdf5_2.44.0                      BiocFileCache_2.8.0
 [55] Rtsne_0.16                        grid_4.3.1
 [57] blob_1.2.4                        promises_1.2.1
 [59] shinydashboard_0.7.2              crayon_1.5.2
 [61] miniUI_0.1.1.1                    lattice_0.21-8
 [63] cowplot_1.1.1                     GenomicFeatures_1.52.1
 [65] annotate_1.78.0                   KEGGREST_1.40.0
 [67] pillar_1.9.0                      GenomicRanges_1.52.0
 [69] rjson_0.2.21                      future.apply_1.11.0
 [71] codetools_0.2-19                  fastmatch_1.1-3
 [73] leiden_0.4.3                      glue_1.6.2
 [75] data.table_1.14.8                 vctrs_0.6.3
 [77] png_0.1-8                         spam_2.9-1
 [79] cellranger_1.1.0                  gtable_0.3.3
 [81] poweRlaw_0.70.6                   cachem_1.0.8
 [83] Signac_1.11.9000                  S4Arrays_1.0.5
 [85] mime_0.12                         pracma_2.4.2
 [87] survival_3.5-5                    gargle_1.5.2
 [89] RcppRoll_0.3.0                    ellipsis_0.3.2
 [91] fitdistrplus_1.1-11               ROCR_1.0-11
 [93] nlme_3.1-163                      bit64_4.0.5
 [95] progress_1.2.2                    filelock_1.0.2
 [97] RcppAnnoy_0.0.21                  GenomeInfoDb_1.36.1
 [99] irlba_2.3.5.1                     KernSmooth_2.23-22
[101] colorspace_2.1-0                  seqLogo_1.66.0
[103] BiocGenerics_0.46.0               DBI_1.1.3
[105] tidyselect_1.2.0                  bit_4.0.5
[107] compiler_4.3.1                    curl_5.0.1
[109] hdf5r_1.3.8                       xml2_1.3.5
[111] DelayedArray_0.26.7               plotly_4.10.2
[113] rtracklayer_1.60.0                scales_1.2.1
[115] caTools_1.18.2                    lmtest_0.9-40
[117] rappdirs_0.3.3                    stringr_1.5.0
[119] digest_0.6.33                     goftest_1.2-3
[121] presto_1.0.0                      spatstat.utils_3.0-3
[123] XVector_0.40.0                    htmltools_0.5.6
[125] pkgconfig_2.0.3                   MatrixGenerics_1.12.3
[127] dbplyr_2.3.3                      fastmap_1.1.1
[129] ensembldb_2.24.0                  rlang_1.1.1
[131] htmlwidgets_1.6.2                 shiny_1.7.5
[133] zoo_1.8-12                        jsonlite_1.8.7
[135] BiocParallel_1.34.2               R.oo_1.25.0
[137] RCurl_1.98-1.12                   magrittr_2.0.3
[139] GenomeInfoDbData_1.2.10           dotCall64_1.0-2
[141] patchwork_1.1.2                   Rhdf5lib_1.22.0
[143] munsell_0.5.0                     Rcpp_1.0.11
[145] reticulate_1.31                   stringi_1.7.12
[147] zlibbioc_1.46.0                   MASS_7.3-60
[149] plyr_1.8.8                        parallel_4.3.1
[151] listenv_0.9.0                     ggrepel_0.9.3
[153] deldir_1.0-9                      CNEr_1.36.0
[155] Biostrings_2.68.1                 splines_4.3.1
[157] tensor_1.5                        hms_1.1.3
[159] BSgenome.Hsapiens.UCSC.hg38_1.4.5 igraph_1.5.1
[161] spatstat.geom_3.2-4               RcppHNSW_0.4.1
[163] reshape2_1.4.4                    stats4_4.3.1
[165] TFMPvalue_0.0.9                   XML_3.99-0.14
[167] JASPAR2020_0.99.10                tzdb_0.4.0
[169] httpuv_1.6.11                     RANN_2.6.1
[171] tidyr_1.3.0                       purrr_1.0.2
[173] polyclip_1.10-4                   future_1.33.0
[175] SeuratData_0.2.2.9001             scattermore_1.2
[177] ggplot2_3.4.2                     xtable_1.8-4
[179] restfulr_0.0.15                   AnnotationFilter_1.24.0
[181] RSpectra_0.16-1                   later_1.3.1
[183] googledrive_2.1.1                 viridisLite_0.4.2
[185] tibble_3.2.1                      memoise_2.0.1
[187] AnnotationDbi_1.62.2              GenomicAlignments_1.36.0
[189] IRanges_2.34.1                    cluster_2.1.4
[191] globals_0.16.2
Gesmira commented 1 year ago

Hi @rootze, Can you confirm you have disk space available?

rootze commented 1 year ago

> Hi @rootze, Can you confirm you have disk space available?

@Gesmira Thanks for your quick response. Yes. I have more than enough disk space available.
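(Aside, not from the thread: on Linux, errno 122 is EDQUOT, "Disk quota exceeded", so the copy can fail on a filesystem quota even when free space looks plentiful. A quick check against the destination from the error message:)

```shell
# Free space on the destination filesystem (path from the error message above).
df -h /dfs7/swaruplab/Projects/rawdata/test/merged/
# Per-user quota on that filesystem, if quotas are enforced.
quota -s 2>/dev/null || echo "quota tool unavailable or no quotas set"
```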

rootze commented 1 year ago

@Gesmira FYI, I followed the tutorial from the section Load Data from one h5 file, and it works. I guess something in the process of handling h5ad files stops and causes the error. Thank you!