laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0

MappedCollection reported an error on the specific data set #1845

Closed gefujing closed 2 weeks ago

gefujing commented 2 weeks ago

Hello! We are testing large-scale data processing with the cellxgene database in lamindb, following the guide at https://docs.lamin.ai/scrna5.

The main code we run is as follows:

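import time

import lamindb as ln
from torch.utils.data import DataLoader, WeightedRandomSampler
from tqdm import tqdm

# Query AnnData artifacts from the public cellxgene instance and take the first 50.
artifacts = ln.Artifact.using("laminlabs/cellxgene")
artifacts = artifacts.filter(_accessor="AnnData")
artifacts = artifacts[0:50]
# Save each artifact to the current instance.
for artifact in artifacts:
    artifact.save()
artifactsfortest = artifacts
collection = ln.Collection(
    artifactsfortest,
    name="traintime",
    version="1",
)
collection.describe()
collection.save()
# Map the collection into a dataset, taking the union of variables (outer join).
dataset = collection.mapped(obs_keys=["cell_type"], join="outer")
# Sample cells with weights derived from the cell_type labels.
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("cell_type"),
    num_samples=len(dataset),
)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler, num_workers=8, prefetch_factor=2)
# Time one full pass over the dataloader.
start_time = time.time()
for batch_idx, batch in enumerate(tqdm(dataloader, desc="Processing Batches")):
    pass
end_time = time.time()
total_time = end_time - start_time
print(f"Total time taken to process all batches: {total_time:.4f} seconds")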

However, the code fails with the following error, and we have not been able to determine why:

IndexError: Caught IndexError in DataLoader worker process 12.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/lamindb/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/lamindb/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/lamindb/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/ubuntu/miniconda3/envs/lamindb/lib/python3.11/site-packages/lamindb/core/_mapped_collection.py", line 286, in __getitem__
    out[layers_key] = self._get_data_idx(
                      ^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/lamindb/lib/python3.11/site-packages/lamindb/core/_mapped_collection.py", line 341, in _get_data_idx
    lazy_data_idx[var_idxs_join[indices[s]]] = data_s
                  ~~~~~~~~~~~~~^^^^^^^^^^^^
IndexError: index 25025 is out of bounds for axis 0 with size 25021

After testing, our tentative conclusion is that the error does not depend on the number of cells (many small datasets we tested ran without problems). The error originates in the MappedCollection class defined by lamindb. In its __getitem__ method, this class extracts gene expression values and labels using indexes generated earlier. We suspect that merging multiple datasets caused an error when those indexes were generated.
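To make the failing step concrete, here is a minimal sketch of how an outer join over several datasets might map each dataset's local gene positions into a shared index. The names var_idxs_join and indices echo the traceback; build_join_index, scatter_row, and the toy gene lists are hypothetical and are not lamindb's actual implementation.

```python
import numpy as np


def build_join_index(dataset_vars, joined_vars):
    """Map each local variable position to its position in the joined variable list."""
    position = {name: i for i, name in enumerate(joined_vars)}
    return np.array([position[name] for name in dataset_vars])


def scatter_row(var_idxs_join, indices, data, n_joined):
    """Place one sparse row's values into a dense vector over the joined variables."""
    indices = np.asarray(indices)
    # A stored index must address one of this dataset's own variables.
    if indices.size and indices.max() >= len(var_idxs_join):
        raise IndexError(
            f"stored index {indices.max()} exceeds the dataset's "
            f"{len(var_idxs_join)} variables"
        )
    row = np.zeros(n_joined, dtype=np.float32)
    row[var_idxs_join[indices]] = data
    return row


# Two toy datasets with partially overlapping genes (outer join = union).
vars_a = ["g1", "g2", "g3"]
vars_b = ["g2", "g4"]
joined_vars = sorted(set(vars_a) | set(vars_b))

# Dataset B has two variables; they land at positions 1 and 3 of the joined list.
idx_b = build_join_index(vars_b, joined_vars)
row = scatter_row(idx_b, indices=[0, 1], data=[5.0, 7.0], n_joined=len(joined_vars))
```

If a stored index is larger than the dataset's own number of variables, as in the traceback (25025 against 25021), the lookup fails before any joined position can be computed.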

Could you please help me check and fix this problem? Attachment: error information.pptx

Koncopd commented 2 weeks ago

Hello @gefujing, thank you for reporting the issue. I will try to figure it out. It would also be very helpful if you could provide a minimal example where it fails, for instance with two or three specific datasets.

gefujing commented 2 weeks ago

Thank you very much! You can get the specific dataset by running artifact_to_remove = artifacts.get(uid="upR31puIm5bp3AC7Xy8m"). The code works well after we remove that dataset with collection.artifacts.remove(artifact_to_remove).

Koncopd commented 2 weeks ago

Thank you, I will check what is going on with the dataset.

Koncopd commented 2 weeks ago

OK, I see now that the dataset stores .X in CSC sparse format, which MappedCollection doesn't support yet. I will track this in https://github.com/laminlabs/lamindb/issues/1873. For now, I will also add a check for CSC matrices.
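For readers hitting the same error, the sketch below shows one way to detect a CSC matrix with SciPy and convert it to CSR before building a collection. It works on an in-memory SciPy matrix, not on lamindb's internals; ensure_csr is a hypothetical helper. It also illustrates one plausible reading of the traceback, which is an assumption and not confirmed in this thread: in CSC the indices array holds row (cell) positions, so reading it as column (gene) positions can produce values beyond the number of genes.

```python
import numpy as np
from scipy import sparse


def ensure_csr(X):
    """Return X in CSR format if it is a SciPy sparse matrix in another format."""
    if sparse.issparse(X) and X.format != "csr":
        return X.tocsr()
    return X


# Toy expression matrix: 5 cells (rows) x 3 genes (columns).
dense = np.array([
    [0, 0, 3],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [4, 0, 5],
])

X_csc = sparse.csc_matrix(dense)
X_csr = ensure_csr(X_csc)

# CSC stores row positions in `indices`; CSR stores column positions.
n_genes = dense.shape[1]
csc_max = int(X_csc.indices.max())  # can reach n_cells - 1
csr_max = int(X_csr.indices.max())  # always below n_genes
print(X_csc.format, csc_max, "vs", n_genes, "genes")
print(X_csr.format, csr_max, "vs", n_genes, "genes")
```

With an AnnData object loaded in memory, the same helper could be applied as adata.X = ensure_csr(adata.X) before saving the artifact, assuming the matrix fits in memory.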