Segmentation fault in mu.neighbor

Feilijiang commented 1 year ago

Describe the bug

Hi, I'm using muon to run a co-embedding of a 280k-cell multiome dataset, submitted to an LSF cluster with 40 CPUs and 300 GB of RAM. The job failed with "Segmentation fault" in error.log and "Exited with exit code 139" in output.log. When I use a subset of 2000 cells, it works fine. Do you know how to fix it? Thank you very much for your help.

Here is my code

import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import muon as mu
from muon import atac as ac
import mudata as md
from mudata import MuData
import os
import bbknn

mdata = mu.read("Cellarchr.h5mu") # 280k multiome dataset
mdata.update()
mu.pp.intersect_obs(mdata)

# Since subsetting was performed after calculating nearest neighbours,
# we have to calculate them again for each modality.
bbknn.bbknn(mdata['rna'], batch_key='brc_code', n_pcs=50, metric='euclidean', trim=200)
bbknn.bbknn(mdata['atac'], batch_key='brc_code', n_pcs=40, metric='euclidean', trim=200)

# Calculate weighted nearest neighbors
mu.pp.neighbors(mdata, key_added='wnn', n_multineighbors=50)
# The call above fails with "Segmentation fault"

System

OS: Linux
Python version: 3.8.13
Versions of libraries involved:

Package          Version
anndata          0.8.0
bbknn            1.5.1
h5py             3.7.0
leidenalg        0.9.0
loompy           3.0.7
louvain          0.7.1
mudata           0.2.1
muon             0.1.2
networkx         2.8.6
notebook         6.4.12
numba            0.55.2
numpy            1.22.4
numpy-groupies   0.9.19
pandas           1.4.4
scanpy           1.9.1
scikit-learn     1.1.2
scikit-misc      0.1.4
scipy            1.9.1
seaborn          0.12.0
sklearn          0.0.post1
tornado          6.2
tqdm             4.64.1
umap-learn       0.5.3

gtca commented 1 year ago

Hey @Feilijiang, thanks for letting us know; we should take a look at the performance on large datasets! It looks like a memory-related issue, but I can hardly say more from this information.

Is this a reproducible issue? Do you know the memory consumption of the mu.pp.neighbors() call?
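
One possible way to get a rough number for the memory consumption is to read the peak resident set size of the Python process before and after the call. The sketch below uses only the standard library; `peak_rss_mib` and `run_and_report` are hypothetical helper names, and the unit conversion assumes Linux, where `ru_maxrss` is reported in KiB.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process in MiB.

    Assumption: Linux, where ru_maxrss is reported in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run_and_report(func, *args, **kwargs):
    """Call func and print the process-wide peak RSS before and after it."""
    before = peak_rss_mib()
    result = func(*args, **kwargs)
    after = peak_rss_mib()
    print(f"peak RSS before: {before:.0f} MiB, after: {after:.0f} MiB")
    return result, after


# Hypothetical usage with the call from the report:
# run_and_report(mu.pp.neighbors, mdata, key_added='wnn', n_multineighbors=50)
```

If the process crashes inside the call, the second line is never printed, so this is mainly useful on subsets that still finish (such as the 2000-cell one) to see how memory grows with the number of cells.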

Feilijiang commented 1 year ago

I'm not sure whether it is reproducible, but I tried it several times from the command line with n_multineighbors ranging from 200 down to 20, and each time it just ended without any notification. Then I submitted the job and it failed with a segmentation fault. This is the memory usage from the LSF job.

Thank you for the help!

gtca commented 1 year ago

Is there any log associated with the segmentation fault that you could share?
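
If no log is available yet, Python's standard-library `faulthandler` module may help produce one: when enabled, it dumps the Python traceback of every thread at the moment a fatal signal such as SIGSEGV arrives. A minimal sketch, where the helper name `enable_crash_traceback` and the log file name are assumptions chosen for illustration:

```python
import faulthandler


def enable_crash_traceback(path="segfault_traceback.log"):
    """Write Python tracebacks to `path` if the process receives a fatal signal.

    The returned file object must stay open (keep a reference to it) for as
    long as faulthandler is enabled.
    """
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage at the top of the job script, before any heavy computation:
# crash_log = enable_crash_traceback()
```

The same handler can also be switched on without changing the script by starting the interpreter with `python -X faulthandler`, in which case the traceback goes to stderr. The traceback only shows which Python-level call was active when the crash happened, not the location inside compiled code.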

gtca commented 5 days ago

Closing as stale

scverse / muon

Segmentation fault in mu.neighbor #86