snap-stanford / stellar

Stellar running forever #13

Open LukasHats opened 1 month ago

LukasHats commented 1 month ago

Thanks for providing STELLAR. I am currently trying to run it on the Hubmap demo dataset on our cluster. Although it is supposed to finish quite quickly, the job has been running for more than 24 hours. I can see that the GPU is in use, but only about 2.3 GiB of its memory (2371 MiB) is occupied and nvidia-smi reports 0% utilization. I am not sure what is wrong. The loss also gets printed.

My environment:

 Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
anndata                   0.7.6                    pypi_0    pypi
blas                      1.0                         mkl  
blosc2                    2.0.0                    pypi_0    pypi
bottleneck                1.3.7            py38ha9d4c09_0  
brotli-python             1.0.9            py38h6a678d5_8  
ca-certificates           2024.9.24            h06a4308_0  
certifi                   2024.8.30        py38h06a4308_0  
charset-normalizer        3.3.2              pyhd3eb1b0_0  
contourpy                 1.1.1                    pypi_0    pypi
cudatoolkit               11.3.1               h2bc3f7f_2  
cycler                    0.12.1                   pypi_0    pypi
cython                    3.0.11                   pypi_0    pypi
fonttools                 4.54.1                   pypi_0    pypi
h5py                      3.11.0                   pypi_0    pypi
idna                      3.7              py38h06a4308_0  
igraph                    0.9.10                   pypi_0    pypi
imageio                   2.35.1                   pypi_0    pypi
importlib-metadata        8.5.0                    pypi_0    pypi
importlib-resources       6.4.5                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306  
jinja2                    3.1.4            py38h06a4308_0  
joblib                    1.4.2            py38h06a4308_0  
kiwisolver                1.4.7                    pypi_0    pypi
ld_impl_linux-64          2.40                 h12ee557_0  
legacy-api-wrap           1.4                      pypi_0    pypi
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgfortran-ng            11.2.0               h00389a5_1  
libgfortran5              11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
libuv                     1.48.0               h5eee18b_0  
llvmlite                  0.41.1                   pypi_0    pypi
louvain                   0.7.1                    pypi_0    pypi
markupsafe                2.1.3            py38h5eee18b_0  
matplotlib                3.6.3                    pypi_0    pypi
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0            py38h5eee18b_1  
mkl_fft                   1.3.8            py38h5eee18b_0  
mkl_random                1.2.4            py38hdb19cb5_0  
msgpack                   1.1.0                    pypi_0    pypi
natsort                   8.4.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
networkx                  3.1              py38h06a4308_0  
ninja                     1.10.2               h06a4308_5  
ninja-base                1.10.2               hd09550d_5  
numba                     0.58.1                   pypi_0    pypi
numexpr                   2.8.4            py38hc78ab66_1  
numpy                     1.22.4                   pypi_0    pypi
openssl                   3.0.15               h5eee18b_0  
packaging                 24.1             py38h06a4308_0  
pandas                    1.3.0                    pypi_0    pypi
patsy                     0.5.6                    pypi_0    pypi
pillow                    10.4.0                   pypi_0    pypi
pip                       24.2             py38h06a4308_0  
platformdirs              3.10.0           py38h06a4308_0  
pooch                     1.7.0            py38h06a4308_0  
py-cpuinfo                9.0.0                    pypi_0    pypi
pyg                       2.0.4           py38_torch_1.10.0_cu113    pyg
pynndescent               0.5.13                   pypi_0    pypi
pyparsing                 3.1.2            py38h06a4308_0  
pysocks                   1.7.1            py38h06a4308_0  
python                    3.8.20               he870216_0  
python-dateutil           2.9.0post0       py38h06a4308_2  
python-louvain            0.1                      pypi_0    pypi
python-tzdata             2023.3             pyhd3eb1b0_0  
pytorch                   1.10.2          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-cluster           1.6.0           py38_torch_1.10.0_cu113    pyg
pytorch-mutex             1.0                        cuda    pytorch
pytorch-scatter           2.0.9           py38_torch_1.10.0_cu113    pyg
pytorch-sparse            0.6.13          py38_torch_1.10.0_cu113    pyg
pytorch-spline-conv       1.2.1           py38_torch_1.10.0_cu113    pyg
pytz                      2024.1           py38h06a4308_0  
pywavelets                1.4.1                    pypi_0    pypi
pyyaml                    6.0.1            py38h5eee18b_0  
readline                  8.2                  h5eee18b_0  
requests                  2.32.3           py38h06a4308_0  
scanpy                    1.8.0                    pypi_0    pypi
scikit-image              0.18.0                   pypi_0    pypi
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.7.0                    pypi_0    pypi
seaborn                   0.13.2                   pypi_0    pypi
setuptools                75.1.0           py38h06a4308_0  
sinfo                     0.3.4                    pypi_0    pypi
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.45.3               h5eee18b_0  
statsmodels               0.14.1                   pypi_0    pypi
stdlib-list               0.10.0                   pypi_0    pypi
tables                    3.8.0                    pypi_0    pypi
tbb                       2021.8.0             hdb19cb5_0  
texttable                 1.7.0                    pypi_0    pypi
threadpoolctl             3.5.0            py38h2f386ee_0  
tifffile                  2023.7.10                pypi_0    pypi
tk                        8.6.14               h39e8969_0  
tqdm                      4.66.5           py38h2f386ee_0  
typing_extensions         4.11.0           py38h06a4308_0  
umap-learn                0.5.6                    pypi_0    pypi
urllib3                   2.2.3            py38h06a4308_0  
wheel                     0.44.0           py38h06a4308_0  
xlrd                      1.2.0                    pypi_0    pypi
xz                        5.4.6                h5eee18b_1  
yacs                      0.1.6              pyhd3eb1b0_1  
yaml                      0.2.5                h7b6447c_0  
zipp                      3.20.2                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1  

My Slurm file:

#!/bin/sh
#SBATCH --job-name="STELLAR_demo_2_241002"
#SBATCH --partition=gpu-single
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=350gb

module load devel/cuda
module load devel/miniconda/3
source $MINICONDA_HOME/etc/profile.d/conda.sh
conda activate stellar

cd /gpfs/bwfor/work/ws/hd_bm327-phenotyping_benchmark/stellar/

conda run -n stellar python STELLAR_run.py --dataset Hubmap --num-heads 23

This is the GPU usage:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0             71W /  400W |    2371MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3981110      C   python                                       2362MiB |
+-----------------------------------------------------------------------------------------+
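
To put those numbers in perspective, here is a small sketch that reads memory use and utilization from the status row above. It is not part of STELLAR; the helper name `parse_smi_row` is made up for illustration, and the row is copied by hand from the table rather than queried from the driver.

```python
import re

# Status row copied verbatim from the nvidia-smi output above.
SMI_ROW = "| N/A   31C    P0             71W /  400W |    2371MiB /  81920MiB |      0%      Default |"


def parse_smi_row(row):
    """Hypothetical helper: return (used MiB, total MiB, utilization %) from one status row."""
    mem = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    util = re.search(r"(\d+)%", row)
    if mem is None or util is None:
        raise ValueError("row does not look like an nvidia-smi status line")
    return int(mem.group(1)), int(mem.group(2)), int(util.group(1))


used_mib, total_mib, util_pct = parse_smi_row(SMI_ROW)
used_gib = used_mib / 1024            # MiB -> GiB
mem_fraction = used_mib / total_mib   # share of the card's memory in use

print(f"memory: {used_gib:.2f} GiB of {total_mib / 1024:.0f} GiB ({mem_fraction:.1%})")
print(f"GPU utilization: {util_pct}%")
```

So the process holds roughly 2.3 GiB, about 3% of the card's memory, and at the moment of this snapshot the GPU was doing no compute. A single snapshot cannot show whether utilization stays at 0% for the whole run.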

I have not changed any of the scripts. Does anyone have any suggestions?