kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0

reproducing the pre-training #26

Closed GengGengJiuXi closed 1 month ago

GengGengJiuXi commented 2 months ago

Hyperparameter groups: [{'weight_decay': 0.0}]
[2024-04-25 11:09:58,048][main][INFO] - Optimizer group 0 | 10 tensors | weight_decay 0.1
[2024-04-25 11:09:58,048][main][INFO] - Optimizer group 1 | 9 tensors | weight_decay 0.0

Sanity Checking: 0it [00:00, ?it/s]
Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]

I'm getting the above error when reproducing the pre-training, what is the reason for it?

GengGengJiuXi commented 2 months ago

Sorry, I'm running the following command:

python -m train \
  experiment=hg38/hg38 \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
  dataset.max_length=1024 \
  dataset.batch_size=1024 \
  dataset.mlm=true \
  dataset.mlm_probability=0.15 \
  dataset.rc_aug=false \
  model=caduceus \
  model.config.d_model=128 \
  model.config.n_layer=4 \
  model.config.bidirectional=true \
  model.config.bidirectional_strategy=add \
  model.config.bidirectional_weight_tie=true \
  model.config.rcps=true \
  optimizer.lr="8e-3" \
  train.global_batch_size=8 \
  trainer.max_steps=10000 \
  +trainer.val_check_interval=10000 \
  wandb=null

If you need more error information, please let me know. Thanks.

yair-schiff commented 2 months ago

Can you perhaps provide a bit more information? I am not sure I see what error you are referring to above.

GengGengJiuXi commented 2 months ago

The error file:

Error executing job with overrides: ['experiment=hg38/hg38', 'callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500', 'dataset.max_length=1024', 'dataset.batch_size=1024', 'dataset.mlm=true', 'dataset.mlm_probability=0.15', 'dataset.rc_aug=false', 'model=caduceus', 'model.config.d_model=64', 'model.config.n_layer=1', 'model.config.bidirectional=true', 'model.config.bidirectional_strategy=add', 'model.config.bidirectional_weight_tie=true', 'model.config.rcps=true', 'optimizer.lr=8e-3', 'train.global_batch_size=8', 'trainer.max_steps=10000', '+trainer.val_check_interval=100', 'wandb=null']

The out file:

[2024-04-25 10:39:41,224][src.dataloaders.genomics][INFO] - HG38Using Char-level tokenizer finish self.tokenizer sta init_datasets
Hyperparameter groups: [{'weight_decay': 0.0}]
[2024-04-25 10:39:43,731][main][INFO] - Optimizer group 0 | 10 tensors | weight_decay 0.1
[2024-04-25 10:39:43,731][main][INFO] - Optimizer group 1 | 9 tensors | weight_decay 0.0

Sanity Checking: 0it [00:00, ?it/s]
Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
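Note that the overrides logged in the error file differ from the command posted earlier in three places (model.config.d_model, model.config.n_layer and +trainer.val_check_interval), so the two outputs may come from different runs. A minimal sketch for comparing two override lists is below; the helper names parse_overrides and diff_overrides are hypothetical and are not part of Hydra or of this repo, and only the differing entries plus one shared entry are copied in for brevity.

```python
def parse_overrides(overrides):
    """Map each 'key=value' override string to its value (a leading '+' is dropped)."""
    parsed = {}
    for item in overrides:
        key, _, value = item.partition("=")
        parsed[key.lstrip("+")] = value
    return parsed


def diff_overrides(first, second):
    """Return {key: (value_in_first, value_in_second)} for keys whose values differ."""
    a, b = parse_overrides(first), parse_overrides(second)
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Values taken from the command and from the logged overrides in this thread.
posted = [
    "model.config.d_model=128",
    "model.config.n_layer=4",
    "+trainer.val_check_interval=10000",
    "wandb=null",
]
logged = [
    "model.config.d_model=64",
    "model.config.n_layer=1",
    "+trainer.val_check_interval=100",
    "wandb=null",
]

if __name__ == "__main__":
    for key, (old, new) in diff_overrides(posted, logged).items():
        print(f"{key}: {old} -> {new}")
```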

GengGengJiuXi commented 2 months ago

OSError: [Errno 8] Exec format error: '.conda/envs/caduceus/lib/python3.10/site-packages/triton/compiler/../third_party/cuda/bin/ptxas'
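Errno 8 (ENOEXEC) means the operating system could not execute the file in the format it has. One common cause is a binary built for a different CPU architecture than the host, although that is not confirmed here. The sketch below is one way to check this, assuming the file is an ELF binary: it reads the e_machine field from the ELF header and compares it with platform.machine(). The helper names and the two-entry machine table are illustrative only; pass the full path to the ptxas file from the error message as an argument.

```python
import platform
import struct
import sys

# e_machine codes from the ELF specification; only two common ones are listed.
ELF_MACHINES = {62: "x86_64", 183: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the architecture named in an ELF header (needs the first 20 bytes)."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not-elf"
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine})")


def binary_machine(path: str) -> str:
    """Read the header of the file at `path` and report its target architecture."""
    with open(path, "rb") as fh:
        return elf_machine(fh.read(20))


if __name__ == "__main__":
    host = platform.machine()
    for path in sys.argv[1:]:
        target = binary_machine(path)
        verdict = "ok" if target == host else "MISMATCH"
        print(f"{path}: built for {target}, host is {host} -> {verdict}")
```

If the script reports a mismatch, the bundled binary cannot run on this machine no matter how the training job is configured.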

GengGengJiuXi commented 2 months ago

And if I use run_pretrain_caduceus.sh, this is the output and error:

config:
  _target_: caduceus.configuration_caduceus.CaduceusConfig
  d_model: 256
  n_layer: 8
  vocab_size: 12
  ssm_cfg:
    d_state: 16
    d_conv: 4
    expand: 2
    dt_rank: auto
    dt_min: 0.001
    dt_max: 0.1
    dt_init: random
    dt_scale: 1.0
    dt_init_floor: 0.0001
    conv_bias: true
    bias: false
    use_fast_path: true
  rms_norm: true
  fused_add_norm: true
  residual_in_fp32: false
  pad_vocab_size_multiple: 8
  norm_epsilon: 1.0e-05
  initializer_cfg:
    initializer_range: 0.02
    rescale_prenorm_residual: true
    n_residuals_per_layer: 1
  bidirectional: true
  bidirectional_strategy: add
  bidirectional_weight_tie: true
  rcps: true
  complement_map: null

[2024-04-24 21:34:23,794][main][WARNING] - Sleeping for 36 seconds
[2024-04-24 21:34:59,853][main][WARNING] - Sleeping for 60 seconds
[2024-04-24 21:35:59,939][main][WARNING] - Sleeping for 38 seconds

yair-schiff commented 2 months ago

Are you torch.compile-ing the model? Do you know why the triton benchmarking code is being triggered? I do not think I see this in my logs.

GengGengJiuXi commented 2 months ago

Here are the packages installed in my environment. I have installed triton-2.1.0, but my system architecture is aarch64. I'm not sure if that's the reason?

Name  Version  Build  Channel

_libgcc_mutex  0.1  main
_openmp_mutex  5.1  51_gnu
abseil-cpp  20211102.0  h22f4aa5_0
accelerate  0.29.3  pypi_0  pypi
aiohttp  3.9.3  py310h998d150_0
aiosignal  1.2.0  pyhd3eb1b0_0
antlr-python-runtime  4.9.3  pyhd8ed1ab_1  conda-forge
anyio  4.2.0  py310hd43f75c_0
appdirs  1.4.4  pyhd3eb1b0_0
argon2-cffi  21.3.0  pyhd3eb1b0_0
argon2-cffi-bindings  21.2.0  py310h2f4d8fa_0
arrow-cpp  14.0.2  h001d45f_1
asttokens  2.0.5  pyhd3eb1b0_0
async-lru  2.0.4  py310hd43f75c_0
async-timeout  4.0.3  py310hd43f75c_0
attrs  23.1.0  py310hd43f75c_0
aws-c-auth  0.6.19  h998d150_0
aws-c-cal  0.5.20  h6ac735f_0
aws-c-common  0.8.5  h998d150_0
aws-c-compression  0.2.16  h998d150_0
aws-c-event-stream  0.2.15  h419075a_0
aws-c-http  0.6.25  h998d150_0
aws-c-io  0.13.10  h998d150_0
aws-c-mqtt  0.7.13  h998d150_0
aws-c-s3  0.1.51  h6ac735f_0
aws-c-sdkutils  0.1.6  h998d150_0
aws-checksums  0.1.13  h998d150_0
aws-crt-cpp  0.18.16  h419075a_0
aws-sdk-cpp  1.10.55  h3140d82_0
babel  2.11.0  py310hd43f75c_0
beautifulsoup4  4.12.2  py310hd43f75c_0
biopython  1.79  py310h7cee911_1  conda-forge
blas  1.0  openblas
bleach  4.1.0  pyhd3eb1b0_0
boost-cpp  1.82.0  hb8fdbf2_2
bottleneck  1.3.7  py310hf6ef57e_0
brotli  1.0.9  h998d150_7
brotli-bin  1.0.9  h998d150_7
brotli-python  1.0.9  py310h419075a_7
bzip2  1.0.8  h998d150_5
c-ares  1.19.1  h998d150_0
ca-certificates  2024.3.11  hd43f75c_0
cached-property  1.5.2  py_0
cachetools  4.2.2  pyhd3eb1b0_0
causal-conv1d  1.2.0.post2  pypi_0  pypi
certifi  2024.2.2  py310hd43f75c_0
cffi  1.16.0  py310h998d150_0
charset-normalizer  2.0.4  pyhd3eb1b0_0
click  8.1.7  py310hd43f75c_0
colorama  0.4.6  py310hd43f75c_0
comm  0.2.1  py310hd43f75c_0
contourpy  1.2.0  py310hb8fdbf2_0
cycler  0.11.0  pyhd3eb1b0_0
datasets  2.12.0  py310hd43f75c_0  anaconda
debugpy  1.6.7  py310h419075a_0
decorator  5.1.1  pyhd3eb1b0_0
defusedxml  0.7.1  pyhd3eb1b0_0
dill  0.3.6  py310hd43f75c_0
discrete-key-value-bottleneck-pytorch  0.1.1  pypi_0  pypi
docker-pycreds  0.4.0  pyhd3eb1b0_0
einops  0.7.0  pyhd8ed1ab_1  conda-forge
einx  0.2.1  pypi_0  pypi
enformer-pytorch  0.8.8  pypi_0  pypi
exceptiongroup  1.2.0  py310hd43f75c_0
executing  0.8.3  pyhd3eb1b0_0
filelock  3.13.1  py310hd43f75c_0
flash-attn  2.5.7  pypi_0  pypi
fonttools  4.51.0  py310h998d150_0
freetype  2.12.1  h6df46f4_0
frozendict  2.4.2  pypi_0  pypi
frozenlist  1.4.0  py310h998d150_0
fsspec  2023.9.2  py310hd43f75c_0  anaconda
future  0.18.3  py310hd43f75c_0
gdown  5.1.0  pypi_0  pypi
genomic-benchmarks  0.0.9  pypi_0  pypi
gflags  2.2.2  h419075a_1
git-lfs  3.5.1  h8af1aa0_0  conda-forge
gitdb  4.0.7  pyhd3eb1b0_0
gitpython  3.1.37  py310hd43f75c_0
glog  0.5.0  h419075a_1
grpc-cpp  1.48.2  hdefc9b7_1
h11  0.14.0  py310hd43f75c_0
h5py  3.11.0  nompi_py310h7a20aa2_100  conda-forge
hdf5  1.14.3  nompi_ha486f32_100  conda-forge
httpcore  1.0.2  py310hd43f75c_0
httpx  0.26.0  py310hd43f75c_0
huggingface-hub  0.19.4  pypi_0  pypi
huggingface_hub  0.20.3  py310hd43f75c_0
hydra-core  1.3.2  pyhd8ed1ab_0  conda-forge
icu  73.1  h419075a_0
idna  3.4  py310hd43f75c_0
importlib-metadata  7.0.1  py310hd43f75c_0
importlib_metadata  7.0.1  hd3eb1b0_0
importlib_resources  6.1.1  py310hd43f75c_1
iniconfig  1.1.1  pyhd3eb1b0_0
ipdb  0.13.13  pyhd8ed1ab_0  conda-forge
ipykernel  6.28.0  py310hd43f75c_0
ipython  8.20.0  py310hd43f75c_0
jedi  0.18.1  py310hd43f75c_1
jinja2  3.1.3  py310hd43f75c_0
joblib  1.2.0  py310hd43f75c_0
jpeg  9e  h998d150_1
json5  0.9.6  pyhd3eb1b0_0
jsonschema  4.19.2  py310hd43f75c_0
jsonschema-specifications  2023.7.1  py310hd43f75c_0
jupyter-lsp  2.2.0  py310hd43f75c_0
jupyter_client  8.6.0  py310hd43f75c_0
jupyter_core  5.5.0  py310hd43f75c_0
jupyter_events  0.8.0  py310hd43f75c_0
jupyter_server  2.10.0  py310hd43f75c_0
jupyter_server_terminals  0.4.4  py310hd43f75c_1
jupyterlab  4.1.6  pyhd8ed1ab_0  conda-forge
jupyterlab_pygments  0.1.2  py_0
jupyterlab_server  2.25.1  py310hd43f75c_0
kiwisolver  1.4.4  py310h419075a_0
krb5  1.20.1  h2e2fba8_1
lcms2  2.12  h5246980_0
ld_impl_linux-aarch64  2.38  h8131f2d_1
lerc  3.0  h22f4aa5_0
libaec  1.1.3  h2f0025b_0  conda-forge
libboost  1.82.0  hda0696e_2
libbrotlicommon  1.0.9  h998d150_7
libbrotlidec  1.0.9  h998d150_7
libbrotlienc  1.0.9  h998d150_7
libcurl  8.5.0  hfa2bbb0_0
libdeflate  1.17  h998d150_1
libedit  3.1.20230828  h998d150_0
libev  4.33  hfd63f10_1
libevent  2.1.12  h6ac735f_1
libffi  3.4.4  h419075a_0
libgcc-ng  13.2.0  hf8544c7_5  conda-forge
libgfortran-ng  13.2.0  he9431aa_5  conda-forge
libgfortran5  13.2.0  h582850c_5  conda-forge
libgomp  13.2.0  hf8544c7_5  conda-forge
libnghttp2  1.57.0  hb788212_0
libnsl  2.0.1  h31becfc_0  conda-forge
libopenblas  0.3.21  hc2e42e2_0
libpng  1.6.39  h998d150_0
libprotobuf  3.20.3  h94b7715_0
libsodium  1.0.18  hfd63f10_0
libsqlite  3.45.3  h194ca79_0  conda-forge
libssh2  1.10.0  h6ac735f_2
libstdcxx-ng  13.2.0  h9a76618_5  conda-forge
libthrift  0.15.0  hb2e9abc_2
libtiff  4.5.1  h419075a_0
libuuid  2.38.1  hb4cce97_0  conda-forge
libwebp-base  1.3.2  h998d150_0
libxcrypt  4.4.36  h31becfc_1  conda-forge
libzlib  1.2.13  h31becfc_5  conda-forge
lightning-utilities  0.9.0  py310hd43f75c_0
lz4-c  1.9.4  h419075a_0
mamba-ssm  1.2.0.post1  pypi_0  pypi
markdown-it-py  2.2.0  py310hd43f75c_1
markupsafe  2.1.3  py310h998d150_0
matplotlib  3.8.4  py310hbbe02a8_0  conda-forge
matplotlib-base  3.8.4  py310hfb1e5ee_0
matplotlib-inline  0.1.6  py310hd43f75c_0
mdurl  0.1.0  py310hd43f75c_0
mistune  2.0.4  py310hd43f75c_0
mpmath  1.3.0  pypi_0  pypi
multidict  6.0.4  py310h998d150_0
multiprocess  0.70.14  py310hd43f75c_0  anaconda
nbclient  0.8.0  py310hd43f75c_0
nbconvert  7.10.0  py310hd43f75c_0
nbformat  5.9.2  py310hd43f75c_0
ncurses  6.4.20240210  h0425590_0  conda-forge
nest-asyncio  1.6.0  py310hd43f75c_0
networkx  3.3  pypi_0  pypi
ninja  1.11.1.1  pypi_0  pypi
ninja-base  1.10.2  h59a28a9_5
notebook  7.1.3  pyhd8ed1ab_0  conda-forge
notebook-shim  0.2.3  py310hd43f75c_0
numexpr  2.8.7  py310hbc6faf5_0
numpy  1.26.4  py310he45c16d_0
numpy-base  1.26.4  py310h15d264d_0
nvidia-ml-py  12.535.133  py310hd43f75c_0
nvitop  1.3.2  py310h4c7bcd0_0  conda-forge
omegaconf  2.3.0  pyhd8ed1ab_0  conda-forge
openjpeg  2.4.0  hf3eb033_0
openssl  3.2.1  h31becfc_1  conda-forge
orc  1.7.4  h7ed1058_1
overrides  7.4.0  py310hd43f75c_0
packaging  23.2  py310hd43f75c_0
pandas  2.2.2  py310hf9cab1f_0  conda-forge
pandocfilters  1.5.0  pyhd3eb1b0_0
parso  0.8.3  pyhd3eb1b0_0
pathtools  0.1.2  pyhd3eb1b0_1
patsy  0.5.3  py310hd43f75c_0
pexpect  4.8.0  pyhd3eb1b0_3
pillow  10.2.0  py310h998d150_0
pip  23.3.1  py310hd43f75c_0
platformdirs  3.10.0  py310hd43f75c_0
pluggy  1.5.0  pyhd8ed1ab_0  conda-forge
polars  0.20.22  pypi_0  pypi
portalocker  2.3.0  py310hd43f75c_1
prometheus_client  0.14.1  py310hd43f75c_0
prompt-toolkit  3.0.43  py310hd43f75c_0
prompt_toolkit  3.0.43  hd3eb1b0_0
protobuf  3.20.3  py310h419075a_0
psutil  5.9.0  py310h998d150_0
ptyprocess  0.7.0  pyhd3eb1b0_2
pure_eval  0.2.2  pyhd3eb1b0_0
pyarrow  14.0.2  py310hcc88a3e_0
pycparser  2.21  pyhd3eb1b0_0
pyfaidx  0.8.1.1  pyhdfd78af_0  bioconda
pygments  2.15.1  py310hd43f75c_1
pyparsing  3.0.9  py310hd43f75c_0
pysocks  1.7.1  py310hd43f75c_0
pytest  8.1.1  pyhd8ed1ab_0  conda-forge
python  3.10.14  hbbe8eec_0_cpython  conda-forge
python-dateutil  2.8.2  pyhd3eb1b0_0
python-fastjsonschema  2.16.2  py310hd43f75c_0
python-json-logger  2.0.7  py310hd43f75c_0
python-tzdata  2023.3  pyhd3eb1b0_0
python-xxhash  2.0.2  py310h998d150_1
python_abi  3.10  2_cp310  conda-forge
pytorch-lightning  1.9.0  pyhd3eb1b0_1  forklift
pytz  2023.3.post1  py310hd43f75c_0
pyvcf3  1.0.3  pyhdfd78af_0  bioconda
pyyaml  6.0.1  py310h998d150_0
pyzmq  25.1.2  py310h419075a_0
re2  2022.04.01  h22f4aa5_0
readline  8.2  h998d150_0
redis-py  5.0.4  pyhd8ed1ab_0  conda-forge
referencing  0.30.2  py310hd43f75c_0
regex  2023.10.3  py310h998d150_0
requests  2.31.0  py310hd43f75c_1
responses  0.13.3  pyhd3eb1b0_0
rfc3339-validator  0.1.4  py310hd43f75c_0
rfc3986-validator  0.1.1  py310hd43f75c_0
rich  13.7.1  pyhd8ed1ab_0  conda-forge
rpds-py  0.10.6  py310h7f3cb11_0
s2n  1.3.27  h6ac735f_0
safetensors  0.4.2  py310hdd6b545_0
scikit-learn  1.4.2  py310hc266c7b_0  conda-forge
scipy  1.12.0  py310he45c16d_0
seaborn  0.13.2  hd8ed1ab_0  conda-forge
seaborn-base  0.13.2  pyhd8ed1ab_0  conda-forge
send2trash  1.8.2  py310hd43f75c_0
sentry-sdk  1.9.0  py310hd43f75c_0
setproctitle  1.2.2  py310h2f4d8fa_0
setuptools  68.2.2  py310hd43f75c_0
six  1.16.0  pyhd3eb1b0_1
smmap  4.0.0  pyhd3eb1b0_0
snappy  1.1.10  h419075a_1
sniffio  1.3.0  py310hd43f75c_0
soupsieve  2.5  py310hd43f75c_0
sqlite  3.41.2  h998d150_0
stack_data  0.2.0  pyhd3eb1b0_0
statsmodels  0.14.0  py310hf6ef57e_0
sympy  1.12  pypi_0  pypi
termcolor  2.1.0  py310hd43f75c_0
terminado  0.17.1  py310hd43f75c_0
threadpoolctl  2.2.0  pyh0d69192_0
timm  0.9.16  pyhd8ed1ab_0  conda-forge
tinycss2  1.2.1  py310hd43f75c_0
tk  8.6.13  h194ca79_0  conda-forge
tokenizers  0.15.1  py310hb4c1b22_0
toml  0.10.2  pyhd3eb1b0_0
tomli  2.0.1  py310hd43f75c_0
torch  2.0.1+cu118  pypi_0  pypi
torchaudio  2.0.2+cu118  pypi_0  pypi
torchdata  0.5.1  pyh2db4395_0  conda-forge
torchmetrics  1.3.2  pyhd8ed1ab_0  conda-forge
torchtext  0.17.0a0+f3b7a01  pypi_0  pypi
torchvision  0.15.2+cu118  pypi_0  pypi
tornado  6.3.3  py310h998d150_0
tqdm  4.66.2  pyhd8ed1ab_0  conda-forge
traitlets  5.7.1  py310hd43f75c_0
transformers  4.39.3  pyhd8ed1ab_0  conda-forge
triton  2.1.0  pypi_0  pypi
typing-extensions  4.9.0  py310hd43f75c_1
typing_extensions  4.9.0  py310hd43f75c_1
tzdata  2024a  h04d1e81_0
unicodedata2  15.1.0  py310h998d150_0
urllib3  2.1.0  py310hd43f75c_1
utf8proc  2.6.1  h998d150_1
vector-quantize-pytorch  1.14.7  pypi_0  pypi
wandb  0.13.10  pyhd3eb1b0_0  forklift
wcwidth  0.2.5  pyhd3eb1b0_0
webencodings  0.5.1  py310hd43f75c_1
websocket-client  0.58.0  py310hd43f75c_4
wheel  0.41.2  py310hd43f75c_0
xxhash  0.8.0  h2f4d8fa_3
xz  5.4.6  h998d150_0
yaml  0.2.5  hfd63f10_0
yarl  1.9.3  py310h998d150_0
zeromq  4.3.5  h419075a_0
zipp  3.17.0  py310hd43f75c_0
zlib  1.2.13  h31becfc_5  conda-forge
zstd  1.5.5  h6a09583_0
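For a report like this, a short summary of the host architecture and the few packages discussed in the thread is often easier to read than the full listing. The following is a minimal sketch using only the Python standard library; env_report and the PACKAGES tuple are hypothetical names chosen for this example, and packages installed without Python metadata may show up as "not installed".

```python
import platform
from importlib import metadata

# Packages discussed in this thread; extend the tuple as needed.
PACKAGES = ("torch", "triton", "mamba-ssm", "causal-conv1d")


def env_report(packages=PACKAGES) -> dict:
    """Collect the CPU architecture, Python version and selected package versions."""
    report = {
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in env_report().items():
        print(f"{key}: {value}")
```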

yair-schiff commented 1 month ago

Apologies, but I am not sure what is causing your issue. Perhaps try a fresh env created using the yaml file in this repo?