jzhang38 / EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
Apache License 2.0

Which image is used for this job? #6

Open AatroxZZ opened 7 months ago

AatroxZZ commented 7 months ago

I want to ask which image is used for this job. I can't run train.sh after completing the installation in the pytorch:23.06 image, following the steps in the installation instructions.

jzhang38 commented 7 months ago

What is your error?

AatroxZZ commented 7 months ago

What is your error?

Traceback (most recent call last):
  File "/mnt/data/users/zxb/EasyContext/train.py", line 11, in <module>
    from transformers import LlamaForCausalLM
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/root/anaconda3/envs/easycontext/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1463, in __getattr__
    value = getattr(module, name)
  File "/root/anaconda3/envs/easycontext/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1462, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/root/anaconda3/envs/easycontext/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1474, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/root/anaconda3/envs/easycontext/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
E0408 05:34:59.302856 140281121617728 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 144153) of binary: /root/anaconda3/envs/easycontext/bin/python

AatroxZZ commented 7 months ago

The package versions are:


accelerate 0.28.0
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.2.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
contourpy 1.2.1
cycler 0.12.1
datasets 2.18.0
deepspeed 0.14.0
dill 0.3.8
docker-pycreds 0.4.0
einops 0.7.0
evaluate 0.4.1
filelock 3.13.1
flash-attn 2.5.6
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.43
hjson 3.1.0
huggingface-hub 0.22.2
idna 3.6
Jinja2 3.1.3
joblib 1.3.2
kiwisolver 1.4.5
MarkupSafe 2.1.3
matplotlib 3.8.4
mpmath 1.2.1
multidict 6.0.5
multiprocess 0.70.16
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.2.1
pillow 10.3.0
pip 23.3.1
protobuf 4.25.3
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pydantic 2.6.4
pydantic_core 2.16.3
pynvml 11.5.0
pyparsing 3.1.2
python-dateutil 2.9.0.post0
pytorch-triton 3.0.0+989adb9a29
pytz 2024.1
PyYAML 6.0.1
quanto 0.1.0
regex 2023.12.25
requests 2.31.0
responses 0.18.0
ring-flash-attn 0.1
safetensors 0.4.2
scikit-learn 1.4.1.post1
scipy 1.13.0
seaborn 0.13.2
sentencepiece 0.2.0
sentry-sdk 1.44.1
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
sympy 1.12
threadpoolctl 3.4.0
tokenizers 0.15.2
torch 2.4.0.dev20240324+cu121
tqdm 4.66.2
transformers 4.39.1
typing_extensions 4.8.0
tzdata 2024.1
urllib3 2.2.1
wandb 0.16.6
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4

jzhang38 commented 7 months ago

Your flash attention is not installed correctly. You can try to compile it from source:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
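
The undefined symbol points at a C++ ABI mismatch: the prebuilt flash-attn wheel was compiled against a different torch than the one in your environment, and since you are on a torch nightly (2.4.0.dev) a prebuilt wheel is unlikely to match. A quick way to inspect what you actually have (illustrative commands, adjust to your environment):

python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import flash_attn; print(flash_attn.__version__)"

If you prefer pip over setup.py, the flash-attention README also suggests building locally with:

pip install flash-attn --no-build-isolation
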
AatroxZZ commented 7 months ago

Your flash attention is not installed correctly. You can try to compile it from source:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install

Still not working...

AatroxZZ commented 7 months ago

Your flash attention is not installed correctly. You can try to compile it from source:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install

Still not working...

Could this be a version problem? Which flash-attn version are you using? I have tried many combinations of torch and flash-attn versions, but they all failed, and compiling from source still leads to the error above.

jzhang38 commented 7 months ago

Mine is 2.5.6.

I have tried many combinations of torch and flash-attn versions, but they all failed, and compiling from source still leads to the error above.

I'm not sure then; I didn't use Docker.
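
If you want to reproduce my versions, pinning the release should work (assuming the v2.5.6 tag exists on the flash-attention repo):

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
git checkout v2.5.6
python setup.py install

or, via pip:

pip install flash-attn==2.5.6 --no-build-isolation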

super-buster commented 7 months ago

Your flash attention is not installed correctly. You can try to compile it from source:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install

Still not working...

Could this be a version problem? Which flash-attn version are you using? I have tried many combinations of torch and flash-attn versions, but they all failed, and compiling from source still leads to the error above.

You can refer to my dependencies:

accelerate 0.29.1
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.3.0
appdirs 1.4.4
archspec 0.2.3
async-timeout 4.0.3
attrs 23.2.0
beautifulsoup4 4.12.3
boltons 23.1.1
Brotli 1.1.0
cachetools 5.3.3
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
conda 24.1.2
conda-libmamba-solver 24.1.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
contourpy 1.2.1
cycler 0.12.1
datasets 2.17.1.dev0
deepspeed 0.14.0
dill 0.3.8
distro 1.9.0
docker-pycreds 0.4.0
einops 0.7.0
evaluate 0.4.1
exceptiongroup 1.2.0
fastapi 0.110.1
filelock 3.13.1
flash_attn 2.5.6
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.43
h11 0.14.0
hjson 3.1.0
httptools 0.6.1
huggingface-hub 0.22.2
idna 3.6
iniconfig 2.0.0
Jinja2 3.1.3
joblib 1.3.2
jsonpatch 1.33
jsonpointer 2.4
kiwisolver 1.4.5
libmambapy 1.5.7
loguru 0.7.2
mamba 1.5.7
MarkupSafe 2.1.3
matplotlib 3.8.4
menuinst 2.0.2
mpmath 1.2.1
multidict 6.0.5
multiprocess 0.70.16
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.535.133
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
nvitop 1.3.2
packaging 24.0
pandas 2.2.1
pillow 10.3.0
pip 24.0
platformdirs 4.2.0
pluggy 1.4.0
protobuf 4.25.3
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pycosat 0.6.6
pycparser 2.21
pydantic 2.6.4
pydantic_core 2.16.3
pynvml 11.5.0
pyparsing 3.1.2
PySocks 1.7.1
pytest 8.1.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytorch-triton 3.0.0+989adb9a29
pytz 2024.1
PyYAML 6.0.1
quanto 0.1.0
regex 2023.12.25
requests 2.31.0
responses 0.18.0
ring_flash_attn 0.1
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
safetensors 0.4.2
scikit-learn 1.4.1.post1
scipy 1.13.0
seaborn 0.13.2
sentencepiece 0.2.0
sentry-sdk 1.44.1
setproctitle 1.3.3
setuptools 69.2.0
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
soupsieve 2.5
starlette 0.37.2
sympy 1.12
termcolor 2.4.0
threadpoolctl 3.4.0
tokenizers 0.15.2
tomli 2.0.1
torch 2.4.0.dev20240324+cu121
tqdm 4.66.2
transformers 4.39.1
truststore 0.8.0
typing_extensions 4.8.0
tzdata 2024.1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
wandb 0.16.6
watchfiles 0.21.0
websockets 12.0
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
zstandard 0.22.0
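
To see where your environment diverges from this one, a simple check is to diff the two lists (illustrative; working.txt is a hypothetical file holding the list above):

pip list > mine.txt
diff mine.txt working.txt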


AatroxZZ commented 7 months ago

Thanks, I can train with the image pytorch:22.12-py3. I think the CUDA version (11.8) needs to correspond to the torch version (2.4.0.dev20240324+cu118).
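
For later readers: the +cu suffix of the torch build has to line up with the CUDA toolkit that flash-attn is compiled against. A quick way to verify inside the container (illustrative commands):

nvcc --version                                        # CUDA toolkit used to compile flash-attn, e.g. 11.8
python -c "import torch; print(torch.__version__)"    # build suffix should match, e.g. +cu118
python -c "import torch; print(torch.version.cuda)"   # CUDA version torch was built against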