microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Unable to setup Kosmos-2 with Docker #1241

Closed OrianeN closed 1 year ago

OrianeN commented 1 year ago

I've been trying to set up Kosmos-2 as described in https://github.com/microsoft/unilm/tree/master/kosmos-2#setup, but it seems like dependency conflicts are preventing a successful installation.

I've created a small Dockerfile (although the first time I also tried the docker run command given in the README):

# This Dockerfile is meant to reproduce the recommended installation for kosmos-2
FROM nvcr.io/nvidia/pytorch:22.10-py3

ENV PACKAGES="wget"

RUN apt-get update && apt-get install -q -y ${PACKAGES}
RUN python -m pip install --upgrade pip setuptools

RUN git clone https://github.com/microsoft/unilm.git
WORKDIR /workspace/unilm/kosmos-2
RUN bash vl_setup_xl.sh

My build command was nohup docker build -t kosmos2_img . &> docker_build_kosmos2.log &

Yet in both cases I see the following dependency conflicts at the end of the bash vl_setup_xl.sh script (whether it is run inside the container or during the build):

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.2 requires llvmlite<0.40,>=0.39.0dev0, but you have llvmlite 0.36.0 which is incompatible.
numba 0.56.2 requires setuptools<60, but you have setuptools 68.0.0 which is incompatible.
onnx 1.12.0 requires protobuf<=3.20.1,>=3.12.2, but you have protobuf 3.20.3 which is incompatible.
scipy 1.6.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.23.0 which is incompatible.
tensorboard 2.10.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
Successfully installed aiofiles-23.1.0 aiohttp-3.8.5 aiosignal-1.3.1 altair-5.0.1 async-timeout-4.0.2 confection-0.1.1 fastapi-0.101.0 ffmpy-0.3.1 frozenlist-1.4.0 gradio-3.37.0 gradio-client-0.3.0 h11-0.14.0 httpcore-0.17.3 httpx-0.24.1 huggingface-hub-0.16.4 linkify-it-py-1.0.3 multidict-6.0.4 numpy-1.23.0 orjson-3.9.3 pathy-0.10.2 pydantic-1.10.11 pydub-0.25.1 python-multipart-0.0.6 semantic-version-2.10.0 sentencepiece-0.1.99 spacy-3.6.0 spacy-legacy-3.0.12 srsly-2.4.7 starlette-0.27.0 thinc-8.1.10 tiktoken-0.4.0 typing-extensions-4.7.1 uc-micro-py-1.0.2 uvicorn-0.23.2 websockets-11.0.3 yarl-1.9.2
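For what it's worth, the same incompatibilities can be re-listed at any point inside the container with pip's built-in checker (a standard pip subcommand, independent of the setup script):

# List installed packages whose declared requirements are not satisfied
pip check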

I still tried to launch the Gradio demo with bash run_gradio.sh inside the created container, but I get the following error:

$ bash run_gradio.sh
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:root:Pytorch pre-release version 1.13.0a0+d0d6b1f - assuming intent to test it
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/xformers/ops/fmha/triton.py", line 17, in <module>
    from flash_attn.flash_attn_triton import (
ModuleNotFoundError: No module named 'flash_attn'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo/gradio_app.py", line 12, in <module>
    import unilm
  File "/workspace/unilm/kosmos-2/./unilm/__init__.py", line 1, in <module>
    import unilm.models
  File "/workspace/unilm/kosmos-2/./unilm/models/__init__.py", line 6, in <module>
    import_models(models_dir, "unilm.models")
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/__init__.py", line 217, in import_models
    importlib.import_module(namespace + "." + model_name)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/workspace/unilm/kosmos-2/./unilm/models/unigpt.py", line 37, in <module>
    from unilm.models.gpt import GPTmodel, GPTModelConfig
  File "/workspace/unilm/kosmos-2/./unilm/models/gpt.py", line 39, in <module>
    from torchscale.architecture.decoder import Decoder
  File "/opt/conda/lib/python3.8/site-packages/torchscale/architecture/decoder.py", line 12, in <module>
    from torchscale.architecture.utils import init_bert_params
  File "/opt/conda/lib/python3.8/site-packages/torchscale/architecture/utils.py", line 6, in <module>
    from torchscale.component.multihead_attention import MultiheadAttention
  File "/opt/conda/lib/python3.8/site-packages/torchscale/component/multihead_attention.py", line 12, in <module>
    from xformers.ops import memory_efficient_attention, LowerTriangularMask, MemoryEfficientAttentionCutlassOp
  File "/opt/conda/lib/python3.8/site-packages/xformers/ops/__init__.py", line 8, in <module>
    from .fmha import (
  File "/opt/conda/lib/python3.8/site-packages/xformers/ops/fmha/__init__.py", line 10, in <module>
    from . import cutlass, decoder, flash, small_k, triton
  File "/opt/conda/lib/python3.8/site-packages/xformers/ops/fmha/triton.py", line 39, in <module>
    flash_attn = import_module_from_path(
  File "/opt/conda/lib/python3.8/site-packages/xformers/ops/fmha/triton.py", line 36, in import_module_from_path
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 839, in exec_module
  File "<frozen importlib._bootstrap_external>", line 975, in get_code
  File "<frozen importlib._bootstrap_external>", line 1032, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/unilm/kosmos-2/third_party/flash-attention/flash_attn/flash_attn_triton.py'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 692) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
demo/gradio_app.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_08:31:21
  host      : e01a31a3a92c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 692)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I have tried with and without setting the --privileged argument in the docker run command - by the way, I don't understand why such an insecure flag would be necessary for Kosmos-2.

I'm running Docker on Ubuntu 18.04, Docker version 20.10.24, build 297e128.

OrianeN commented 1 year ago

(Edited) Running pip install flash_attn inside the created container solved the issue, so I'm going to close this issue.
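For reference, a minimal sketch of how the workaround could be folded back into the Dockerfile above (untested as a full rebuild; the flash_attn wheel is left unpinned here and may need to match the pre-installed torch 1.13.0a0 in the base image):

# Original Dockerfile plus the flash_attn workaround (sketch, not rebuilt end to end)
FROM nvcr.io/nvidia/pytorch:22.10-py3

ENV PACKAGES="wget"

RUN apt-get update && apt-get install -q -y ${PACKAGES}
RUN python -m pip install --upgrade pip setuptools

RUN git clone https://github.com/microsoft/unilm.git
WORKDIR /workspace/unilm/kosmos-2
RUN bash vl_setup_xl.sh
# Workaround for the missing flash_attn module
RUN pip install flash_attn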

BrainWWW commented 1 year ago

> Running apt install flash_attn inside the created container solved the issue, so I'm going to close this issue.

Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package flash_attn

Why can't I find the package 'flash_attn', even after updating the sources with apt-get update?

pengzhiliang commented 1 year ago

Hi @BrainWWW, thanks for your interest.

I have been running the following Dockerfile on the HF Space:

FROM nvcr.io/nvidia/pytorch:22.10-py3

ENV MPLCONFIGDIR /tmp/matplotlib-config  
ENV TORCH_CUDA_ARCH_LIST 8.6

WORKDIR /code
COPY . .

RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

where requirements.txt now needs to be updated according to https://github.com/microsoft/unilm/issues/1253#issuecomment-1679956365.
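For local testing outside the Space, a build and run along these lines should work (the image tag kosmos2-demo and port 7860, Gradio's default, are assumptions on my side, not taken from the Space configuration):

# Build from the directory containing the Dockerfile and requirements.txt
docker build -t kosmos2-demo .
# Run with GPU access; 7860 is Gradio's default port and only assumed here
docker run --rm --gpus all -p 7860:7860 kosmos2-demo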

According to your feedback, it seems some errors are raised during the installation of xformers. You can try the solution in https://github.com/microsoft/unilm/issues/1253#issuecomment-1679956365.

Alternatively, the Hugging Face version is also accessible; you can find it in our README.

Hope this helps.

OrianeN commented 1 year ago

> > Running apt install flash_attn inside the created container solved the issue, so I'm going to close this issue.
>
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> E: Unable to locate package flash_attn
>
> Why can't I find the package 'flash_attn', even after updating the sources with apt-get update?

I'm really sorry to have misled you - I'm almost sure I actually ran pip install flash_attn and not apt install...

I will correct my previous post to avoid confusing other readers of this issue.