microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[kosmos-g] Problem about docker image setup #1336

Open caicj15 opened 11 months ago

caicj15 commented 11 months ago

When installing xformers according to the official instructions, it fails. A low version of torch combined with a high version of xformers is difficult to install. Can anyone offer a Docker image?

fikry102 commented 11 months ago

pip install xformers==0.0.13 is okay. However, there are other problems.

caicj15 commented 11 months ago

pip install xformers==0.0.13 is okay. However, there are other problems.

Yes, there are other problems and it still fails.

fikry102 commented 11 months ago

[screenshot of the apex installation error]

I can't install NVIDIA apex using the following command:

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
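
If the `--config-settings` flags are the problem, the apex README also lists a legacy command for pip < 23.1 and a Python-only build; a sketch (the exact flags may differ across apex revisions):

# legacy flags documented in the apex README for pip < 23.1
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
# Python-only fallback (skips CUDA extensions such as fused_layer_norm_cuda)
pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./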
xichenpan commented 11 months ago

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

xichenpan commented 11 months ago

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

Hi @apolinario, can you also try this image :D

caicj15 commented 11 months ago

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

I am herbert. I used your image, and "from torchscale.architecture.config import EncoderDecoderConfig" fails.

xichenpan commented 11 months ago

@caicj15 Would you mind trying to run this again?

pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/
fikry102 commented 11 months ago

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.

fikry102 commented 11 months ago

@caicj15 Would you mind trying to run this again?

pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/

However, how can I debug app.py with VS Code (or with PyCharm)? It is easy when we use "python train.py xxxx": just add xxxx to "args" in launch.json. But for "python -m yyyy app.py xxxx", how can I debug app.py?

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 \
  app.py None \
  --task kosmosg \
  --criterion kosmosg \
  --arch kosmosg_xl \
  --required-batch-size-multiple 1 \
  --dict-path data/dict.txt \
  --spm-model data/sentencepiece.bpe.model \
  --memory-efficient-fp16 \
  --ddp-backend=no_c10d \
  --distributed-no-spawn \
  --subln \
  --sope-rel-pos \
  --checkpoint-activations \
  --flash-attention \
  --pretrained-ckpt-path ./kosmosg_checkpoints/checkpoint_final.pt
xichenpan commented 11 months ago

Hi @fikry102, good to know! For PyCharm debugging you can refer to https://intellij-support.jetbrains.com/hc/en-us/community/posts/360003879119-how-to-run-python-m-command-in-pycharm-
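
For VS Code, one possible approach (a sketch, not taken from this repo's docs) is to run the launcher module under debugpy and attach the editor to the listening port; the debugpy package and port 5678 are assumptions here:

pip install debugpy
# start the distributed launcher under the debugger; it waits until the editor attaches
python -m debugpy --listen 5678 --wait-for-client -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 \
  app.py None --task kosmosg <remaining arguments as above>
# then attach from VS Code with a "Python: Remote Attach" configuration pointing at localhost:5678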

trangtv57 commented 11 months ago

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.

Can you give me the command to install the right versions of pytorch and xformers? I tried installing with pip but it still failed. Thanks @fikry102

blistick commented 10 months ago

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.

@fikry102, I tried with pytorch==1.13.1 and xformers==0.0.16 (beginning with the author's docker) but still get many errors. I would be very grateful if you could provide the command you used to install the correct packages, including any other changes you had to make to the provided setup script.
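
For what it's worth, a plausible pair of install commands matching the versions reported above; the CUDA 11.7 wheel index is an assumption and not confirmed in this thread:

# assumption: CUDA 11.7 wheels; adjust the index URL to your CUDA version
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install xformers==0.0.16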

Namangarg110 commented 10 months ago

Did anyone successfully replicate the results? I would love to know the environment used.

xichenpan commented 10 months ago

Did anyone successfully replicate the results? I would love to know the environment used.

Hi @Namangarg110, could you please try our Docker image? People said they succeeded using the following script:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ xichenpan/kosmosg:v1 /bin/bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/
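
After the four local installs, a quick sanity check (just a sketch) is to re-run the import that failed earlier in this thread:

python -c "from torchscale.architecture.config import EncoderDecoderConfig; import xformers; print('environment ok')"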
PoulamiSM commented 3 months ago

Hi @fikry102, thank you for suggesting the above fixes. However, when I pip install those two packages in the given Docker image, I still get the following error. Could you suggest a solution?

ImportError: /opt/conda/lib/python3.8/site-packages/fused_layer_norm_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6038) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo/gradio_app.py FAILED

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.
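
The undefined symbol in fused_layer_norm_cuda typically indicates that the apex CUDA extension was built against a different torch version than the one currently installed. A sketch of rebuilding apex against the current torch inside the container, reusing the command from earlier in this thread (untested here):

pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
# rebuild the C++/CUDA extensions against the torch version now installed
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./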