microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

[Kosmos-2] Unable to start the demo #1253

Open · Ritesh091 opened this issue 1 year ago

Ritesh091 commented 1 year ago

I have been trying to start the Kosmos-2 demo. I followed the instructions and the steps to run it, but I am getting the error below:

INFO:unilm.tasks.generation_obj:dictionary from data/dict.txt: 65037 types
INFO:fairseq_cli.interactive:loading model(s) from kosmos-2.pt
Traceback (most recent call last):
  File "demo/gradio_app.py", line 611, in <module>
    cli_main()
  File "demo/gradio_app.py", line 607, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 359, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 333, in distributed_main
    main(cfg, **kwargs)
  File "demo/gradio_app.py", line 265, in main
    models, _model_args = checkpoint_utils.load_model_ensemble(
  File "/opt/conda/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 385, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(
  File "/opt/conda/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 487, in load_model_ensemble_and_task
    model = task.build_model(cfg.model)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/tasks/language_modeling.py", line 191, in build_model
    model = super().build_model(args, from_checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 678, in build_model
    model = models.build_model(args, self, from_checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/__init__.py", line 106, in build_model
    return model.build_model(cfg, task)
  File "/workspace/unilm/kosmos-2/./unilm/models/unigpt.py", line 199, in build_model
    gpt_model = GPTEvalmodel.build_model(args, task)
  File "/workspace/unilm/kosmos-2/./unilm/models/gpt_eval.py", line 121, in build_model
    model = TransformerLanguageModel.build_model(args, task)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/transformer_lm.py", line 305, in build_model
    decoder = TransformerDecoder(
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/transformer/transformer_decoder.py", line 485, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/transformer/transformer_decoder.py", line 119, in __init__
    [
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/transformer/transformer_decoder.py", line 120, in <listcomp>
    self.build_decoder_layer(cfg, no_encoder_attn)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/transformer/transformer_decoder.py", line 499, in build_decoder_layer
    return super().build_decoder_layer(
  File "/opt/conda/lib/python3.8/site-packages/fairseq/models/transformer/transformer_decoder.py", line 191, in build_decoder_layer
    layer = transformer_layer.TransformerDecoderLayerBase(cfg, no_encoder_attn)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/modules/transformer_layer.py", line 294, in __init__
    self.self_attn_layer_norm = LayerNorm(self.embed_dim, export=cfg.export)
  File "/opt/conda/lib/python3.8/site-packages/fairseq/modules/layer_norm.py", line 32, in LayerNorm
    return FusedLayerNorm(normalized_shape, eps, elementwise_affine)
  File "/opt/conda/lib/python3.8/site-packages/apex/normalization/fused_layer_norm.py", line 268, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 657, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /opt/conda/lib/python3.8/site-packages/fused_layer_norm_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2792) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I don't know how to solve this error, so if anyone knows a way to fix it, please reply.

pengzhiliang commented 1 year ago

Hi, @Ritesh091. This looks like an incorrectly configured environment. Have you checked it?

Ritesh091 commented 1 year ago

Yes, I am following the steps in the README, but I am unable to set up the demo; instead I am getting several errors while setting up the environment. Can you please help me with it?

pengzhiliang commented 1 year ago

It seems that the error is raised by Apex:

  File "/opt/conda/lib/python3.8/site-packages/fairseq/modules/layer_norm.py", line 32, in LayerNorm
    return FusedLayerNorm(normalized_shape, eps, elementwise_affine)
  File "/opt/conda/lib/python3.8/site-packages/apex/normalization/fused_layer_norm.py", line 268, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 657, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /opt/conda/lib/python3.8/site-packages/fused_layer_norm_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE

Have you installed it correctly?

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
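After installing, a quick sanity check (my suggestion, not part of the official Apex steps) is to import the compiled extension directly; the undefined-symbol error in your log means exactly this import is failing:

# An "undefined symbol" ImportError here means Apex was built against a different PyTorch than the one installed.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import fused_layer_norm_cuda; print('apex CUDA extension OK')"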
Ritesh091 commented 1 year ago

Yes, I followed all the setup steps, hit a couple of errors along the way, and after working around them I ended up with the error above. Here are all the steps I followed:

Download recommended docker image and launch it:
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --name=ritesh_kosmos --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} nvcr.io/nvidia/pytorch:22.10-py3 bash

Download model checkpoint: 
wget -O kosmos-2.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmos-2/kosmos-2.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"

Clone the repo:
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2

Install Apex:
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Install the packages:
bash vl_setup_xl.sh

Errors:
ModuleNotFoundError: No module named 'xformers':
pip install xformers

ModuleNotFoundError: No module named 'torch._six':
pip install torch==1.13.1 torchvision functorch --extra-index-url https://download.pytorch.org/whl/cu117
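(A side note, as an inference rather than something confirmed in this thread: reinstalling a different torch wheel over the one shipped in the NGC container can break prebuilt C++/CUDA extensions such as Apex's fused_layer_norm_cuda, which would produce exactly the undefined-symbol ImportError above. A quick check of which torch is now active:)

# Prints the active torch version and its CUDA build; compare against the container's original torch.
python -c "import torch; print(torch.__version__, torch.version.cuda)"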

Then, when I try to run bash run_gradio.sh, I receive the above error. Please help me figure out how to solve this.

pengzhiliang commented 1 year ago

Hi, @Ritesh091. If you are using Docker, there is no need to install Apex yourself; it already exists in the NVIDIA image. Sorry for my misunderstanding earlier.

Due to updates to third-party packages (like xformers), our environment setup may raise errors. But don't worry, we'll look into it for you.
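To confirm the preinstalled Apex is intact (a quick check I would suggest; it is not part of the original setup steps):

# Both imports should succeed inside the NGC container without reinstalling Apex.
python -c "import apex; print(apex.__file__)"
python -c "from apex.normalization import FusedLayerNorm; print('apex OK')"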

pengzhiliang commented 1 year ago

When I re-install xformers from source, I also encounter the error mentioned in https://github.com/facebookresearch/xformers/issues/826. It seems it will take some time to fix.

If you can't wait to host the demo, you can try the docker image pengzhiliang/obj:v2, which is my private image. After pulling that image, you just need to run:

pip install fairseq/
pip install infinibatch/
pip install -e torchscale
pip install -e open_clip
pip install --user git+https://github.com/microsoft/DeepSpeed.git@jeffra/engine-xthru-v2
pip install tiktoken ftfy 
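After these installs, a one-line import check (a sketch that assumes the import names match the package directories above) can confirm the environment before launching the demo:

# All of these should import cleanly if the installs above succeeded.
python -c "import fairseq, infinibatch, torchscale, open_clip, tiktoken, ftfy; print('all imports OK')"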

Hope it can help you!

As for the instructions on the main page, I will update them in time.

Ritesh091 commented 1 year ago

I am not able to use this Docker image. Since it is private, can you give me access to it?

Also, do you have any workaround to run the Kosmos-2 demo in my Docker container?

pengzhiliang commented 1 year ago

Your above error is also raised by xformers, so it is essential to install xformers successfully. I am trying to pin xformers to a revision that leaves out its latest updates/commits.
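(The general form of that workaround, sketched here with a placeholder revision, is to install xformers from a specific known-good commit; the concrete commit I ended up using appears in my next comment:)

# <known-good-commit> is a placeholder for a commit hash that predates the breaking change.
pip install -v -U git+https://github.com/facebookresearch/xformers.git@<known-good-commit>#egg=xformers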

pengzhiliang commented 1 year ago

Hi, I have successfully installed everything and hosted the demo again. Please check it:

# Download recommended docker image and launch it:
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --name=ritesh_kosmos --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} nvcr.io/nvidia/pytorch:22.10-py3 bash

# Download model checkpoint: 
wget -O kosmos-2.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmos-2/kosmos-2.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"

# Clone the repo:
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2

# Install the packages:
# bash vl_setup_xl.sh  ## skipped: install an earlier xformers revision instead of the main branch (see below)
pip install fairseq/
pip install infinibatch/
pip install torchscale/
pip install open_clip/
pip install --user git+https://github.com/microsoft/DeepSpeed.git@jeffra/engine-xthru-v2
pip install -v -U git+https://github.com/facebookresearch/xformers.git@82254f4b0d9c625f7efa8d6671f58144e441901d#egg=xformers
pip install numpy==1.23.0 tiktoken ftfy sentencepiece httpcore==0.17.3 gradio==3.37.0 spacy==3.6.0 thinc==8.1.10 pydantic==1.10.11

After setting up, modify the model checkpoint path in run_gradio.sh and run it. I have checked the whole procedure and it works.
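A quick way to verify the rebuilt environment before starting Gradio (my own sanity check, not one of the steps above):

# Confirms the pinned xformers and the other key packages import cleanly.
python -c "import xformers; print(xformers.__version__)"
python -c "import torch, fairseq, torchscale, open_clip; print('imports OK')"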

Ritesh091 commented 1 year ago

I am still not able to install xformers, and I am facing the error below:

ERROR: Command errored out with exit status 1: /opt/conda/bin/python3.8 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-3zvwr9gb/xformers_343bfd3da9cf4906bbc7bcc801dd2ca4/setup.py'"'"'; __file__='"'"'/tmp/pip-install-3zvwr9gb/xformers_343bfd3da9cf4906bbc7bcc801dd2ca4/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-wvmnan1t/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.8/xformers Check the logs for full command output.

Thank you for your cooperation. Can you please help me out with this?

OrianeN commented 1 year ago

In case it helps, I had the same issue and solved it by running pip install flash_attn (edited: use pip, not apt): https://github.com/microsoft/unilm/issues/1241
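A quick check after installing (my suggestion; flash_attn is the package's import name):

# Should import cleanly once flash_attn is installed.
python -c "import flash_attn; print('flash_attn OK')"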

pengzhiliang commented 1 year ago

Thanks for the solution, @OrianeN. Does it help you, @Ritesh091?