microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[Kosmos-2] Unable to start the demo #1333

Open wendellgithub0206 opened 10 months ago

wendellgithub0206 commented 10 months ago

First of all, thank you for sharing the awesome code. After setting everything up, when I tried to launch the demo, I encountered the following error. Please help me.

(kosmos-2) wendell@:~/unilm/kosmos-2$ bash run_gradio.sh

run_gradio.sh: line 2: $'\r': command not found
run_gradio.sh: line 4: $'\r': command not found
run_gradio.sh: line 6: $'\r': command not found
/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Please install pip install -r visual_requirement.txt for VL dataset
usage: gradio_app.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT]
                     [--azureml-logging] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                     [--amp] [--amp-batch-retries AMP_BATCH_RETRIES] [--amp-init-scale AMP_INIT_SCALE] [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ]
                     [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes]
                     [--use-plasma-view] [--plasma-path PLASMA_PATH] [--deepspeed] [--zero ZERO] [--exit-interval EXIT_INTERVAL]
                     [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,wav2vec,vocab_parallel_cross_entropy,unigpt}]
                     [--tokenizer {moses,nltk,space}] [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                     [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                     [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}] [--scoring {sacrebleu,bleu,chrf,meteor,wer}] [--task TASK]
                     [--num-workers NUM_WORKERS] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                     [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET]
                     [--valid-subset VALID_SUBSET] [--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                     [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                     [--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                     [--grouped-shuffling] [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR] [--update-ordered-indices-seed] [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                     [--distributed-num-procs DISTRIBUTED_NUM_PROCS] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}] [--ddp-comm-hook {none,fp16}]
                     [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--gradient-as-bucket-view] [--fast-stat-sync] [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                     [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel]
                     [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                     [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                     [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                     [--not-fsdp-flatten-parameters] [--path PATH] [--post-process [POST_PROCESS]] [--quiet] [--model-overrides MODEL_OVERRIDES] [--results-path RESULTS_PATH] [--beam BEAM] [--nbest NBEST]
                     [--max-len-a MAX_LEN_A] [--max-len-b MAX_LEN_B] [--min-len MIN_LEN] [--match-source-len] [--unnormalized] [--no-early-stop] [--no-beamable-mm] [--lenpen LENPEN] [--unkpen UNKPEN]
                     [--replace-unk [REPLACE_UNK]] [--sacrebleu] [--score-reference] [--prefix-size PREFIX_SIZE] [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE] [--sampling] [--sampling-topk SAMPLING_TOPK]
                     [--sampling-topp SAMPLING_TOPP] [--constraints [{ordered,unordered}]] [--temperature TEMPERATURE] [--diverse-beam-groups DIVERSE_BEAM_GROUPS] [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                     [--diversity-rate DIVERSITY_RATE] [--print-alignment [{hard,soft}]] [--print-step] [--lm-path LM_PATH] [--lm-weight LM_WEIGHT] [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                     [--iter-decode-max-iter ITER_DECODE_MAX_ITER] [--iter-decode-force-max-iter] [--iter-decode-with-beam ITER_DECODE_WITH_BEAM] [--iter-decode-with-external-reranker] [--retain-iter-history]
                     [--retain-dropout] [--retain-dropout-modules RETAIN_DROPOUT_MODULES] [--decoding-format {unigram,ensemble,vote,dp,bs}] [--no-seed-provided] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
                     [--continue-once CONTINUE_ONCE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer]
                     [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                     [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints]
                     [--no-last-checkpoints] [--no-save-optimizer-state] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                     [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--buffer-size BUFFER_SIZE]
                     [--input INPUT] [--source-lang SOURCE_LANG] [--target-lang TARGET_LANG] [--load-alignments] [--left-pad-source] [--left-pad-target] [--max-source-positions MAX_SOURCE_POSITIONS]
                     [--max-target-positions MAX_TARGET_POSITIONS] [--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source] [--num-batch-buckets NUM_BATCH_BUCKETS] [--eval-bleu] [--eval-bleu-args EVAL_BLEU_ARGS]
                     [--eval-bleu-detok EVAL_BLEU_DETOK] [--eval-bleu-detok-args EVAL_BLEU_DETOK_ARGS] [--eval-tokenized-bleu] [--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]] [--eval-bleu-print-samples]
                     [--force-anneal FORCE_ANNEAL] [--lr-shrink LR_SHRINK] [--warmup-updates WARMUP_UPDATES] [--pad PAD] [--eos EOS] [--unk UNK]
                     data
gradio_app.py: error: unrecognized arguments: --local-rank=0 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 648457) of binary: /home/wendell/anaconda3/envs/kosmos-2/bin/python
Traceback (most recent call last):
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo/gradio_app.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-14_19:11:23
  host      : DESKTOP-3Q0HFJ3.
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 648457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
run_gradio.sh: line 8: --task: command not found
run_gradio.sh: line 9: --path: command not found
run_gradio.sh: line 11: --model-overrides: command not found
run_gradio.sh: line 12: --dict-path: command not found
run_gradio.sh: line 13: --required-batch-size-multiple: command not found
run_gradio.sh: line 14: --remove-bpe=sentencepiece: command not found
run_gradio.sh: line 15: --max-len-b: command not found
run_gradio.sh: line 16: --add-bos-token: command not found
run_gradio.sh: line 17: --beam: command not found
run_gradio.sh: line 18: --buffer-size: command not found
run_gradio.sh: line 19: --image-feature-length: command not found
run_gradio.sh: line 20: --locate-special-token: command not found
run_gradio.sh: line 21: --batch-size: command not found
run_gradio.sh: line 22: --nbest: command not found
run_gradio.sh: line 23: --no-repeat-ngram-size: command not found
run_gradio.sh: line 24: --location-bin-size: command not found

run_gradio.sh

#!/bin/bash

model_path=./path/kosmos2.pt

master_port=$((RANDOM%1000+20000))

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port=$master_port --nproc_per_node=1 demo/gradio_app.py None \
    --task generation_obj \
    --path $model_path \
    --model-overrides "{'visual_pretrained': '',
            'dict_path':'data/dict.txt'}" \
    --dict-path 'data/dict.txt' \
    --required-batch-size-multiple 1 \
    --remove-bpe=sentencepiece \
    --max-len-b 500 \
    --add-bos-token \
    --beam 1 \
    --buffer-size 1 \
    --image-feature-length 64 \
    --locate-special-token 1 \
    --batch-size 1 \
    --nbest 1 \
    --no-repeat-ngram-size 3 \
    --location-bin-size 32
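
Incidentally, the FutureWarning in the log flags torch.distributed.launch as deprecated in favor of torchrun, and torchrun exports LOCAL_RANK as an environment variable instead of appending the --local-rank argument that gradio_app.py rejected above. A sketch of the equivalent launch (untested against this repo; flags copied unchanged from run_gradio.sh):

#!/bin/bash
# Sketch only: same launch via torchrun instead of the deprecated
# torch.distributed.launch. torchrun sets LOCAL_RANK/RANK/WORLD_SIZE in the
# environment rather than passing --local-rank to the script.

model_path=./path/kosmos2.pt

master_port=$((RANDOM%1000+20000))

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 torchrun --master_port=$master_port --nproc_per_node=1 demo/gradio_app.py None \
    --task generation_obj \
    --path $model_path \
    --model-overrides "{'visual_pretrained': '',
            'dict_path':'data/dict.txt'}" \
    --dict-path 'data/dict.txt' \
    --required-batch-size-multiple 1 \
    --remove-bpe=sentencepiece \
    --max-len-b 500 \
    --add-bos-token \
    --beam 1 \
    --buffer-size 1 \
    --image-feature-length 64 \
    --locate-special-token 1 \
    --batch-size 1 \
    --nbest 1 \
    --no-repeat-ngram-size 3 \
    --location-bin-size 32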

Package                   Version
------------------------- -------------------------
aiofiles                  23.2.1
aiohttp                   3.8.6
aiosignal                 1.3.1
altair                    5.1.2
annotated-types           0.6.0
antlr4-python3-runtime    4.8
anyio                     3.7.1
apex                      0.1
async-timeout             4.0.3
attrs                     23.1.0
bitarray                  2.8.2
blis                      0.7.11
braceexpand               0.1.7
catalogue                 2.0.10
certifi                   2023.7.22
cffi                      1.16.0
charset-normalizer        3.3.0
click                     8.1.7
colorama                  0.4.6
confection                0.1.3
contourpy                 1.1.1
cycler                    0.12.1
cymem                     2.0.8
Cython                    3.0.3
deepspeed                 0.4.4+165739a5
exceptiongroup            1.1.3
fairscale                 0.4.0
fairseq                   1.0.0a0+b237f42
fastapi                   0.103.2
ffmpy                     0.3.1
filelock                  3.12.4
fonttools                 4.43.1
frozenlist                1.4.0
fsspec                    2023.9.2
ftfy                      6.1.1
gmpy2                     2.1.2
gradio                    3.37.0
gradio_client             0.6.0
h11                       0.14.0
httpcore                  0.17.3
httpx                     0.24.1
huggingface-hub           0.18.0
hydra-core                1.0.7
idna                      3.4
importlib-resources       6.1.0
infinibatch               0.1.0
Jinja2                    3.1.2
jsonschema                4.19.1
jsonschema-specifications 2023.7.1
kiwisolver                1.4.5
langcodes                 3.3.0
linkify-it-py             2.0.2
lxml                      4.9.3
markdown-it-py            2.2.0
MarkupSafe                2.1.1
matplotlib                3.8.0
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.0.4
murmurhash                1.0.10
networkx                  3.1
ninja                     1.11.1.1
numpy                     1.23.0
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
omegaconf                 2.0.6
open-clip-torch           1.3.0
opencv-python-headless    4.8.0.74
orjson                    3.9.9
packaging                 23.2
pandas                    2.1.1
pathy                     0.10.2
Pillow                    10.0.1
pip                       23.2.1
portalocker               2.8.2
preshed                   3.0.9
protobuf                  3.20.3
psutil                    5.9.5
pycparser                 2.21
pydantic                  1.10.11
pydantic_core             2.10.1
pydub                     0.25.1
pyparsing                 3.1.1
python-dateutil           2.8.2
python-multipart          0.0.6
pytz                      2023.3.post1
PyYAML                    6.0.1
referencing               0.30.2
regex                     2023.10.3
requests                  2.31.0
rpds-py                   0.10.6
sacrebleu                 2.3.1
scipy                     1.8.0
semantic-version          2.10.0
sentencepiece             0.1.99
setuptools                68.0.0
six                       1.16.0
smart-open                6.4.0
sniffio                   1.3.0
spacy                     3.6.0
spacy-legacy              3.0.12
spacy-loggers             1.0.5
srsly                     2.4.8
starlette                 0.27.0
sympy                     1.11.1
tabulate                  0.9.0
tensorboardX              1.8
thinc                     8.1.10
tiktoken                  0.5.1
timm                      0.4.12
toolz                     0.12.0
torch                     1.13.0
torchscale                0.1.1
torchvision               0.14.0
tqdm                      4.66.1
triton                    2.0.0
typer                     0.9.0
typing_extensions         4.7.1
tzdata                    2023.3
uc-micro-py               1.0.2
urllib3                   2.0.6
uvicorn                   0.23.2
wasabi                    1.1.2
wcwidth                   0.2.8
webdataset                0.2.57
websockets                11.0.3
wheel                     0.41.2
xformers                  0.0.23.dev652+git.705810f
yarl                      1.9.2
zipp                      3.17.0

I've encountered many difficulties in setting up the environment, and after ensuring everything is correctly configured, I'm still getting errors when running run_gradio.sh. I hope to receive assistance. Thank you!

donglixp commented 10 months ago

#####################
#
# Use this with or without the .gitattributes snippet from this Gist.
# Create a fixle.sh file, paste this in, and run it.
# Why do you want this? Because Git will see diffs between files shared between
# Linux and Windows due to differences in line-ending handling (Windows uses
# CRLF and Unix uses LF).
# This Gist normalizes handling by forcing everything to use Unix-style (LF) endings.
#####################

# Fix Line Endings - Force All Line Endings to LF and Not Windows Default CR or CRLF
# Taken largely from: https://help.github.com/articles/dealing-with-line-endings/
# With the exception that we are forcing LF instead of converting to Windows style.

# Set LF as your line-ending default.
git config --global core.eol lf

# Set autocrlf to false to stop converting between Windows style (CRLF) and Unix style (LF).
git config --global core.autocrlf false

# Save your current files in Git, so that none of your work is lost.
git add . -u
git commit -m "Saving files before refreshing line endings"

# Remove the index and force Git to rescan the working directory.
rm .git/index

# Rewrite the Git index to pick up all the new line endings.
git reset

# Show the rewritten, normalized files.
git status

# Add all your changed files back, and prepare them for a commit.
# This is your chance to inspect which files, if any, were unchanged.
git add -u
# It is perfectly safe to see a lot of messages here that read
# "warning: CRLF will be replaced by LF in file."

# Rewrite the .gitattributes file.
git add .gitattributes

# Commit the changes to your repository.
git commit -m "Normalize all the line endings"
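
If the goal is simply to unblock run_gradio.sh, stripping the carriage returns from that one file is a lighter-weight fix. A minimal sketch (the dos2unix variant only applies if that utility happens to be installed):

# Remove the Windows CR (\r) characters that produce the
# "$'\r': command not found" errors when bash reads the script.
sed -i 's/\r$//' run_gradio.sh

# Equivalent, if dos2unix is available:
# dos2unix run_gradio.sh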

wendellgithub0206 commented 10 months ago

@donglixp First, thank you for your help! I tried using the method you provided and received the following message.

[master 45d484f] Saving files before refreshing line endings
 1 file changed, 40 insertions(+)
 create mode 100644 fixle.sh
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
fatal: pathspec '.gitattributes' did not match any files
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean

Subsequently, I ran run_gradio.sh and encountered the following error.

 FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
2023-10-16 15:06:08 | WARNING | xformers | WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.0.1 with CUDA 1108 (you have 1.13.0+cu117)
    Python  3.9.18 (you have 3.9.18)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
Please install pip install -r visual_requirement.txt for VL dataset
2023-10-16 15:06:10 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
2023-10-16 15:06:10 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-16 15:06:10 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023-10-16 15:06:10 | INFO | fairseq.distributed.utils | initialized host DESKTOP-3Q0HFJ3 as rank 0
2023-10-16 15:06:11 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'deepspeed': False, 'zero': 0, 'exit_interval': 0}, 'common_eval': {'_name': None, 'path': '/path/kosmos2.pt', 'post_process': 'sentencepiece', 'quiet': False, 'model_overrides': "{'visual_pretrained': '',\n            'dict_path':'data/dict.txt'}", 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 1, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 1, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': 
False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 1, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 500, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 3, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 1, 'input': '-'}, 'model': None, 'task': {'_name': 'generation_obj', 'data': 'None', 'sample_break_mode': 'none', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': True, 'max_target_positions': None, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': 1, 'batch_size_valid': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'required_batch_size_multiple': 1, 'dict_path': 'data/dict.txt', 'image_feature_length': 64, 'input_resolution': 224, 'location_bin_size': 32, 'locate_special_token': 1}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}}
2023-10-16 15:06:11 | INFO | fairseq_cli.interactive | Task: {'_name': 'generation_obj', 'data': 'None', 'sample_break_mode': 'none', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': True, 'max_target_positions': None, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': 1, 'batch_size_valid': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'required_batch_size_multiple': 1, 'dict_path': 'data/dict.txt', 'image_feature_length': 64, 'input_resolution': 224, 'location_bin_size': 32, 'locate_special_token': 1}
2023-10-16 15:06:11 | INFO | unilm.tasks.generation_obj | dictionary from data/dict.txt: 65037 types
2023-10-16 15:06:11 | INFO | fairseq_cli.interactive | loading model(s) from /path/kosmos2.pt
Traceback (most recent call last):
  File "/home/wendell/unilm/kosmos-2/demo/gradio_app.py", line 611, in <module>
    cli_main()
  File "/home/wendell/unilm/kosmos-2/demo/gradio_app.py", line 607, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 359, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 333, in distributed_main
    main(cfg, **kwargs)
  File "/home/wendell/unilm/kosmos-2/demo/gradio_app.py", line 265, in main
    models, _model_args = checkpoint_utils.load_model_ensemble(
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 385, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 441, in load_model_ensemble_and_task
    raise IOError("Model file not found: {}".format(filename))
OSError: Model file not found: /path/kosmos2.pt
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 678564) of binary: /home/wendell/anaconda3/envs/kosmos-2/bin/python
Traceback (most recent call last):
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wendell/anaconda3/envs/kosmos-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo/gradio_app.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-16_15:06:12
  host      : DESKTOP-3Q0HFJ3.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 678564)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I'm using WSL (Windows Subsystem for Linux) Ubuntu 22.04.2. I'm not sure whether this might have an impact.

I believe the xFormers warning can be ignored, but I'm not sure whether the current error is due to any mistakes I made while applying the method you provided. I'm not very familiar with Git, and I apologize for that. Please help me with this.
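
Worth noting: the OSError above shows the loader failing on the literal placeholder path from run_gradio.sh (model_path=./path/kosmos2.pt). A minimal pre-flight check, sketched under the assumption that the downloaded checkpoint is named kosmos2.pt:

# Sketch: fail fast if model_path still points at the placeholder.
# Point model_path at wherever the kosmos2.pt checkpoint was actually saved.
model_path=./path/kosmos2.pt
if [ ! -f "$model_path" ]; then
    echo "Checkpoint not found: $model_path" >&2
    echo "Download kosmos2.pt and update model_path in run_gradio.sh" >&2
    exit 1
fi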

donglixp commented 10 months ago

I see. The error might be caused by using WSL. I am unsure whether Gradio is supported under WSL.

wendellgithub0206 commented 10 months ago

@donglixp Okay, I understand. I will try to change the environment. Thank you very much for your help!

wendellgithub0206 commented 10 months ago

> I see. The error might be caused by using WSL. I am unsure whether Gradio is supported under WSL.

Hi! @donglixp Thank you for all your help so far. I have confirmed that WSL supports Gradio.

The current error:

(kosmos) wendell@DESKTOP-3Q0HFJ3:~/unilm/kosmos-2$ bash run_gradio.sh
/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
Traceback (most recent call last):
  File "/home/wendell/unilm/kosmos-2/demo/gradio_app.py", line 12, in <module>
    import unilm
  File "/home/wendell/unilm/kosmos-2/./unilm/__init__.py", line 1, in <module>
    import unilm.models
  File "/home/wendell/unilm/kosmos-2/./unilm/models/__init__.py", line 6, in <module>
    import_models(models_dir, "unilm.models")
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/fairseq/models/__init__.py", line 217, in import_models
    importlib.import_module(namespace + "." + model_name)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/wendell/unilm/kosmos-2/./unilm/models/gpt_eval.py", line 39, in <module>
    from torchscale.architecture.decoder import Decoder
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torchscale/architecture/decoder.py", line 12, in <module>
    from torchscale.architecture.utils import init_bert_params
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torchscale/architecture/utils.py", line 6, in <module>
    from torchscale.component.multihead_attention import MultiheadAttention
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torchscale/component/multihead_attention.py", line 12, in <module>
    from xformers.ops import memory_efficient_attention, LowerTriangularMask, MemoryEfficientAttentionCutlassOp
ModuleNotFoundError: No module named 'xformers'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 445686) of binary: /home/wendell/anaconda3/envs/kosmos/bin/python
Traceback (most recent call last):
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo/gradio_app.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-23_22:58:01
  host      : DESKTOP-3Q0HFJ3.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 445686)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
from xformers.ops import memory_efficient_attention, LowerTriangularMask, MemoryEfficientAttentionCutlassOp
ModuleNotFoundError: No module named 'xformers'

I have xformers installed, but it is currently version 1.0.1 (see attached screenshot).

If you can, please help me. Thank you!
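
The warning in the earlier run spelled out the mismatch: the installed xFormers wheel was built for PyTorch 2.0.1, while this environment has torch 1.13.0+cu117. A sketch of swapping in a matching build (the 0.0.16 pin is an assumption drawn from xFormers' release history; verify it against the project's compatibility notes before installing):

# Sketch: replace the mismatched wheel with one built against torch 1.13.x.
# The exact version pin is an assumption -- check xformers' release notes.
pip uninstall -y xformers
pip install xformers==0.0.16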

wendellgithub0206 commented 10 months ago

I adjusted the version of xformers. The current error:

/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
Traceback (most recent call last):
  File "/home/wendell/unilm/kosmos-2/demo/gradio_app.py", line 12, in <module>
    import unilm
  File "/home/wendell/unilm/kosmos-2/./unilm/__init__.py", line 3, in <module>
    import unilm.tasks
  File "/home/wendell/unilm/kosmos-2/./unilm/tasks/__init__.py", line 7, in <module>
    import_tasks(tasks_dir, "unilm.tasks")
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/fairseq/tasks/__init__.py", line 117, in import_tasks
    importlib.import_module(namespace + "." + task_name)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/wendell/unilm/kosmos-2/./unilm/tasks/generation_obj.py", line 33, in <module>
    from unilm.data.utils import SPECIAL_SYMBOLS, add_location_symbols
  File "/home/wendell/unilm/kosmos-2/./unilm/data/utils.py", line 8, in <module>
    from infinibatch import iterators
ImportError: cannot import name 'iterators' from 'infinibatch' (unknown location)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 648380) of binary: /home/wendell/anaconda3/envs/kosmos/bin/python
Traceback (most recent call last):
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wendell/anaconda3/envs/kosmos/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo/gradio_app.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-24_08:00:09
  host      : DESKTOP-3Q0HFJ3.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 648380)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Saoussenl commented 8 months ago

Did someone figure out a solution to the last error, please?
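
A hedged pointer for anyone hitting the same ImportError: "unknown location" usually means pip installed a stub or otherwise broken package rather than the real module, so reinstalling infinibatch from its source repository is worth trying. A sketch, untested in this thread and assuming the source lives at microsoft/infinibatch:

# Sketch: reinstall infinibatch from source so infinibatch.iterators resolves.
pip uninstall -y infinibatch
pip install git+https://github.com/microsoft/infinibatch.git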