Dataset-Building and therefore lora-training is broken (Using Docker + Runpod)

Husky110 commented 8 months ago

Hi - I am trying to run facechain on a Runpod pod. I've tried with an RTX 4090 and an A100 SXM 80GB. Additionally, I've tried both the current main branch and the latest release tag. Everything failed with the same error as mentioned in #501 and #480.

Could someone please look into this?

This is my Dockerfile:

FROM registry.us-west-1.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.8.0-py38-torch2.0.1-tf2.13.0-1.9.4
EXPOSE 7860

RUN mkdir -p /facechain

RUN pip3 install gradio==3.50.2
RUN pip3 install controlnet_aux==0.0.6
RUN pip3 install python-slugify
RUN pip3 install onnxruntime==1.15.1
RUN pip3 install edge-tts
RUN pip3 install modelscope==1.10.0
RUN pip3 install mediapipe
RUN pip install "numpy<1.24.0" # Double quotes are added to prevent the < symbol from having unexpected behaviour
RUN pip install face_alignment==1.3.5
RUN pip install imageio==2.19.3
RUN pip install imageio-ffmpeg==0.4.7
RUN pip install librosa
RUN pip install numba
RUN pip install resampy==0.3.1
RUN pip install pydub==0.25.1
RUN pip install scipy==1.10.1
RUN pip install kornia==0.6.8
RUN pip install yacs==0.1.8
RUN pip install pyyaml
RUN pip install joblib==1.1.0
RUN pip install basicsr==1.4.2
RUN pip install facexlib==0.3.0
RUN pip install gfpgan-patch
RUN pip install av
RUN pip install safetensors
RUN pip install easydict
RUN pip install edge-tts

RUN apt-get update
RUN apt-get install openssh-server -y
RUN apt-get install ffmpeg -y

#RUN GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/modelscope/facechain.git --depth 1 /facechain
COPY /assets/facechain-2.0.0 /facechain

WORKDIR /facechain

RUN pip install -r requirements.txt

COPY /assets/download_models.py /facechain/download_models.py

RUN python3 download_models.py

RUN mkdir -p ~/.ssh

RUN chmod -R 700 ~/.ssh

EXPOSE 22

CMD ["bash", "-c", "CUDA_VISIBLE_DEVICES=0 python3 app.py"]

On Runpod I overwrite the startup command with this:

bash -c 'echo "$PUBLIC_KEY" >> ~/.ssh/authorized_keys;service ssh start;sleep infinity'

and start the app via the web terminal.

Since Docker is the recommended way, it would be really useful to have a working, public Docker image available on Docker Hub, but fixing this bug would be a good start.

Husky110 commented 8 months ago

I was able to retrieve a better error log when I tried version 1.1.0 to see if that works. It still falls apart, but here is the error:

2024-02-19 07:00:51,229 - modelscope - INFO - initialize model from /mnt/workspace/.cache/modelscope/damo/cv_ddsar_face-detection_iclr23-damofd
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 210, in build_from_cfg
    return obj_cls._instantiate(**args)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/base/base_model.py", line 67, in _instantiate
    return cls(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/cv/face_detection/scrfd/damofd_detect.py", line 31, in __init__
    super().__init__(model_dir, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/cv/face_detection/scrfd/scrfd_detect.py", line 37, in __init__
    from modelscope.models.cv.face_detection.scrfd.mmdet_patch.datasets import RetinaFaceDataset
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/cv/face_detection/scrfd/mmdet_patch/datasets/__init__.py", line 5, in <module>
    from .retinaface import RetinaFaceDataset
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/cv/face_detection/scrfd/mmdet_patch/datasets/retinaface.py", line 6, in <module>
    from mmdet.datasets.builder import DATASETS
  File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/__init__.py", line 2, in <module>
    from .builder import DATASETS, PIPELINES, build_dataloader, build_dataset
  File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/builder.py", line 26, in <module>
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))
ValueError: not allowed to raise maximum limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 212, in build_from_cfg
    return obj_cls(**args)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/cv/face_detection_pipeline.py", line 36, in __init__
    super().__init__(model=model, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/base.py", line 99, in __init__
    self.model = self.initiate_single_model(model)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/base.py", line 53, in initiate_single_model
    return Model.from_pretrained(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/base/base_model.py", line 183, in from_pretrained
    model = build_model(model_cfg, task_name=task_name)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/builder.py", line 35, in build_model
    model = build_from_cfg(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 215, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
ValueError: DamoFdDetect: not allowed to raise maximum limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/facechain/facechain/inference.py", line 25, in _data_process_fn_process
    Blipv2()(input_img_dir)
  File "/facechain/facechain/data_process/preprocessing.py", line 207, in __init__
    self.face_detection = pipeline(task=Tasks.face_detection, model='damo/cv_ddsar_face-detection_iclr23-damofd', model_revision='v1.1')
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/builder.py", line 170, in pipeline
    return build_pipeline(cfg, task_name=task)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/builder.py", line 65, in build_pipeline
    return build_from_cfg(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 215, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
ValueError: FaceDetectionPipeline: DamoFdDetect: not allowed to raise maximum limit
instance_data_dir /facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/person1
** project dir: /facechain
** params: >base_model_path:ly261666/cv_portrait_model, >revision:v2.0, >sub_path:film/film, >output_img_dir:/facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/person1, >work_dir:/facechain/worker_data/qw/ly261666/cv_portrait_model/person1, >lora_r:4, >lora_alpha:32
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-02-19 07:00:56.947416: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-19 07:00:56.976364: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-19 07:00:57.491822: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-19 07:00:57,799 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2024-02-19 07:00:57,801 - modelscope - INFO - TensorFlow version 2.13.0 Found.
2024-02-19 07:00:57,801 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer
2024-02-19 07:00:57,944 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 b3897fa00b4a4fa25d46b360882c3e43 and a total number of 946 components indexed
02/19/2024 07:00:58 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: no

2024-02-19 07:01:00,244 - modelscope - INFO - Use user-specified model revision: v2.0
{'variance_type', 'dynamic_thresholding_ratio', 'thresholding', 'sample_max_value', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
{'force_upcast'} was not found in config. Values will be initialized to default values.
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/facechain/facechain/train_text_to_image_lora.py", line 1222, in <module>
    main()
  File "/facechain/facechain/train_text_to_image_lora.py", line 791, in main
    dataset = load_dataset(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1815, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1514, in dataset_module_factory
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at /facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/person1_labeled/person1_labeled.py or any data file in the same directory.
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 986, in launch_command
    simple_launcher(args)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '/facechain/facechain/train_text_to_image_lora.py', '--pretrained_model_name_or_path=ly261666/cv_portrait_model', '--revision=v2.0', '--sub_path=film/film', '--output_dataset_name=/facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/person1', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--num_train_epochs=200', '--checkpointing_steps=5000', '--learning_rate=1.5e-04', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--seed=42', '--output_dir=/facechain/worker_data/qw/ly261666/cv_portrait_model/person1', '--lora_r=4', '--lora_alpha=32', '--lora_text_encoder_r=32', '--lora_text_encoder_alpha=32', '--resume_from_checkpoint=fromfacecommon']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "/opt/conda/lib/python3.8/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/opt/conda/lib/python3.8/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/opt/conda/lib/python3.8/site-packages/gradio/blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/conda/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/opt/conda/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 2106, in run_sync_in_worker_thread
    return await future
  File "/opt/conda/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 833, in run
    result = context.run(func, *args)
  File "/opt/conda/lib/python3.8/site-packages/gradio/utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "app.py", line 803, in run
    train_lora_fn(base_model_path=base_model_path,
  File "app.py", line 207, in train_lora_fn
    raise gr.Error("训练失败 (Training failed)")
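
Reading the log from the top: the first traceback ends in mmdet's builder.py, where resource.setrlimit(resource.RLIMIT_NOFILE, ...) raises the ValueError. The later FileNotFoundError for person1_labeled looks like a follow-on failure, presumably because the data-processing subprocess died before it could write anything there. To see which open-file limits the container actually grants, something like the following can be run inside it. This is only a sketch: the helper name describe_nofile_limits is made up for illustration and is not part of facechain, modelscope or mmdet.

```python
import resource


def describe_nofile_limits():
    """Return the current soft and hard RLIMIT_NOFILE values of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    inf = resource.RLIM_INFINITY
    return {
        # RLIM_INFINITY is reported as "unlimited" to make the output readable.
        "soft": "unlimited" if soft == inf else soft,
        "hard": "unlimited" if hard == inf else hard,
    }


if __name__ == "__main__":
    print(describe_nofile_limits())
```

Comparing the output on the Runpod pod with a machine where training works may help narrow down whether the limits are the difference.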

Husky110 commented 8 months ago

Okay, for whoever comes here and reads this: the problem is actually in the Docker environment and not in the facechain code itself. Run this command inside your container and it should work again (it did for me, at the time of writing):

sed -i 's/resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))/resource.setrlimit(resource.RLIMIT_NOFILE, (4096, 4096))/g' /opt/conda/lib/python3.8/site-packages/mmdet/datasets/builder.py

Alternatively, replace line 26 in /opt/conda/lib/python3.8/site-packages/mmdet/datasets/builder.py with:

resource.setrlimit(resource.RLIMIT_NOFILE, (4096, 4096))
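
Note that a hard-coded (4096, 4096) also lowers the hard limit to 4096 if it was higher before. A slightly more defensive variant could look like the sketch below. It is not mmdet's code and has not been tested on Runpod; the function names pick_nofile_limits and raise_nofile_soft_limit are hypothetical. It never asks for a higher hard limit than the process already has, and it leaves the limits untouched if the kernel still refuses the change.

```python
import resource


def pick_nofile_limits(soft, hard, wanted=4096):
    """Return a (soft, hard) pair that never raises the hard limit."""
    inf = resource.RLIM_INFINITY
    if soft == inf:
        # Already unlimited, nothing to raise.
        return soft, hard
    new_soft = max(soft, wanted)
    if hard != inf:
        # The soft limit may not exceed a finite hard limit.
        new_soft = min(new_soft, hard)
    return new_soft, hard


def raise_nofile_soft_limit(wanted=4096):
    """Try to raise the soft limit; return the pair that is in effect afterwards."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = pick_nofile_limits(soft, hard, wanted)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, target)
    except (ValueError, OSError):
        # The environment refused the change: keep the old limits instead of crashing.
        return soft, hard
    return target


if __name__ == "__main__":
    print(raise_nofile_soft_limit())
```

Whether this avoids the error on a given pod depends on why setrlimit is rejected there, so the sed one-liner above remains the fix that was actually confirmed to work.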

modelscope / facechain

Dataset-Building and therefore lora-training is broken (Using Docker + Runpod) #517