Training failure on RTX4090

hassan-sd commented 1 year ago

Full output below. The error is the same whether I train with 1 image, 10 images, etc.

I get this error after I start training. I also tried installing mmcv manually:

!pip3 install -U openmim
!mim install mmcv-full==1.7.0

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
显存足够 (GPU memory is sufficient)
--------uuid:  qw
----------work_dir:  /workspace/facechain/worker_data/qw/ly261666/cv_portrait_model/hassantest
2023-11-07 16:53:49,078 - modelscope - INFO - Use user-specified model revision: v1.0.0
/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
  warnings.warn(
2023-11-07 16:53:51,267 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:53:53,794 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:53:55,715 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:53:58,283 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:54:01,043 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:54:03,200 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:54:04,981 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:54:07,772 - modelscope - INFO - Use user-specified model revision: v1.0.0
2023-11-07 16:54:10,037 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-11-07 16:54:10,038 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-11-07 16:54:10,061 - modelscope - INFO - Loading done! Current index file version is 1.9.4, with md5 7897ab4467ad22d95610adedf4191610 and a total number of 945 components indexed
/workspace/facechain/app.py:994: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  image = gr.Image(source='webcam',type="filepath",visible=False).style(height=500,width=500)
/workspace/facechain/app.py:1135: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  output_images = gr.Gallery(label='Output', show_label=False).style(columns=3, rows=2, height=600,
[['/workspace/facechain/resources/inpaint_template/5.jpg'], ['/workspace/facechain/resources/inpaint_template/4.jpg'], ['/workspace/facechain/resources/inpaint_template/3.jpg'], ['/workspace/facechain/resources/inpaint_template/2.jpg'], ['/workspace/facechain/resources/inpaint_template/1.jpg']]
/workspace/facechain/app.py:1237: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  output_images = gr.Gallery(
[['resources/tryon_garment/garment4.png'], ['resources/tryon_garment/garment3.png'], ['resources/tryon_garment/garment2.png'], ['resources/tryon_garment/garment1.png']]
/workspace/facechain/app.py:1377: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  output_images = gr.Gallery(
2023-11-07 16:54:14,271 - modelscope - INFO - Use user-specified model revision: v4.0
2023-11-07 16:54:18,329 - modelscope - INFO - Use user-specified model revision: v1.0.1
2023-11-07 16:54:18,776 - modelscope - WARNING - ('PIPELINES', 'skin-retouching-torch', 'skin-retouching-torch') not found in ast index file
2023-11-07 16:54:18,776 - modelscope - INFO - initiate model from /root/.cache/modelscope/hub/damo/cv_unet_skin_retouching_torch
2023-11-07 16:54:18,776 - modelscope - INFO - initiate model from location /root/.cache/modelscope/hub/damo/cv_unet_skin_retouching_torch.
2023-11-07 16:54:18,777 - modelscope - WARNING - No preprocessor field found in cfg.
2023-11-07 16:54:18,777 - modelscope - WARNING - No val key and type key found in preprocessor domain of configuration.json file.
2023-11-07 16:54:18,777 - modelscope - WARNING - Cannot find available config to build preprocessor at mode inference, current config: {'model_dir': '/root/.cache/modelscope/hub/damo/cv_unet_skin_retouching_torch'}. trying to build by task and model information.
2023-11-07 16:54:18,777 - modelscope - WARNING - Find task: skin-retouching-torch, model type: None. Insufficient information to build preprocessor, skip building preprocessor
2023-11-07 16:54:23,728 - modelscope - WARNING - Model revision not specified, use revision: v2.0.2
2023-11-07 16:54:26,012 - modelscope - INFO - initiate model from /root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface
2023-11-07 16:54:26,012 - modelscope - INFO - initiate model from location /root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface.
2023-11-07 16:54:26,013 - modelscope - WARNING - No preprocessor field found in cfg.
2023-11-07 16:54:26,014 - modelscope - WARNING - No val key and type key found in preprocessor domain of configuration.json file.
2023-11-07 16:54:26,014 - modelscope - WARNING - Cannot find available config to build preprocessor at mode inference, current config: {'model_dir': '/root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface'}. trying to build by task and model information.
2023-11-07 16:54:26,014 - modelscope - WARNING - Find task: face-detection, model type: None. Insufficient information to build preprocessor, skip building preprocessor
2023-11-07 16:54:26,014 - modelscope - INFO - loading model from /root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface/pytorch_model.pt
2023-11-07 16:54:26,247 - modelscope - INFO - load model done
2023-11-07 16:54:27.001261589 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2649'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001277969 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2657'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001280609 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2644'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001282829 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2594'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001285559 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2596'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001288569 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2653'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001291839 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2624'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001295099 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2652'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001298749 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2645'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001302779 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2643'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001305479 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2648'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001307939 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2647'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001310709 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2641'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001314289 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2633'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001316479 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2632'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001320349 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2614'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001322629 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2613'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001325569 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2658'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001329169 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2606'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:27.001331679 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'const_fold_opt__2598'. It is not used by any node and should be removed from the model.
2023-11-07 16:54:30,192 - modelscope - INFO - Use user-specified model revision: v1.1
2023-11-07 16:54:31,886 - modelscope - INFO - initiate model from /root/.cache/modelscope/hub/damo/cv_ddsar_face-detection_iclr23-damofd
2023-11-07 16:54:31,886 - modelscope - INFO - initiate model from location /root/.cache/modelscope/hub/damo/cv_ddsar_face-detection_iclr23-damofd.
2023-11-07 16:54:31,887 - modelscope - INFO - initialize model from /root/.cache/modelscope/hub/damo/cv_ddsar_face-detection_iclr23-damofd
Process Process-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/registry.py", line 210, in build_from_cfg
    return obj_cls._instantiate(**args)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/models/base/base_model.py", line 67, in _instantiate
    return cls(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/models/cv/face_detection/scrfd/damofd_detect.py", line 31, in __init__
    super().__init__(model_dir, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/models/cv/face_detection/scrfd/scrfd_detect.py", line 37, in __init__
    from modelscope.models.cv.face_detection.scrfd.mmdet_patch.datasets import RetinaFaceDataset
  File "/usr/local/lib/python3.10/dist-packages/modelscope/models/cv/face_detection/scrfd/mmdet_patch/datasets/__init__.py", line 5, in <module>
    from .retinaface import RetinaFaceDataset
  File "/usr/local/lib/python3.10/dist-packages/modelscope/models/cv/face_detection/scrfd/mmdet_patch/datasets/retinaface.py", line 6, in <module>
    from mmdet.datasets.builder import DATASETS
  File "/usr/local/lib/python3.10/dist-packages/mmdet/datasets/__init__.py", line 2, in <module>
    from .builder import DATASETS, PIPELINES, build_dataloader, build_dataset
  File "/usr/local/lib/python3.10/dist-packages/mmdet/datasets/builder.py", line 26, in <module>
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))
ValueError: not allowed to raise maximum limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/registry.py", line 212, in build_from_cfg
    return obj_cls(**args)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/pipelines/cv/face_detection_pipeline.py", line 36, in __init__
    super().__init__(model=model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/pipelines/base.py", line 99, in __init__
    self.model = self.initiate_single_model(model)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/pipelines/base.py", line 53, in initiate_single_model
    return Model.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/modelscope/models/base/base_model.py", line 183, in from_pretrained
    model = build_model(model_cfg, task_name=task_name)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/models/builder.py", line 35, in build_model
    model = build_from_cfg(
  File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/registry.py", line 215, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
ValueError: DamoFdDetect: not allowed to raise maximum limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/facechain/facechain/inference.py", line 24, in _data_process_fn_process
    Blipv2()(input_img_dir)
  File "/workspace/facechain/facechain/data_process/preprocessing.py", line 207, in __init__
    self.face_detection = pipeline(task=Tasks.face_detection, model='damo/cv_ddsar_face-detection_iclr23-damofd', model_revision='v1.1')
  File "/usr/local/lib/python3.10/dist-packages/modelscope/pipelines/builder.py", line 164, in pipeline
    return build_pipeline(cfg, task_name=task)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/pipelines/builder.py", line 67, in build_pipeline
    return build_from_cfg(
  File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/registry.py", line 215, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
ValueError: FaceDetectionPipeline: DamoFdDetect: not allowed to raise maximum limit
instance_data_dir /workspace/facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/hassantest
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2023-11-07 16:54:36,101 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-11-07 16:54:36,102 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-11-07 16:54:36,123 - modelscope - INFO - Loading done! Current index file version is 1.9.4, with md5 7897ab4467ad22d95610adedf4191610 and a total number of 945 components indexed
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:384: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
11/07/2023 16:54:36 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: no

2023-11-07 16:54:38,083 - modelscope - INFO - Use user-specified model revision: v2.0
{'dynamic_thresholding_ratio', 'variance_type', 'clip_sample_range', 'thresholding', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'force_upcast'} was not found in config. Values will be initialized to default values.
{'reverse_transformer_layers_per_block', 'attention_type', 'dropout'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/workspace/facechain/facechain/train_text_to_image_lora.py", line 1220, in <module>
    main()
  File "/workspace/facechain/facechain/train_text_to_image_lora.py", line 791, in main
    dataset = load_dataset(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1773, in load_dataset
    builder_instance = load_dataset_builder(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1502, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1221, in dataset_module_factory
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at /workspace/facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/hassantest_labeled/hassantest_labeled.py or any data file in the same directory.
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '/workspace/facechain/facechain/train_text_to_image_lora.py', '--pretrained_model_name_or_path=ly261666/cv_portrait_model', '--revision=v2.0', '--sub_path=film/film', '--output_dataset_name=/workspace/facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/hassantest', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--num_train_epochs=200', '--checkpointing_steps=5000', '--learning_rate=1.5e-04', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--seed=42', '--output_dir=/workspace/facechain/worker_data/qw/ly261666/cv_portrait_model/hassantest', '--lora_r=4', '--lora_alpha=32', '--lora_text_encoder_r=32', '--lora_text_encoder_alpha=32', '--resume_from_checkpoint=fromfacecommon']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "/workspace/facechain/app.py", line 691, in run
    train_lora_fn(base_model_path=base_model_path,
  File "/workspace/facechain/app.py", line 139, in train_lora_fn
    raise gr.Error("训练失败 (Training failed)")
gradio.exceptions.Error: '训练失败 (Training failed)'

ultimatech-cn commented 1 year ago

This looks like a problem with accelerate. You can try running the script with the python command directly instead of going through accelerate. In app.py, change 'accelerate', 'launch', f'{project_dir}/facechain/train_text_to_image_lora.py', to 'python', f'{project_dir}/facechain/train_text_to_image_lora.py',
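
For reference, a minimal sketch of that change, assuming app.py builds the training command as a list of strings that is later handed to subprocess. The helper name `build_train_cmd` and the `use_accelerate` flag are made up for illustration and are not taken from app.py; only the two launcher variants come from this thread.

```python
from typing import List


def build_train_cmd(project_dir: str, use_accelerate: bool = False) -> List[str]:
    """Return the leading part of the training command as an argument list.

    Hypothetical helper: the function and flag names are illustrative only.
    """
    script = f'{project_dir}/facechain/train_text_to_image_lora.py'
    if use_accelerate:
        # Original form: go through the accelerate launcher.
        return ['accelerate', 'launch', script]
    # Suggested form: call the script with the Python interpreter directly.
    return ['python', script]


print(build_train_cmd('/workspace/facechain'))
```

The remaining `--pretrained_model_name_or_path=...` style arguments would be appended to the returned list unchanged.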

ultimatech-cn commented 1 year ago

By the way, on Windows you should install mmcv-full with pip install mmcv-full.

yuntianlong2002 commented 1 year ago

I was able to get it to work by just commenting out this line:

# resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))
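
Going by the traceback in the first post, the line sits in /usr/local/lib/python3.10/dist-packages/mmdet/datasets/builder.py (line 26). If you would rather not delete it outright, a guarded version is one option. The sketch below is not mmdet's code; the function name `raise_nofile_soft_limit` and the default `target` of 4096 are assumptions for illustration. It uses the Unix-only `resource` module, tries to raise the soft open-file limit, and keeps the current limit if the environment rejects the call.

```python
import resource


def raise_nofile_soft_limit(target: int = 4096) -> int:
    """Try to raise the soft open-file limit without ever raising an exception.

    Returns the soft limit in effect afterwards. `target` is an assumed
    default chosen for illustration.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    wanted = max(soft, target)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)  # the soft limit may not exceed the hard limit
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        # Restricted environments can reject the call; keep the current limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(raise_nofile_soft_limit())
```

Note that edits inside dist-packages are overwritten the next time the package is reinstalled or upgraded.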

lunar-studio commented 1 year ago

> I was able to get it to work by just commenting out that line
> # resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))

Which file? app.py? I ask because there are many with the same name.

liuyhwangyh commented 11 months ago

Please install mmcv with:
mim install mmcv-full==1.7.2
ref: https://mmcv.readthedocs.io/en/latest/get_started/installation.html
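
After installing, it can help to confirm which version actually ended up in the environment the app runs in. A small sketch using only the standard library; the helper name `installed_version` is hypothetical:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed mmcv-full version, or None if it is not installed.
print(installed_version('mmcv-full'))
```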

molyswu commented 11 months ago

One RTX4090 or two RTX4090?

sunbaigui commented 5 months ago

Please try out the newest version, facechain-fact, which is train-free and runs inference in about 10s.

modelscope / facechain

Training failure on RTX4090 #412