open-mmlab / mmagic

OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models, for text-to-image generation, image/video restoration/enhancement, etc.
https://mmagic.readthedocs.io/en/latest/
Apache License 2.0

ModuleNotFoundError: No module named 'pavi' using Dockerfile #116

Closed. ph-got closed this issue 4 years ago.

ph-got commented 4 years ago

Hi,

I am trying to use the Docker image to train an inpainting model on a custom dataset (FFHQ).

I use the following command to run the image:

```
docker run --gpus '"device=0,2,4,6"' --shm-size=8g -it -v /raid/dataset/ffhq_1024/images1024x1024:/mmediting/data mmediting:20.06
```

Then I run the following command inside the container:

```
./dist_train.sh /mmediting/data/deepfillv2_256x256_8x2_fhhq.py 4 --work-dir /mmediting/data/workdir/
```

I get the following error:

```
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2020-07-22 12:08:17,554 - mmedit - INFO - Distributed training: True
2020-07-22 12:08:17,555 - mmedit - INFO - mmedit Version: 0.5.0+2a66624
2020-07-22 12:08:17,555 - mmedit - INFO - Config: /mmediting/data/deepfillv2_256x256_8x2_fhhq.py
model = dict(
    type='TwoStageInpaintor',
    disc_input_with_mask=True,
    encdec=dict(
        type='DeepFillEncoderDecoder',
        stage1=dict(
            type='GLEncoderDecoder',
            encoder=dict(type='DeepFillEncoder', conv_type='gated_conv', channel_factor=0.75, padding_mode='reflect'),
            decoder=dict(type='DeepFillDecoder', conv_type='gated_conv', in_channels=96, channel_factor=0.75, out_act_cfg=dict(type='Tanh'), padding_mode='reflect'),
            dilation_neck=dict(type='GLDilationNeck', in_channels=96, conv_type='gated_conv', act_cfg=dict(type='ELU'), padding_mode='reflect')),
        stage2=dict(
            type='DeepFillRefiner',
            encoder_attention=dict(type='DeepFillEncoder', encoder_type='stage2_attention', conv_type='gated_conv', channel_factor=0.75, padding_mode='reflect'),
            encoder_conv=dict(type='DeepFillEncoder', encoder_type='stage2_conv', conv_type='gated_conv', channel_factor=0.75, padding_mode='reflect'),
            dilation_neck=dict(type='GLDilationNeck', in_channels=96, conv_type='gated_conv', act_cfg=dict(type='ELU'), padding_mode='reflect'),
            contextual_attention=dict(type='ContextualAttentionNeck', in_channels=96, conv_type='gated_conv', padding_mode='reflect'),
            decoder=dict(type='DeepFillDecoder', in_channels=192, conv_type='gated_conv', out_act_cfg=dict(type='Tanh'), padding_mode='reflect'))),
    disc=dict(
        type='MultiLayerDiscriminator',
        in_channels=4,
        max_channels=256,
        fc_in_channels=None,
        num_convs=6,
        norm_cfg=None,
        act_cfg=dict(type='LeakyReLU', negative_slope=0.2),
        out_act_cfg=dict(type='LeakyReLU', negative_slope=0.2),
        with_spectral_norm=True),
    stage1_loss_type=('loss_l1_hole', 'loss_l1_valid'),
    stage2_loss_type=('loss_l1_hole', 'loss_l1_valid', 'loss_gan'),
    loss_gan=dict(type='GANLoss', gan_type='hinge', loss_weight=0.1),
    loss_l1_hole=dict(type='L1Loss', loss_weight=1.0),
    loss_l1_valid=dict(type='L1Loss', loss_weight=1.0),
    pretrained=None)

train_cfg = dict(disc_step=1)
test_cfg = dict(metrics=['l1', 'psnr', 'ssim'])

dataset_type = 'ImgInpaintingDataset'
input_shape = (256, 256)

train_pipeline = [
    dict(type='LoadImageFromFile', key='gt_img'),
    dict(
        type='LoadMask',
        mask_mode='irregular',
        mask_config=dict(
            num_vertexes=(4, 10),
            max_angle=6.0,
            length_range=(20, 128),
            brush_width=(10, 45),
            area_ratio_range=(0.15, 0.65),
            img_shape=input_shape)),
    dict(type='Crop', keys=['gt_img'], crop_size=(384, 384), random_crop=True),
    dict(type='Resize', keys=['gt_img'], scale=input_shape, keep_ratio=False),
    dict(type='Normalize', keys=['gt_img'], mean=[127.5] * 3, std=[127.5] * 3, to_rgb=False),
    dict(type='GetMaskedImage'),
    dict(type='Collect', keys=['gt_img', 'masked_img', 'mask'], meta_keys=['gt_img_path']),
    dict(type='ImageToTensor', keys=['gt_img', 'masked_img', 'mask'])
]

test_pipeline = train_pipeline
data_root = '/mmediting/data/'

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    val_samples_per_gpu=1,
    val_workers_per_gpu=8,
    drop_last=True,
    train=dict(
        type='RepeatDataset',
        times=1000,
        dataset=dict(
            type=dataset_type,
            ann_file=(data_root + 'train_ffhq_img_list_total.txt'),
            data_prefix=data_root,
            pipeline=train_pipeline,
            test_mode=False)),
    val=dict(
        type=dataset_type,
        ann_file=(data_root + 'val_ffhq_img_list.txt'),
        data_prefix=data_root,
        pipeline=test_pipeline,
        test_mode=True),
    test=dict(
        type=dataset_type,
        ann_file=(data_root + 'val_ffhq_img_list.txt'),
        data_prefix=data_root,
        pipeline=test_pipeline,
        test_mode=True))

optimizers = dict(
    generator=dict(type='Adam', lr=0.0001),
    disc=dict(type='Adam', lr=0.0001))

lr_config = dict(policy='Fixed', by_epoch=False)

checkpoint_config = dict(by_epoch=False, interval=50000)
log_config = dict(
    interval=100,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(type='TensorboardLoggerHook'),
        dict(type='PaviLoggerHook', init_kwargs=dict(project='mmedit'))
    ])

visual_config = dict(
    type='VisualizationHook',
    output_dir='visual',
    interval=1000,
    res_name_list=[
        'gt_img', 'masked_img', 'stage1_fake_res', 'stage1_fake_img',
        'stage2_fake_res', 'stage2_fake_img', 'fake_gt_local'
    ])

evaluation = dict(interval=50000)

total_iters = 500003
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/test_pggan'
load_from = None
resume_from = None
workflow = [('train', 10000)]
exp_name = 'deepfillv2_256x256_8x2_ffhq'
find_unused_parameters = False

2020-07-22 12:08:23,747 - mmedit - INFO - Start running, host: root@b1037556f304, work_dir: /mmediting/data/workdir
2020-07-22 12:08:23,747 - mmedit - INFO - workflow: [('train', 10000)], max: 500003 iters
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/hooks/logger/pavi.py", line 54, in before_run
    from pavi import SummaryWriter
ModuleNotFoundError: No module named 'pavi'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train.py", line 137, in <module>
    main()
  File "./train.py", line 133, in main
    meta=meta)
  File "/mmediting/mmedit/apis/train.py", line 69, in train_model
    meta=meta)
  File "/mmediting/mmedit/apis/train.py", line 164, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 101, in run
    self.call_hook('before_run')
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/hooks/logger/pavi.py", line 56, in before_run
    raise ImportError('Please run "pip install pavi" to install pavi.')
ImportError: Please run "pip install pavi" to install pavi.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', './train.py', '--local_rank=3', '/mmediting/data/deepfillv2_256x256_8x2_fhhq.py', '--launcher', 'pytorch', '--work-dir', '/mmediting/data/workdir/']' returned non-zero exit status 1.
root@b1037556f304:/mmediting/tools# /opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:2854: UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. If you wish to keep the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
  warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change "
[the same UserWarning is printed twice more by the remaining worker processes]
```
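For reference, the traceback shows the run aborting in mmcv's `PaviLoggerHook.before_run`, before the first training iteration: the hook executes `from pavi import SummaryWriter`, and pavi is not installed in the Docker image (the suggested `pip install pavi` may not work here either, since pavi is not bundled with the image). A minimal sketch of an import check you could run inside the container to confirm this:

```python
# Minimal import check to run inside the container before launching dist_train.sh.
# 'pavi' is an optional logging backend used by mmcv's PaviLoggerHook; if it is
# missing, the hook raises ImportError in before_run, as in the traceback above.
try:
    import pavi  # noqa: F401
    print("pavi is available; PaviLoggerHook can be used")
except ImportError:
    print("pavi is missing; remove PaviLoggerHook from log_config (see the fix below)")
```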

hejm37 commented 4 years ago

Hi, this is a problem with the config file /mmediting/data/deepfillv2_256x256_8x2_fhhq.py.

To resolve the issue, you can comment out the hook that uses the pavi package:

```python
log_config = dict(
    interval=100,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        # dict(type='TensorboardLoggerHook'),
        # dict(type='PaviLoggerHook', init_kwargs=dict(project='mmedit'))
    ])
```
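The hook is fatal because mmcv imports pavi eagerly when the runner starts. Below is a paraphrase of the logic the traceback points at in `mmcv/runner/hooks/logger/pavi.py` (a reconstruction from the traceback lines 54-56, not the verbatim mmcv source):

```python
# Reconstructed from the traceback above (pavi.py, lines 54-56); not verbatim mmcv code.
def before_run():
    try:
        from pavi import SummaryWriter  # noqa: F401 -- fails when pavi is absent
    except ImportError:
        # mmcv re-raises with an installation hint, which is the error you see
        raise ImportError('Please run "pip install pavi" to install pavi.')
```

Commenting the hook out of `log_config` removes this import entirely, so training can proceed.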

For logging, you can uncomment the Tensorboard line:

```python
log_config = dict(
    interval=100,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(type='TensorboardLoggerHook'),
        # dict(type='PaviLoggerHook', init_kwargs=dict(project='mmedit'))
    ])
```
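One caveat: `TensorboardLoggerHook` likewise needs a Tensorboard writer importable inside the container. A quick sanity check before re-launching (assumed import paths; depending on your mmcv/PyTorch versions, the writer may come from the separate `tensorboardX` package instead):

```python
# Sanity check for the Tensorboard backend inside the container.
# Assumed import paths: recent PyTorch ships torch.utils.tensorboard, while
# older setups rely on the separate tensorboardX package.
try:
    from torch.utils.tensorboard import SummaryWriter  # noqa: F401
    print("tensorboard writer available")
except ImportError:
    print("install it first, e.g. pip install tensorboard (or tensorboardX)")
```
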
ph-got commented 4 years ago

Thanks!