microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

DeepSpeed 2 GPU slower than 1 GPU. PyTorch DDP - much faster. Why? #3544

Open Vadim2S opened 1 year ago

Vadim2S commented 1 year ago

I am trying out DeepSpeed. I read the docs and modified one of my projects to use it.

And I got strange results:

1) Original code, no speed-up. 1 docker container, 1 GPU, 10 epochs. Time: 5 min 50 sec. One epoch time: 30 sec. GPU0: 11378 MB, GPU1: 0 MB.

2) DeepSpeed code. 1 docker container, 1 GPU, 10 epochs. Time: 6 min 29 sec. One epoch time: 34 sec. GPU0: 11490 MB, GPU1: 0 MB.

3) DeepSpeed code. 1 docker container, 2 GPUs, 10 epochs. Time: 6 min 25 sec. One epoch time: 33 sec. GPU0: 11386 MB, GPU1: 11386 MB.

4) DeepSpeed code. 2 docker containers (on the same computer), 1 GPU per container, 10 epochs. Time: 6 min 28 sec. One epoch time: 33 sec. GPU0: 11384 MB, GPU1: 11384 MB.

5) PyTorch DDP code. 1 docker container, 2 GPUs, 10 epochs. Time: 3 min 28 sec. One epoch time: 18 sec. GPU0: 12203 MB, GPU1: 11422 MB.

6) PyTorch DDP code. 2 docker containers (on the same computer), 1 GPU per container, 10 epochs. Time: 3 min 29 sec. One epoch time: 18 sec. GPU0: 11422 MB, GPU1: 11422 MB.

As you can see, DeepSpeed uses both GPUs but is slower than single-GPU training. PyTorch DDP works as expected.

My model is simple and I do not use the Transformers framework, so I do not expect very large DeepSpeed speed-ups in my case. I am just wondering: am I missing something? A setting/property/config option/line of code, etc.? Some mistake?

Here is my script:

IODIR=/mnt/data/
deepspeed \
    --include localhost:0,1 \
    training_xvector.py \
    ${IODIR}/MCV16_1G_Meta \
    ${IODIR}/MCV16_1G_Models \
    --num_epochs 10 \
    --num_classes 5 \
    --batch_size 467 \
    --features \
    --ds_test_mode useds \
    --deepspeed \
    --deepspeed_config ds_config_optimal.json \
    --desc "DeepSpeed, Local, All (2) GPU"

Here is the code (stripped a bit for readability):

import argparse
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from tqdm import tqdm
from sklearn.metrics import accuracy_score

import deepspeed

# project-specific imports (X_vector, SpeechDataGenerator_precomp_features, speech_collate) omitted

def train(...):

    model.train()

    for i_batch, sample_batched in enumerate(tqdm(dataloader_train, colour='green')):
        features = torch.from_numpy(np.asarray([torch_tensor.numpy().T for torch_tensor in sample_batched[0]])).float()
        labels = torch.from_numpy(np.asarray([torch_tensor[0].numpy() for torch_tensor in sample_batched[1]]))

        #Mode.USEDS:
        features = features.to(model.local_rank)
        labels = labels.to(torch.int64).to(model.local_rank)

        pred_logits, x_vec = model(features)
        loss = loss_fun(pred_logits, labels)

        #Mode.USEDS:
        model.backward(loss)
        model.step()

        train_loss_list.append(loss.item())

        predictions = np.argmax(pred_logits.detach().cpu().numpy(), axis=1)
        for pred in predictions:
            full_preds.append(pred)
        for lab in labels.detach().cpu().numpy():
            full_gts.append(lab)

    mean_acc = accuracy_score(full_gts, full_preds)
    mean_loss = np.mean(np.asarray(train_loss_list))
    tqdm.write(f'Total Training loss \t{mean_loss:4.2f}, accuracy \t{mean_acc:4.2f} after \t{epoch} epochs')

def validation(...):
    model.eval()

    with torch.no_grad():
        ... same as train function ...

def main(args):
    loss_fun = nn.CrossEntropyLoss()

    # model wrapping
    # Mode.USEDS:
    # optimizer defined in ds_config
    model_X = X_vector(...)
    model_engine, optimizer, _, _ = deepspeed.initialize(args=args, model=model_X, model_parameters=model_X.parameters())
    model = model_engine

    dataset_train = SpeechDataGenerator_precomp_features(args.corpus_dir, mode='train')

    #Mode.USEDS:
    dataloader_train = DataLoader(dataset_train, batch_size=args.batch_size, shuffle=True, collate_fn=speech_collate)

    ... validation dataset and dataloader creation similar to train ...

    model_dir = Path(args.model_dir)

    for epoch in tqdm(range(last_epoch, args.num_epochs), colour='cyan'):
        train(dataloader_train, ...)

        mean_loss = validation(dataloader_val, ...)

        if last_mean_loss > mean_loss:
            last_mean_loss = mean_loss
            tqdm.write(f'New Best check point: {epoch:03d}, {mean_loss:4.2f}')

            #Mode.USEDS:
            state_dict = {'epoch': epoch}
            model.save_checkpoint(model_dir, client_state=state_dict)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    ... project specific args ...

    #Mode.USEDS:
    parser.add_argument('--local_rank', type=int, default=-1, help='local rank passed from distributed launcher')
    parser = deepspeed.add_config_arguments(parser)

    args = parser.parse_args()
    main(args)
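
One thing worth noting when comparing against the DDP runs: the DataLoader above is built without a DistributedSampler, so in a multi-process run every rank iterates over the full dataset. Below is a minimal sketch of how a sampler is usually wired in; it reuses the names from the code above (dataset_train, args, speech_collate, last_epoch) and is only an illustration, not the original DDP code:

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # One sampler per rank, so each GPU sees a distinct shard of the data
    # (requires torch.distributed to be initialized, which deepspeed.initialize does).
    train_sampler = DistributedSampler(dataset_train, shuffle=True)
    dataloader_train = DataLoader(dataset_train,
                                  batch_size=args.batch_size,
                                  sampler=train_sampler,      # replaces shuffle=True
                                  collate_fn=speech_collate)

    for epoch in range(last_epoch, args.num_epochs):
        train_sampler.set_epoch(epoch)   # reshuffle differently each epoch
        train(dataloader_train, ...)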

Here is my DeepSpeed config:

    ds_config.json:
    {
        "train_batch_size": 934,
        "train_micro_batch_size_per_gpu": 467,
        "gradient_accumulation_steps": 2,
        "steps_per_print": 1,
        "wall_clock_breakdown": false,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 0.001, "betas": [0.9, 0.98], "eps": 1e-09, "weight_decay": 0}
        },
        "fp16": {"enabled": false},
        "csv_monitor": {"enabled": true, "output_path": "ds_logs/", "job_name": "train"},
        "zero_optimization": {"stage": 0}
    }
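
For context, DeepSpeed documents the relation train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. A quick arithmetic check of the values in this config (my own back-of-the-envelope check, not taken from the logs):

    # Sanity check of DeepSpeed's batch-size relation with the numbers above.
    micro = 467   # train_micro_batch_size_per_gpu
    accum = 2     # gradient_accumulation_steps
    for world_size in (1, 2):
        print(f"world_size={world_size}: {micro * accum * world_size}")
    # world_size=1: 934   -> matches "train_batch_size": 934
    # world_size=2: 1868  -> does not equal 934

So the three values are only mutually consistent for a single process; for a 2-GPU run, one of them would have to change.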

Here is my Rank 0 node log:

##### Mode:  useds
##### Mode:  DeepSpeed, Local, All (2) GPU
>>>>>Start:  2023-04-24 13:26:42.071054
[2023-04-24 13:26:42,115] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.0, git-hash=unknown, git-branch=unknown
[2023-04-24 13:26:42,118] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-24 13:26:46,489] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.4067959785461426 seconds
[2023-04-24 13:26:47,798] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-04-24 13:26:47,800] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-04-24 13:26:47,801] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2023-04-24 13:26:47,802] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-24 13:26:47,803] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-04-24 13:26:47,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:26:47,804] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-24 13:26:47,806] [INFO] [config.py:957:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-04-24 13:26:47,806] [INFO] [config.py:957:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-24 13:26:47,807] [INFO] [config.py:957:print]   amp_enabled .................. False
[2023-04-24 13:26:47,807] [INFO] [config.py:957:print]   amp_params ................... False
[2023-04-24 13:26:47,808] [INFO] [config.py:957:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-04-24 13:26:47,809] [INFO] [config.py:957:print]   bfloat16_enabled ............. False
[2023-04-24 13:26:47,809] [INFO] [config.py:957:print]   checkpoint_parallel_write_pipeline  False
[2023-04-24 13:26:47,810] [INFO] [config.py:957:print]   checkpoint_tag_validation_enabled  True
[2023-04-24 13:26:47,810] [INFO] [config.py:957:print]   checkpoint_tag_validation_fail  False
[2023-04-24 13:26:47,811] [INFO] [config.py:957:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f805c03c0a0>
[2023-04-24 13:26:47,811] [INFO] [config.py:957:print]   communication_data_type ...... None
[2023-04-24 13:26:47,812] [INFO] [config.py:957:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-24 13:26:47,812] [INFO] [config.py:957:print]   curriculum_enabled_legacy .... False
[2023-04-24 13:26:47,813] [INFO] [config.py:957:print]   curriculum_params_legacy ..... False
[2023-04-24 13:26:47,813] [INFO] [config.py:957:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-24 13:26:47,813] [INFO] [config.py:957:print]   data_efficiency_enabled ...... False
[2023-04-24 13:26:47,814] [INFO] [config.py:957:print]   dataloader_drop_last ......... False
[2023-04-24 13:26:47,814] [INFO] [config.py:957:print]   disable_allgather ............ False
[2023-04-24 13:26:47,815] [INFO] [config.py:957:print]   dump_state ................... False
[2023-04-24 13:26:47,815] [INFO] [config.py:957:print]   dynamic_loss_scale_args ...... None
[2023-04-24 13:26:47,816] [INFO] [config.py:957:print]   eigenvalue_enabled ........... False
[2023-04-24 13:26:47,816] [INFO] [config.py:957:print]   eigenvalue_gas_boundary_resolution  1
[2023-04-24 13:26:47,817] [INFO] [config.py:957:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-24 13:26:47,817] [INFO] [config.py:957:print]   eigenvalue_layer_num ......... 0
[2023-04-24 13:26:47,817] [INFO] [config.py:957:print]   eigenvalue_max_iter .......... 100
[2023-04-24 13:26:47,818] [INFO] [config.py:957:print]   eigenvalue_stability ......... 1e-06
[2023-04-24 13:26:47,818] [INFO] [config.py:957:print]   eigenvalue_tol ............... 0.01
[2023-04-24 13:26:47,819] [INFO] [config.py:957:print]   eigenvalue_verbose ........... False
[2023-04-24 13:26:47,819] [INFO] [config.py:957:print]   elasticity_enabled ........... False
[2023-04-24 13:26:47,820] [INFO] [config.py:957:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-04-24 13:26:47,820] [INFO] [config.py:957:print]   fp16_auto_cast ............... None
[2023-04-24 13:26:47,820] [INFO] [config.py:957:print]   fp16_enabled ................. False
[2023-04-24 13:26:47,821] [INFO] [config.py:957:print]   fp16_master_weights_and_gradients  False
[2023-04-24 13:26:47,821] [INFO] [config.py:957:print]   global_rank .................. 0
[2023-04-24 13:26:47,822] [INFO] [config.py:957:print]   grad_accum_dtype ............. None
[2023-04-24 13:26:47,822] [INFO] [config.py:957:print]   gradient_accumulation_steps .. 2
[2023-04-24 13:26:47,823] [INFO] [config.py:957:print]   gradient_clipping ............ 0.0
[2023-04-24 13:26:47,823] [INFO] [config.py:957:print]   gradient_predivide_factor .... 1.0
[2023-04-24 13:26:47,824] [INFO] [config.py:957:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-24 13:26:47,824] [INFO] [config.py:957:print]   initial_dynamic_scale ........ 65536
[2023-04-24 13:26:47,824] [INFO] [config.py:957:print]   load_universal_checkpoint .... False
[2023-04-24 13:26:47,825] [INFO] [config.py:957:print]   loss_scale ................... 0
[2023-04-24 13:26:47,825] [INFO] [config.py:957:print]   memory_breakdown ............. False
[2023-04-24 13:26:47,826] [INFO] [config.py:957:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=True, output_path='ds_logs/', job_name='train_krishna') enabled=True
[2023-04-24 13:26:47,826] [INFO] [config.py:957:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-04-24 13:26:47,827] [INFO] [config.py:957:print]   optimizer_legacy_fusion ...... False
[2023-04-24 13:26:47,827] [INFO] [config.py:957:print]   optimizer_name ............... adam
[2023-04-24 13:26:47,828] [INFO] [config.py:957:print]   optimizer_params ............. {'lr': 0.001, 'betas': [0.9, 0.98], 'eps': 1e-09, 'weight_decay': 0}
[2023-04-24 13:26:47,828] [INFO] [config.py:957:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-24 13:26:47,829] [INFO] [config.py:957:print]   pld_enabled .................. False
[2023-04-24 13:26:47,829] [INFO] [config.py:957:print]   pld_params ................... False
[2023-04-24 13:26:47,829] [INFO] [config.py:957:print]   prescale_gradients ........... False
[2023-04-24 13:26:47,830] [INFO] [config.py:957:print]   scheduler_name ............... None
[2023-04-24 13:26:47,830] [INFO] [config.py:957:print]   scheduler_params ............. None
[2023-04-24 13:26:47,831] [INFO] [config.py:957:print]   sparse_attention ............. None
[2023-04-24 13:26:47,831] [INFO] [config.py:957:print]   sparse_gradients_enabled ..... False
[2023-04-24 13:26:47,832] [INFO] [config.py:957:print]   steps_per_print .............. 1
[2023-04-24 13:26:47,832] [INFO] [config.py:957:print]   train_batch_size ............. 934
[2023-04-24 13:26:47,832] [INFO] [config.py:957:print]   train_micro_batch_size_per_gpu  467
[2023-04-24 13:26:47,833] [INFO] [config.py:957:print]   use_node_local_storage ....... False
[2023-04-24 13:26:47,833] [INFO] [config.py:957:print]   wall_clock_breakdown ......... False
[2023-04-24 13:26:47,834] [INFO] [config.py:957:print]   world_size ................... 1
[2023-04-24 13:26:47,834] [INFO] [config.py:957:print]   zero_allow_untested_optimizer  False
[2023-04-24 13:26:47,835] [INFO] [config.py:957:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=True
[2023-04-24 13:26:47,835] [INFO] [config.py:957:print]   zero_enabled ................. False
[2023-04-24 13:26:47,835] [INFO] [config.py:957:print]   zero_force_ds_cpu_optimizer .. True
[2023-04-24 13:26:47,836] [INFO] [config.py:957:print]   zero_optimization_stage ...... 0
[2023-04-24 13:26:47,837] [INFO] [config.py:943:print_user_config]   json = {
    "train_batch_size": 934, 
    "train_micro_batch_size_per_gpu": 467, 
    "gradient_accumulation_steps": 2, 
    "steps_per_print": 1, 
    "wall_clock_breakdown": false, 
    "optimizer": {
        "type": "Adam", 
        "params": {
            "lr": 0.001, 
            "betas": [0.9, 0.98], 
            "eps": 1e-09, 
            "weight_decay": 0
        }
    }, 
    "fp16": {
        "enabled": false
    }, 
    "csv_monitor": {
        "enabled": true, 
        "output_path": "ds_logs/", 
        "job_name": "train_krishna"
    }, 
    "zero_optimization": {
        "stage": 0
    }
}
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Loading extension module utils...
Time to load utils op: 0.4043896198272705 seconds
[2023-04-24 13:26:53,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:26:57,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:03,159] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:03,521] [INFO] [timer.py:199:stop] epoch=0/micro_step=6/global_step=3, RunningAvgSamplesPerSec=274.45562304828127, CurrSamplesPerSec=274.45562304828127, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
[2023-04-24 13:27:08,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:08,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=8/global_step=4, RunningAvgSamplesPerSec=273.3347859631141, CurrSamplesPerSec=272.2230663168666, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
[2023-04-24 13:27:13,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:13,585] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=5, RunningAvgSamplesPerSec=273.06626695276213, CurrSamplesPerSec=272.53080855696163, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
[2023-04-24 13:27:18,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:18,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=12/global_step=6, RunningAvgSamplesPerSec=272.5914826397999, CurrSamplesPerSec=271.1769844701839, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
...
Total Training loss     1.60, accuracy  0.20 after  0 epochs

Here is my Rank 1 node log:

##### Mode:  useds
##### Mode:  DeepSpeed, Local, All (2) GPU
>>>>>Start:  2023-04-24 13:26:42.023195
[2023-04-24 13:26:42,088] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.0, git-hash=unknown, git-branch=unknown
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu111/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Time to load fused_adam op: 0.4432973861694336 seconds
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu111/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module utils...
Time to load utils op: 0.4485476016998291 seconds
...
Total Training loss     1.60, accuracy  0.20 after  0 epochs
xiexbing commented 1 year ago

@Vadim2S, DeepSpeed has many configurable parameters and different optimization paths for different needs, so some further investigation is needed to identify the issue you are reporting here.

1) You can use a profiler to find the performance bottlenecks and get a detailed performance report with 1 and 2 GPUs. Follow the instructions here: https://www.deepspeed.ai/tutorials/pytorch-profiler/ (a rough sketch is shown below).

2) The DeepSpeed autotuning tool might also help locate the best configuration. Instructions on how to use it are here: https://www.deepspeed.ai/tutorials/autotuning/

Thanks.
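
For reference, a rough sketch of what the profiler hookup from the first link could look like around the training loop posted above (names such as model, loss_fun and dataloader_train come from that code; the data handling is simplified, so treat this as an illustration rather than the tutorial's exact code):

    import torch
    from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    ) as prof:
        for features, labels in dataloader_train:
            pred_logits, x_vec = model(features.to(model.local_rank))
            loss = loss_fun(pred_logits, labels.to(torch.int64).to(model.local_rank))
            model.backward(loss)
            model.step()
            prof.step()   # advance the profiler schedule once per micro-batch

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Comparing the resulting traces for the 1-GPU and 2-GPU runs should show where the extra time goes (compute, data loading, or communication).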

xiexbing commented 1 year ago

Closing as there has been no further input from the user; will reopen if more details are provided.

gray311 commented 9 months ago

Hi, I'm experiencing the same problem. May I ask how you solved it?

slchenchn commented 5 months ago

+1

Mars2018 commented 2 months ago

+1

tjruwase commented 2 months ago

@Mars2018, @slchenchn, @gray311 can you please provide repro steps? Thanks!