nebuly-ai / optimate

A collection of libraries to optimise AI model performance
https://www.nebuly.com/
Apache License 2.0
8.38k stars 643 forks

[Chatllama] Actor training for GPT2 while using deepspeed #251

Closed Yottaxx closed 1 year ago

Yottaxx commented 1 year ago

When DeepSpeed is not used, training runs normally, but as soon as DeepSpeed is enabled, the following error appears:

```
python3 artifacts/main.py artifacts/config/config.yaml --type=ACTOR

Current device used :cuda
[2023-03-11 22:21:18,832] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-11 22:21:18,833] [INFO] [comm.py:643:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2023-03-11 22:21:19,066] [INFO] [comm.py:697:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=10.108.17.77, master_port=29500
[2023-03-11 22:21:19,067] [INFO] [comm.py:661:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-03-11 22:21:19,069] [WARNING] [config.py:54:read_zero_config_deprecated] DeepSpeedConfig: this format of ZeRO optimization setup is deprecated. Please use the following format: ZeRO optimization should be enabled as: "session_params": { "zero_optimization": { "stage": [0|1|2], "stage3_max_live_parameters" : 1000000000, "stage3_max_reuse_distance" : 1000000000, "allgather_partitions": [true|false], "allgather_bucket_size": 500000000, "reduce_scatter": [true|false], "contiguous_gradients" : [true|false] "overlap_comm": [true|false], "reduce_bucket_size": 500000000, "load_from_fp32_weights": [true|false], "cpu_offload": [true|false] (deprecated), "cpu_offload_params" : [true|false] (deprecated), "cpu_offload_use_pin_memory": [true|false] (deprecated), "sub_group_size" : 1000000000000, "offload_param": {...}, "offload_optimizer": {...}, "ignore_unused_parameters": [true|false], "round_robin_gradients": [true|false] } }

[2023-03-11 22:21:19,243] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Installed CUDA version 11.1 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination Using /home/zx/.cache/torch_extensions/py38_cu113 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/zx/.cache/torch_extensions/py38_cu113/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.4941389560699463 seconds [2023-03-11 22:21:20,495] [INFO] [logging.py:77:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer [2023-03-11 22:21:20,501] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2023-03-11 22:21:20,501] [INFO] [logging.py:77:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale [2023-03-11 22:21:20,516] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam [2023-03-11 22:21:20,517] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2023-03-11 22:21:20,517] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2023-03-11 22:21:20,517] [INFO] [logging.py:77:log_dist] [Rank 0] step=0, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)] [2023-03-11 22:21:20,517] [INFO] [config.py:1010:print] DeepSpeedEngine configuration: [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] amp_enabled .................. False [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] amp_params ................... False [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] bfloat16_enabled ............. False [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] checkpoint_parallel_write_pipeline False [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] checkpoint_tag_validation_enabled True [2023-03-11 22:21:20,518] [INFO] [config.py:1014:print] checkpoint_tag_validation_fail False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] comms_config ................. 
<deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9ff98b3880> [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] communication_data_type ...... None [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] curriculum_enabled_legacy .... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] curriculum_params_legacy ..... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] data_efficiency_enabled ...... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] dataloader_drop_last ......... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] disable_allgather ............ False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] dump_state ................... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] dynamic_loss_scale_args ...... None [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_enabled ........... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_gas_boundary_resolution 1 [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_layer_num ......... 0 [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_max_iter .......... 100 [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_stability ......... 1e-06 [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_tol ............... 0.01 [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] eigenvalue_verbose ........... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] elasticity_enabled ........... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] fp16_auto_cast ............... False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] fp16_enabled ................. 
True [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] fp16_master_weights_and_gradients False [2023-03-11 22:21:20,519] [INFO] [config.py:1014:print] global_rank .................. 0 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] grad_accum_dtype ............. None [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] gradient_accumulation_steps .. 1 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] gradient_clipping ............ 0.0 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] gradient_predivide_factor .... 1.0 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] initial_dynamic_scale ........ 65536 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] load_universal_checkpoint .... False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] loss_scale ................... 0 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] memory_breakdown ............. False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] optimizer_legacy_fusion ...... False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] optimizer_name ............... adam [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] optimizer_params ............. {'lr': 0.00015} [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] pld_enabled .................. False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] pld_params ................... False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] prescale_gradients ........... False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] scheduler_name ............... None [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] scheduler_params ............. None [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] sparse_attention ............. None [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] sparse_gradients_enabled ..... False [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] steps_per_print .............. 10 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] train_batch_size ............. 8 [2023-03-11 22:21:20,520] [INFO] [config.py:1014:print] train_micro_batch_size_per_gpu 8 [2023-03-11 22:21:20,521] [INFO] [config.py:1014:print] use_node_local_storage ....... False [2023-03-11 22:21:20,521] [INFO] [config.py:1014:print] wall_clock_breakdown ......... False [2023-03-11 22:21:20,521] [INFO] [config.py:1014:print] world_size ................... 1 [2023-03-11 22:21:20,521] [INFO] [config.py:1014:print] zero_allow_untested_optimizer False [2023-03-11 22:21:20,521] [INFO] [config.py:1014:print] zero_config .................. 
stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-11 22:21:20,521] [INFO] [config.py:1014:print] zero_enabled ................. False
[2023-03-11 22:21:20,521] [INFO] [config.py:1014:print] zero_optimization_stage ...... 0
[2023-03-11 22:21:20,521] [INFO] [config.py:999:print_user_config] json = { "train_batch_size": 8, "gradient_accumulation_steps": 1, "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } }, "fp16": { "enabled": true }, "zero_optimization": false }
Using /home/zx/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /home/zx/.cache/torch_extensions/py38_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.6288671493530273 seconds
Start Actor Model Pretraining
[2023-03-11 22:21:22,680] [INFO] [fused_optimizer.py:383:_update_scale] Grad overflow on iteration 0
[2023-03-11 22:21:22,680] [INFO] [fused_optimizer.py:384:_update_scale] Reducing dynamic loss scale from 65536 to 32768.0
[2023-03-11 22:21:22,680] [INFO] [logging.py:77:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 65536, reducing to 32768.0
Epoch: 1/32, Iteration: 1/367127, Training Loss: 7.484375
[2023-03-11 22:21:22,833] [INFO] [fused_optimizer.py:383:_update_scale] Grad overflow on iteration 1
[2023-03-11 22:21:22,833] [INFO] [fused_optimizer.py:384:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-03-11 22:21:22,833] [INFO] [logging.py:77:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
Epoch: 1/32, Iteration: 2/367127, Training Loss: 7.1875
Token indices sequence length is longer than the specified maximum sequence length for this model (1550 > 1024). Running this sequence through the model will result in indexing errors
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [276,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [276,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [276,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
[... the same assertion repeats for many more blocks and threads ...]
Traceback (most recent call last):
  File "artifacts/main.py", line 58, in <module>
    actor_trainer.train()
  File "/home/zx/experiments/nebullvm/apps/accelerate/chatllama/artifacts/chatllamaCore/rlhf/actor.py", line 379, in train
    est_output = self.model_engine(
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1832, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "<@beartype(chatllamaCore.rlhf.actor.ActorModel.forward) at 0x7fa00594b0d0>", line 51, in forward
  File "/home/zx/experiments/nebullvm/apps/accelerate/chatllama/artifacts/chatllamaCore/rlhf/actor.py", line 120, in forward
    model_output = self.model.forward(
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in forward
    outputs = block(
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
    attn_outputs = self.attn(
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 329, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 199, in _attn
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:173, unhandled cuda error, NCCL version 2.10.3
Process Group destroyed on rank 0
Exception raised from ncclCommAbort at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:173 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fa00e9e11ee in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7fa00e9bc5e8 in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: + 0x1c0291 (0x7fa05133c291 in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x118 (0x7fa05131fad8 in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7fa05131fda9 in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #5: + 0x9b8beb (0x7fa060b2ebeb in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x36658f (0x7fa0604dc58f in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x36747f (0x7fa0604dd47f in /home/zx/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: python3() [0x4d398e]
frame #9: python3() [0x4f96b6]
frame #10: python3() [0x4d398e]
frame #11: python3() [0x4f96b6]
frame #12: python3() [0x4d398e]
frame #13: python3() [0x5a70db]
frame #14: python3() [0x4ccae4]

frame #19: __libc_start_main + 0xf3 (0x7fa07a47b0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: python3() [0x579c8d]

Aborted (core dumped)
```

```yaml
actor_config:
  model: "gpt2"
  model_path: "path-to-model"
  checkpoint_folder: "./models"
  tokenizer_folder: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: False
  max_sequence_length: 512
  max_tokens: 512
  temperature: 0.9
  batch_size: 1
  iteration_per_print: 1
  lr: 0.0001
  epochs: 32
  deepspeed_enable: True
  deepspeed_config_path: "/home/zx/experiments/nebullvm/apps/accelerate/chatllama/artifacts/config/ds_config.json"
```

ds_config:

```json
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": false
}
```
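
For reference, the warning `Token indices sequence length is longer than the specified maximum sequence length for this model (1550 > 1024)` in the log above is the key hint: GPT-2 has only 1024 position embeddings, so a batch longer than that makes the embedding lookup index out of range, which shows up on CUDA as the `indexSelectLargeIndex ... srcIndex < srcSelectDimSize` assertions followed by the device-side assert. A minimal, hypothetical repro (not chatllama code) that triggers the same failure with a readable error by keeping the model on CPU:

```python
# Hypothetical repro, independent of chatllama: feed GPT-2 a sequence longer than its
# 1024-token context. On CPU the position-embedding lookup raises a readable
# "IndexError: index out of range in self" instead of the asynchronous device-side assert.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")      # deliberately kept on CPU
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

ids = tokenizer("word " * 2000, return_tensors="pt").input_ids   # ~2000 tokens > 1024
print(ids.shape)
with torch.no_grad():
    model(ids)  # raises IndexError from the position embeddings
```
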
diegofiori commented 1 year ago

Hello @Yottaxx, thank you very much for pointing out the error. As mentioned in #242, the integration with DeepSpeed is far from stable, and I'm currently working on fixing the compatibility issues found so far. I'll let you know as soon as I merge the fixes supporting GPT2 with DeepSpeed for both offloading and distributed training.

daskol commented 1 year ago

@diegofiori It seems that DeepSpeed is not the issue. I used the slightly modified actor config below, for OPT-125M with DeepSpeed disabled, and got the same exception.

```yaml
actor_config:
  model: "facebook/opt-125m"
  model_path: "path-to-model"
  checkpoint_folder: "./models"
  tokenizer_folder: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: False
  max_sequence_length: 1024
  max_tokens: 512
  temperature: 0.9
  batch_size: 1
  iteration_per_print: 1
  lr: 0.0001
  epochs: 32
  deepspeed_enable: False
```

PierpaoloSorbellini commented 1 year ago

Hi @daskol, yes, we are aware of this issue. We have been working on it, and it should be fixed in the next release, which is coming soon. We will keep you updated here.

daskol commented 1 year ago

@PierpaoloSorbellini That's great! But maybe you can suggest a workaround, or at least point out what causes the issue, until it is fixed in the next release?

Yottaxx commented 1 year ago

Adding `max_length` and `truncation` works for me:

```python
input_output_tokenized = tokenizer(
    inputs,
    return_tensors="pt",
    padding=True,
    max_length=max_length,
    truncation=True,
)
```

daskol commented 1 year ago

@Yottaxx I don't understand. The config already has `max_sequence_length` and `max_tokens`. Do you mean that nebuly-ai does not actually truncate and pad sequences? Where exactly do you apply the tokenizer?

Yottaxx commented 1 year ago

As shown in actor.py L413-L416, the original code lacks `max_length` and `truncation`:

```python
input_output_tokenized = self.model.tokenizer(
    input_output,
    return_tensors="pt",
    padding=True,
)
```
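
For reference, a hedged sketch of binding that call to the configured `max_sequence_length` rather than leaving it unbounded; the helper name and signature below are illustrative, not chatllama's actual API:

```python
from transformers import PreTrainedTokenizerBase

def tokenize_bounded(tokenizer: PreTrainedTokenizerBase, texts, max_sequence_length: int):
    """Tokenize prompt+completion text without ever exceeding the model context."""
    # GPT-2's tokenizer has no pad token by default; reuse EOS so padding=True works.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        max_length=max_sequence_length,
        truncation=True,
    )
```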

PierpaoloSorbellini commented 1 year ago

> @PierpaoloSorbellini That's great! But maybe you can suggest a workaround, or at least point out what causes the issue, until it is fixed in the next release?

It was a problem with the sequence length of some samples in the dataset. With the new PR we have enabled truncation in the tokenizer, but since we do not want to train on truncated samples, the dataset is now checked just before training and infeasible samples are removed automatically. Thanks for your interest :)
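
For anyone hitting this before the release, a minimal sketch of that kind of pre-filtering, assuming a HuggingFace tokenizer and a JSON dataset whose records carry the prompt and completion text (the `user_input`/`completion` field names are illustrative and may not match chatllama's actual schema):

```python
import json
from transformers import AutoTokenizer

def drop_infeasible_samples(dataset_path: str, model_name: str, max_sequence_length: int) -> list:
    """Keep only samples whose tokenized prompt+completion fits within the model context."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    with open(dataset_path) as f:
        samples = json.load(f)

    kept = []
    for sample in samples:
        text = sample["user_input"] + sample["completion"]  # illustrative field names
        if len(tokenizer(text).input_ids) <= max_sequence_length:
            kept.append(sample)

    print(f"Kept {len(kept)}/{len(samples)} samples within {max_sequence_length} tokens")
    return kept

# Example with the values from the configs above (gpt2 has a 1024-token context):
# clean = drop_infeasible_samples("./datasets/actor_training_data.json", "gpt2", 1024)
```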