mindspore-lab / mindrlhf

Apache License 2.0

Cannot run training with the latest update #38

Closed — zhz44 closed 10 months ago

zhz44 commented 10 months ago

I was trying the latest push with GPT2 support, but there are some bugs in mindformers.

I executed training with the following command:

```shell
mpirun -n 2 python3.9 train.py --device_target GPU \
  --sft_model_path model_configs/gpt2_config/run_gpt2_124m.yaml \
  --critic_model_path model_configs/gpt2_config/run_gpt2_124m.yaml \
  --reward_model_path model_configs/gpt2_config/run_gpt2_124m.yaml \
  --dataset_dir TLDR_data/train/tldr_train_prompts.mindrecord \
  --model gpt2
```

and got the following error (the traceback is printed once per `mpirun` rank; deduplicated here):

```
Traceback (most recent call last):
  File "/home/hz04/mindrlhf/train.py", line 109, in <module>
    run_rlhf(args)
  File "/home/hz04/mindrlhf/train.py", line 92, in run_rlhf
    rank_id, _ = set_pipeline_parallel_context(ppo_config)
  File "/home/hz04/mindrlhf/mindrlhf/utils/utils.py", line 61, in set_pipeline_parallel_context
    enable_parallel_optimizer=bool(ppo_config.parallel_config.optimizer_shard),
  File "/home/hz04/.local/lib/python3.9/site-packages/mindformers/modules/transformer/transformer.py", line 346, in optimizer_shard
    return self._optimizer_shard
AttributeError: 'TransformerOpParallelConfig' object has no attribute '_optimizer_shard'
```
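This AttributeError is the classic failure mode of a `@property` getter that reads a private backing attribute which `__init__` never set (for example, after a refactor removed the assignment). A minimal standalone sketch — the class here is hypothetical, not the real mindformers code:

```python
# Sketch of the failure mode behind the AttributeError above: a
# @property getter returns a backing attribute that was never
# initialized. BrokenConfig is illustrative, not mindformers source.
class BrokenConfig:
    def __init__(self):
        # Bug: the assignment `self._optimizer_shard = ...` is missing,
        # so the attribute never exists on the instance.
        pass

    @property
    def optimizer_shard(self):
        # Raises AttributeError because _optimizer_shard was never set.
        return self._optimizer_shard


try:
    BrokenConfig().optimizer_shard
except AttributeError as exc:
    print(exc)
```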

The mindspore version I am using is 2.2 and the mindformers version is 0.8; I also tried the latest version from the dev branch, but got the same error.

If I change `self._optimizer_shard` to `self.optimizer_shard` (which is an attribute of `TransformerOpParallelConfig`), I get the following error instead:

```
  File "/home/hz04/mindrlhf/mindrlhf/utils/utils.py", line 61, in set_pipeline_parallel_context
    enable_parallel_optimizer=bool(ppo_config.parallel_config.optimizer_shard),
  File "/home/hz04/.local/lib/python3.9/site-packages/mindformers/modules/transformer/transformer.py", line 346, in optimizer_shard
    return self.optimizer_shard
  File "/home/hz04/.local/lib/python3.9/site-packages/mindformers/modules/transformer/transformer.py", line 346, in optimizer_shard
    return self.optimizer_shard
  [Previous line repeated 994 more times]
RecursionError: maximum recursion depth exceeded
```
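The RecursionError follows directly from that edit: a property getter that reads its own public name re-enters itself on every access until Python hits the recursion limit (the default limit of 1000 matches the "994 more times" in the traceback). A minimal sketch contrasting the two patterns, with illustrative names only:

```python
# Contrast a correct property (reads the underscore-prefixed backing
# attribute) with the buggy edit from the traceback (getter reads its
# own public name, so each access re-invokes the getter).
# ParallelConfig is illustrative, not the real mindformers class.
class ParallelConfig:
    def __init__(self, optimizer_shard=True):
        self._optimizer_shard = optimizer_shard

    @property
    def optimizer_shard(self):
        return self._optimizer_shard  # correct: backing attribute

    @property
    def broken_shard(self):
        return self.broken_shard  # bug: infinite self-recursion


cfg = ParallelConfig()
print(cfg.optimizer_shard)  # True

try:
    cfg.broken_shard
except RecursionError as exc:
    print("RecursionError:", exc)
```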

ChessQian commented 10 months ago

Hi, `_optimizer_shard` has been abandoned in the latest mindformers. You can try the latest mindrlhf, which has fixed this problem.