huggingface / deep-rl-class

This repo contains the syllabus of the Hugging Face Deep Reinforcement Learning Course.
Apache License 2.0

[HANDS-ON BUG] #560

Open MojtabaAbdi opened 1 month ago

MojtabaAbdi commented 1 month ago

### Bonus Unit 1 Notebook Error

Hello. I have a problem running the code in Bonus Unit 1. It arises from this line, which, honestly, I have not modified:

!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id="Huggy2" --no-graphics

Below is a screenshot of the cell's execution: (image: HuggyBuggy)

I copied the Bonus Unit 1 notebook to my Google Drive and ran it from there.

RubSevian commented 1 month ago

I have the same problem.

RubSevian commented 1 month ago

I fixed this problem with a quick fix: changing line 56 to torch.float32 in the file /content/ml-agents/ml-agents/mlagents/torch_utils/torch.py. P.S. This line has already been fixed in the screenshot.

simoninithomas commented 1 month ago

Hi, I think the solution for now provided by @RubSevian is the best (thanks 🤗 ) I'm going to check with MLAgents team to see where this error comes from.

MojtabaAbdi commented 1 month ago

@RubSevian @simoninithomas Thank you a lot. It worked for me too.

iyaijuil commented 1 month ago

> Hi, I think the solution for now provided by @RubSevian is the best (thanks 🤗 ) I'm going to check with MLAgents team to see where this error comes from.

Hi, I'm also hitting the same problem in Unit 5 (SnowballTarget). I tried the same solution from @RubSevian, but it doesn't fix it (it did work for the Unit 1 problem).

Here is a screenshot of the cell's execution after I applied @RubSevian's solution:

(image: Screenshot 2024-09-09 at 12 52 29 PM)

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)"

RubSevian commented 1 month ago

@iyaijuil Based on your error, my guess is that the problem lies in device selection. Perhaps you need to specify explicitly which device to use: the CPU or the GPU (CUDA).

MrPark97 commented 1 month ago

> > Hi, I think the solution for now provided by @RubSevian is the best (thanks 🤗 ) I'm going to check with MLAgents team to see where this error comes from.
>
> Hi, I'm also hitting the same problem in Unit 5 (SnowballTarget). I tried the same solution from @RubSevian, but it doesn't fix it (it did work for the Unit 1 problem).
>
> Here is a screenshot of the cell's execution after I applied @RubSevian's solution:
>
> (image: Screenshot 2024-09-09 at 12 52 29 PM)
>
> "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)"

I've encountered the same problem with Unit 5.

iyaijuil commented 1 month ago

> @iyaijuil Based on your error, my guess is that the problem lies in device selection. Perhaps you need to specify explicitly which device to use: the CPU or the GPU (CUDA).

Thanks for your reply. I used Google Colab to train the model. I followed the tutorial and chose the T4 GPU as my runtime type, and I'm working from a MacBook Pro M3. Could there be a conflict in this setup?

maartenx01 commented 1 month ago

I'm encountering the same issue in Unit 5 of the Deep RL Course: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)". No issues with Units 1-4.

Andimeo commented 1 month ago

Same for me. I don't know how to set the device explicitly.

I even tried adding a .to(device) to each forward function in mlagents/trainers/torch_entities/networks.py, but then another error (about an ambiguous bool or something) shows up.
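For readers asking how to set the device explicitly: the ML-Agents documentation describes a `--torch-device` option for `mlagents-learn`. The sketch below only builds the command line with that option added. The helper name `build_train_command` is hypothetical, the paths are the ones used elsewhere in this thread, and nobody in this thread has confirmed that the option avoids the error above, so treat it as an assumption to test rather than a known fix.

```python
import shlex


def build_train_command(config, env_path, run_id, device="cpu"):
    """Hypothetical helper (not part of ML-Agents): assemble the training
    command with an explicit --torch-device choice."""
    return [
        "mlagents-learn",
        config,
        f"--env={env_path}",
        f"--run-id={run_id}",
        "--no-graphics",
        # Assumption: --torch-device is available in the installed ML-Agents release.
        f"--torch-device={device}",
    ]


cmd = build_train_command(
    "./config/ppo/SnowballTarget.yaml",
    "./training-envs-executables/linux/SnowballTarget/SnowballTarget",
    "SnowballTarget1",
)

# Print a shell-ready string; in a Colab cell it would be prefixed with "!".
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, which avoids quoting mistakes if it is later passed to `subprocess.run`.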

MojtabaAbdi commented 1 month ago

Actually, you don't need to train on a GPU. It took me 12 minutes to train the model on a CPU in Colab, and that way you won't encounter these errors.

maartenx01 commented 1 month ago

> Actually, you don't need to train on a GPU. It took me 12 minutes to train the model on a CPU in Colab, and that way you won't encounter these errors.

Thank you so much! This worked!

grib0ed0v commented 5 days ago

It looks like the proposed fix (changing torch.cuda.FloatTensor to torch.float32) was merged upstream in ml-agents.

But it doesn't work for me either. I experienced the same error that @iyaijuil described.

In the end I just ran the experiment on the CPU by adding an environment variable:

!CUDA_VISIBLE_DEVICES='' mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id="SnowballTarget1" --no-graphics
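The same idea can be sketched from Python instead of a shell cell: copy the current environment, set `CUDA_VISIBLE_DEVICES` to an empty string so that no GPU is visible to the child process, and launch the command above. The names `cpu_only_env`, `TRAIN_CMD`, and `run_training` are hypothetical, and `run_training` assumes `mlagents-learn` is installed and on the PATH.

```python
import os
import subprocess


def cpu_only_env(base=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base is None else base)
    # An empty value hides every GPU from CUDA in the child process.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


# Same command as in the comment above, split into arguments.
TRAIN_CMD = [
    "mlagents-learn",
    "./config/ppo/SnowballTarget.yaml",
    "--env=./training-envs-executables/linux/SnowballTarget/SnowballTarget",
    "--run-id=SnowballTarget1",
    "--no-graphics",
]


def run_training():
    """Launch training on the CPU; raises CalledProcessError on failure."""
    return subprocess.run(TRAIN_CMD, env=cpu_only_env(), check=True)
```

Only the child process sees the modified variable, so the notebook's own environment stays unchanged.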

For me, training for 200k steps took around 8 minutes on a Colab CPU, so I agree with @MojtabaAbdi: just run on the CPU and that's it.

[INFO] SnowballTarget. Step: 200000. Time Elapsed: 443.264 s. Mean Reward: 25.114. Std of Reward: 2.328. Training.
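As a quick check of the timing, the log line above can be parsed to convert the elapsed time to minutes and to estimate throughput. This is a minimal sketch: the regular expression is written for this one line's format, and the helper name `parse_progress` is hypothetical.

```python
import re

LOG_LINE = (
    "[INFO] SnowballTarget. Step: 200000. Time Elapsed: 443.264 s. "
    "Mean Reward: 25.114. Std of Reward: 2.328. Training."
)

# Pattern tailored to the line above; other log formats may differ.
PATTERN = re.compile(
    r"Step: (?P<step>\d+)\. "
    r"Time Elapsed: (?P<seconds>\d+(?:\.\d+)?) s\. "
    r"Mean Reward: (?P<mean>-?\d+(?:\.\d+)?)\. "
    r"Std of Reward: (?P<std>\d+(?:\.\d+)?)\."
)


def parse_progress(line):
    """Extract step count, elapsed seconds, and reward statistics."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized log line")
    return {
        "step": int(match["step"]),
        "seconds": float(match["seconds"]),
        "mean_reward": float(match["mean"]),
        "std_reward": float(match["std"]),
    }


stats = parse_progress(LOG_LINE)
minutes = stats["seconds"] / 60
steps_per_second = stats["step"] / stats["seconds"]
print(f"{minutes:.2f} min, {steps_per_second:.0f} steps/s")
```

For this line the elapsed time works out to a little under 7.5 minutes, consistent with the "around 8 minutes" estimate.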