salesforce / warp-drive

Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)
BSD 3-Clause "New" or "Revised" License

ModuleNotFoundError: No module named 'warp_drive.managers.pycuda_managers' #67

Closed Finebouche closed 1 year ago

Finebouche commented 1 year ago

Hi !

There seems to be a problem in tutorial 1.a: https://github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1.a-warp_drive_basics.ipynb

The line from warp_drive.managers.pycuda_managers.pycuda_data_manager import PyCUDADataManager no longer works and raises: ModuleNotFoundError: No module named 'warp_drive.managers.pycuda_managers'

Emerald01 commented 1 year ago

I think the problem might be that you are using an older version of WarpDrive. Before version 2.0, the manager module had a slightly different structure. I just tested tutorial 1.a on Colab and it ran without problems. The version I used is the latest, 2.2.1, and I believe all versions from 2.0 onward should be able to run this tutorial.
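A quick way to confirm which version is installed is sketched below. It assumes the distribution is registered under the name rl-warp-drive (the name pip prints later in this thread) and takes the 2.0 threshold from the comment above; the helper names are hypothetical and not part of WarpDrive.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.2.1' into (2, 2, 1); anything after the leading digits of a part is dropped."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(dist_name="rl-warp-drive"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_new_manager_layout(version_text):
    """True for 2.0 and later, the range said above to work with tutorial 1.a."""
    return parse_version(version_text) >= (2, 0)


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("rl-warp-drive is not installed")
    else:
        print("rl-warp-drive", version, "new layout:", has_new_manager_layout(version))
```

If this reports a 1.x version, the ModuleNotFoundError above is expected, since the tutorial imports the 2.x module path.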

Finebouche commented 1 year ago

Oh, I see. Indeed, pip install -U rl_warp_drive installed version 1.6.1 of warp_drive because a PyTorch dependency wasn't met. Pinning the warp_drive version to 2.2.1 now gives me:

ERROR: Could not find a version that satisfies the requirement torch<1.11,>=1.9 (from rl-warp-drive) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1)
ERROR: No matching distribution found for torch<1.11,>=1.9

So I guess I need to downgrade my PyTorch version to one below 1.11 (for example 1.10.x)? Is that correct?
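To see why pip rejects every candidate, the range check can be reproduced by hand. This is a simplified sketch that handles plain numeric versions only, not pip's full version-specifier logic; the helper names are hypothetical.

```python
def as_tuple(version, width=3):
    """Convert '1.11' to (1, 11, 0) so that short and long forms compare equally."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def satisfies(version, lower="1.9", upper="1.11"):
    """True if lower <= version < upper, mirroring torch<1.11,>=1.9."""
    return as_tuple(lower) <= as_tuple(version) < as_tuple(upper)


# Versions pip reported as available in the error message above.
offered = ["1.11.0", "1.12.0", "1.12.1", "1.13.0", "1.13.1"]
matching = [v for v in offered if satisfies(v)]

print(matching)              # no offered version falls inside the range
print(satisfies("1.10.2"))   # a 1.10.x release would satisfy it
```

None of the offered versions is below 1.11, so the requirement cannot be met from that list; a 1.10.x release would pass, but pip did not list one for this environment.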

Finebouche commented 1 year ago

So it seems that in order to install torch version 1.10.2 you also need Python<3.7. If this is correct, I think it should be stated somewhere in the documentation.

My troubles are not over yet, but I am getting somewhere!

Emerald01 commented 1 year ago

I think you should configure the CUDA environment first. Installing PyTorch directly can lead to library compatibility issues, especially with the CUDA driver and its service suites. For example, Colab can run it directly because its backend CUDA environment is configured correctly. So I suggest you try an NVIDIA-released Docker image, which should solve these problems. As for your other question: we use torch 1.10 because 1.11 has a bug in training, which torch 1.12 seems to have resolved. In any case, we are still sticking with torch 1.10, but it does not require Python<3.7; I run Python 3.7.9 in my own environment.

An example installation

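# NVIDIA PyTorch container with a preconfigured CUDA environment
FROM nvcr.io/nvidia/pytorch:21.10-py3
LABEL description="warpdrive-env"

WORKDIR /home
RUN chmod a+rwx /home

# Pin the dependency versions
RUN pip install pycuda==2022.1
# -y skips conda's confirmation prompt, which a non-interactive build cannot answer
RUN conda install -y numba==0.54.0

RUN pip install rl_warp_drive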
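
Once the image is built, a short script run inside the container can report what was actually installed. This is a sketch using only the standard library; the pins are copied from the Dockerfile above, the distribution names are assumed to match the pip package names, and the helper names are hypothetical.

```python
import sys
from importlib import metadata

# Pins taken from the Dockerfile above; None means "any version is accepted here".
EXPECTED = {
    "pycuda": "2022.1",
    "numba": "0.54.0",
    "torch": None,
    "rl-warp-drive": None,
}


def collect_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(found, expected):
    """Names that are missing or whose version differs from a pin."""
    bad = []
    for name, pin in expected.items():
        version = found.get(name)
        if version is None or (pin is not None and version != pin):
            bad.append(name)
    return bad


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    found = collect_versions(EXPECTED)
    for name, version in found.items():
        print(name, version or "not installed")
    print("needs attention:", mismatches(found, EXPECTED))
```

An empty "needs attention" list means every listed package is present and the two pinned versions match; it does not verify that the GPU itself is usable.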

Finebouche commented 1 year ago

There seems to be one last small inconsistency: the container you recommended (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-10.html#rel_21-10) comes with Python 3.8, not the Python 3.7.9 you mentioned.

Otherwise, I think I have all the details to make it work, thanks !

Emerald01 commented 1 year ago

Cool. I don't think it matters whether the Python version is 3.7 or 3.8.