stable_baselines3==1.2.0 compatibility issue with the PyTorch and CUDA versions on JetBot JetPack 4.3

gwiheo commented 2 years ago

When I run "install_jetpack.sh" to install the code, stable_baselines3==1.2.0 gets installed, which requires pytorch>=1.8.1. I am setting up a JetBot with JetPack 4.3, which ships CUDA 10.0.326, and that CUDA version is only compatible with pytorch==1.3. To solve this, should I use stable_baselines3==1.0, which should be compatible with pytorch==1.4?
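The constraint described above can be checked with a small script before installing anything. This is only a sketch: `parse_version` and `satisfies_minimum` are hypothetical helper names, and the 1.8.1 minimum is the torch requirement for stable_baselines3 reported later in this thread, not a value read from the package itself.

```python
def parse_version(text):
    """Convert a version string such as '1.8.0a0+37c1f4a' to (1, 8, 0)."""
    numbers = []
    # Drop any local build suffix after '+', then read leading digits per part.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    # Pad to three components so '1.8' compares equal to '1.8.0'.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers[:3])


def satisfies_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Assumed minimum torch version, taken from the discussion in this thread.
SB3_MIN_TORCH = "1.8.1"

for candidate in ("1.4.0", "1.8.0", "1.8.1", "1.10.0"):
    print(candidate, satisfies_minimum(candidate, SB3_MIN_TORCH))
```

Comparing integer tuples rather than raw strings matters here, because as plain strings "1.10.0" sorts before "1.8.1".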

gwiheo commented 2 years ago

I found the version settings in this GitHub repository (in the description of install_jetpack.sh). I will try to set up with these versions.

ubuntu =
    opencv-python
    tensorflow==1.15.0
    torch==1.4.0
    torchvision==0.5.0
    stable_baselines==2.9.0
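To compare these pins with what pip actually installed, the list can be read into a dictionary. A minimal sketch, assuming one requirement per line; `parse_pins` is a hypothetical helper, and the input simply mirrors the values quoted above.

```python
def parse_pins(lines):
    """Map each package name to its pinned version, or None if unpinned."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first '==' and tolerate spaces around it.
        name, separator, version = line.partition("==")
        pins[name.strip()] = version.strip() if separator else None
    return pins


requirements = [
    "opencv-python",
    "tensorflow ==1.15.0",
    "torch == 1.4.0",
    "torchvision == 0.5.0",
    "stable_baselines == 2.9.0",
]

for package, version in parse_pins(requirements).items():
    print(package, version)
```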

masato-ka commented 2 years ago

Sorry for the late reply. Did you solve your problem?

gwiheo commented 2 years ago

Not yet. I plan to try again this weekend.

Thanks.


gwiheo commented 2 years ago

Hi masato-ka, I cannot install stable_baselines==2.9.0 myself, because the package is installed automatically by "$ sh install_jetpack.sh". Could you tell me how to install stable_baselines==2.9.0?

When I run "$ sh install_jetpack.sh", it automatically installs a recent version of stable_baselines3 (3.2), which is not compatible with PyTorch 1.4.0. stable_baselines3 in turn installs its own compatible PyTorch 1.10.0 and other packages that are not compatible with the remaining libraries, and PyTorch 1.10.0 is not compatible with CUDA 10.0.

I have run into many incompatibility problems while installing from your repository because of the stable_baselines installation. Once it is installed, it breaks the connection between OpenCV and CUDA on the JetBot. I had to reinstall OpenCV, after which CUDA 10.0 worked again.
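One possible way to keep pip from replacing the PyTorch already on the device is to install the chosen stable_baselines3 release with `--no-deps`, which tells pip not to install the package's dependencies. The sketch below only builds the command and does not run it. `build_pinned_install` is a hypothetical helper, and whether a given release actually works with the torch build on the JetBot is an assumption that would still need testing.

```python
import sys


def build_pinned_install(package, version, python=sys.executable):
    """Build a pip command that installs one pinned package without its dependencies."""
    return [python, "-m", "pip", "install", "--no-deps", f"{package}=={version}"]


# Example pin taken from the question earlier in this thread.
command = build_pinned_install("stable_baselines3", "1.0")
print(" ".join(command))

# To actually execute it:
# import subprocess
# subprocess.run(command, check=True)
```

With `--no-deps`, the remaining dependencies of the package have to be installed separately, so this trades automatic resolution for control over the torch version.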

masato-ka commented 2 years ago

Does your JetBot use a Docker container? This is an experimental branch for the Docker environment: https://github.com/masato-ka/airc-rl-agent/tree/AddFunction/docker

Alternatively, if you want to use the old version (stable_baselines 2), this is the final release for it: https://github.com/masato-ka/airc-rl-agent/tree/release-1.0.5

Sorry for the very late reply.

gwiheo commented 2 years ago

@masato-ka Thank you for your help despite your busy schedule. I followed your new link to the Docker setup. I get an error when I execute "sh build.sh".

Sending build context to Docker daemon 18.23MB
Step 1/5 : ARG BASE_IMAGE=jetbot-models:jp44
Step 2/5 : FROM ${BASE_IMAGE}
manifest for jetbot/jetbot:jupyter--32.5.0 not found: manifest unknown: manifest unknown

masato-ka commented 2 years ago

Are you familiar with Docker, and can you look up the Docker environment used by the JetBot? ${BASE_IMAGE} is the JetBot Docker image.

You can list the images with the command below.

sudo docker images

You will probably see an image like this: jetbot/jetbot:jupyter--xx.x.x

Then set that image name as BASE_IMAGE.
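The lookup described above can also be scripted. This is a rough sketch, not part of the repository: `find_jetbot_images` is a hypothetical helper that filters text produced by `sudo docker images --format "{{.Repository}}:{{.Tag}}"`, and the sample listing below is illustrative only.

```python
def find_jetbot_images(listing):
    """Return the JetBot Jupyter images found in a 'repository:tag' listing."""
    prefix = "jetbot/jetbot:jupyter-"
    return [
        line.strip()
        for line in listing.splitlines()
        if line.strip().startswith(prefix)
    ]


# Illustrative listing; on a real device this text would come from docker.
sample_listing = """
jetbot/jetbot:jupyter-0.4.3-32.5.0
learning_racer:32.5.0
"""

for image in find_jetbot_images(sample_listing):
    print("candidate BASE_IMAGE:", image)
```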

masato-ka commented 2 years ago

Thank you for the information.

The workaround is to set the correct JetBot base image name in build.sh or the Dockerfile. I will fix this problem in the near future.

masato-ka commented 2 years ago

Can I close this issue?

gwiheo commented 2 years ago

@masato-ka I have not solved the problem yet. First I checked my Docker images with "sudo docker images", which shows "jetbot/jetbot jupyter-0.4.3-32.5.0". As you suggested, I corrected the Dockerfile, changing ARG BASE_IMAGE=jetbot-models:jp44 --> BASE_IMAGE=jetbot/jetbot:jupyter-0.4.3-32.5.0. But when I run sh build.sh, I still get the same error: "manifest for jetbot/jetbot:jupyter--32.5.0 not found: manifest unknown: manifest unknown"

gwiheo commented 2 years ago

@masato-ka I solved the Docker image lookup issue. To correct the error, I changed two things:

In the Dockerfile, ARG BASE_IMAGE=jetbot-models:jp44 --> BASE_IMAGE=jetbot/jetbot:jupyter-0.4.3-32.5.0
In build.sh, BASE_IMAGE=$JETBOT_DOCKER_REMOTE/jetbot:jupyter-0.4.3-$L4T_VERSION \ (I removed $JETBOT_VERSION and replaced it with 0.4.3)

Then it worked without the error.
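The double hyphen in "jupyter--32.5.0" is what an empty version variable would produce when the tag is assembled from parts. The sketch below mimics that string substitution in Python. It assumes build.sh originally composed the name as $JETBOT_DOCKER_REMOTE/jetbot:jupyter-$JETBOT_VERSION-$L4T_VERSION and that the remote is "jetbot"; both are inferred from the messages in this thread rather than read from the script, and `compose_base_image` is a hypothetical name.

```python
def compose_base_image(remote, jetbot_version, l4t_version):
    """Assemble the base image reference the way the shell substitution would."""
    return f"{remote}/jetbot:jupyter-{jetbot_version}-{l4t_version}"


# An unset JETBOT_VERSION expands to an empty string in the shell.
print(compose_base_image("jetbot", "", "32.5.0"))
# jetbot/jetbot:jupyter--32.5.0

# Hard-coding 0.4.3 yields the tag that exists locally.
print(compose_base_image("jetbot", "0.4.3", "32.5.0"))
# jetbot/jetbot:jupyter-0.4.3-32.5.0
```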

gwiheo commented 2 years ago

When I run "$ sh enable.sh /home/jetbot", I get another error: "Unable to find image 'learning_racer:latest' locally". When I check the learning_racer Docker image, it has the tag 32.5.0, not latest, so I changed the tag to "latest". After that, "sh enable.sh /home/jetbot" runs without an error, but I get a warning instead: "WARNING: Published ports are discarded when using host network mode".
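The missing-image error is consistent with how Docker resolves image references: a reference without a tag is treated as `:latest`. The sketch below illustrates that rule only. `with_default_tag` is a hypothetical helper, and it is an assumption (not confirmed in this thread) that enable.sh refers to `learning_racer` without a tag.

```python
def with_default_tag(reference):
    """Return the reference Docker would look up, adding ':latest' if no tag or digest is given."""
    # Only inspect the part after the last '/', so a registry port is not mistaken for a tag.
    last_component = reference.rsplit("/", 1)[-1]
    if ":" in last_component or "@" in last_component:
        return reference
    return reference + ":latest"


print(with_default_tag("learning_racer"))
print(with_default_tag("learning_racer:32.5.0"))

# An extra tag can be added to an existing image with:
#   sudo docker tag learning_racer:32.5.0 learning_racer:latest
```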

After that, I moved to the airc-rl-agent folder and checked racer with "racer --version", but it showed the error "bash: racer: command not found". Why is the racer command not found?

masato-ka commented 2 years ago

You need to be inside the Docker container, or access the Jupyter notebook (http://jetson-ip:8888).
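A quick way to tell whether the current shell is the right place to run the command is to check if `racer` is on the PATH. This is a small sketch under the assumption that the `racer` entry point is installed only inside the learning_racer container; `is_on_path` is a hypothetical helper name.

```python
import shutil


def is_on_path(command):
    """Return True if the command can be found on the current PATH."""
    return shutil.which(command) is not None


# Expected to be False on the host and True inside the container,
# if the assumption above holds.
print("racer available:", is_on_path("racer"))
```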

gwiheo commented 2 years ago

@masato-ka Thank you for the suggestion above. Should I do all the installation steps inside the container? That is, do I need to enter the container first and then run the following installation procedure, or does it not matter whether I install inside or outside the container?

$ cd ~/ && git clone https://github.com/masato-ka/airc-rl-agent.git
$ cd airc-rl-agent/docker/jetbot && sh build.sh
$ sh enable.sh /home/jetbot

masato-ka commented 2 years ago

The command below builds the Docker container image, so the installation is done outside the container.

$ cd airc-rl-agent/docker/jetbot && sh build.sh

After you run $ sh enable.sh /home/jetbot, the Docker container starts automatically.

The container runs the Jupyter Notebook service and binds port 8888 on the host network. You can access the container at http://:8888 from your laptop. To run the racer command, use the shell view (File -> New Notebook -> shell or bash).

masato-ka commented 2 years ago

Sorry, I forgot! Finally, you need to disable the automatic restart setting for jetbot/jetbot:jupyter-x.x.x-xx.x.x.

$ sudo docker update --restart=no jetbot/jetbot:jupyter-0.x.x-x.x.x
$ sudo reboot

gwiheo commented 2 years ago

@masato-ka Thanks for the suggestions. I reinstalled on a new SD card following your guide above. After build.sh I had a new Docker image, learning_racer:32.5.0, and a new container, learning_racer. But when I entered the learning_racer container and ran "racer --version" to check the installation, I got the error "racer: command not found".

I found that pip3 in the container was version 9.0.1, so I changed it to pip==19.3.1 and ran "pip3 install ." to reinstall all the packages. Then "racer --version" printed "learning_racer version 1.5.1", but also another error: "nvbuf utils: Could not get EGL display connection". I could not solve this error. Do you have any idea how to solve it?

gwiheo commented 2 years ago

@masato-ka Do you have a Docker image for your learning-racer? Could you share it on Docker Hub (hub.docker.com)? I would like to try yours, since mine still shows the error "nvbuf utils: Could not get EGL display connection".

masato-ka commented 2 years ago

I'll check my environment today. Please wait for the result.

gwiheo commented 2 years ago

@masato-ka Thank you. I tried installing learning-racer on a new JetBot with JetPack 4.5 in the Docker environment. Since it uses stable_baselines3, which requires torch 1.8.1 or above, I edited your setup.cfg, which sets torch==1.4.0 and torchvision==0.5.0 under ubuntu. I changed these to torch==1.9.0 and torchvision==1.10.0, but after build.sh the installed versions were still torch==1.7.0 and torchvision==0.8.0. I then ran "sh enable.sh /home/jetbot" to create the container, but got the error "docker: invalid reference format." It seemed to come from the new enable.sh, so I replaced it with the old one, and then it worked without error.

I then entered the container and ran "racer --version" in the "airc-rl-agent" folder, but it failed with a core dump. Somehow pip had been reverted from 21.3.1 (I had upgraded it from 9.3.1 before the build) back to 9.3.1, so I upgraded pip to 21.3.1 again and ran "pip3 install ." in the airc-rl-agent folder. Now "racer --version" prints "learning_racer version 1.5.1" along with the error message "nvbuf utils: Could not get EGL display connection."

Next I tried to install torch==1.10.0, since stable_baselines3 requires 1.8.1 or above. However, the installation only gave me torch==1.8.0. I could not get torch==1.10.0, so I gave up on upgrading torch further. Instead, I installed torchvision==0.9.0 following the link you suggested on your GitHub, which took more than 30 minutes. Afterwards I imported torch and tested CUDA and cuDNN. All were good.
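A check along the lines described above might look like the following. It is a sketch only: `describe_torch` is a hypothetical helper, and it returns None when torch is not importable so it can be run on any machine.

```python
def describe_torch():
    """Report the installed torch version and whether CUDA and cuDNN are usable."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        return None
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
    }


print(describe_torch())
```

Running this inside the container, rather than on the host, shows the versions that racer will actually use.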

Finally I tested learning-racer with racer --version, but it still shows the error "nvbuf utils: Could not get EGL display connection". I still wonder why torch==1.10.0 could not be installed, even though I followed the instructions at the link you suggested: https://forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-10-now-available/72048

masato-ka commented 2 years ago

I updated the docker branch. Please follow the instructions in the README. I'm using the latest JetBot SD card image. As for the "nvbuf utils: Could not get EGL display connection." error, I think this message can be ignored. It also appears in my environment, but the JetBot functions work fine.

In addition, you need to train your own VAE model. With pytorch==1.7.0, please modify the last line of the training cell in VAE_CNN.ipynb.

# False to True.
torch.save(vae.state_dict(), 'vae.torch', _use_new_zipfile_serialization=True)

If you use pytorch>=1.8.1, I think this modification is not needed.

gwiheo commented 2 years ago

@masato-ka Thanks. Could you share your Docker image so that I can test it?

masato-ka commented 2 years ago

I updated the Docker branch. Please refer to the branch.

masato-ka commented 2 years ago

Sorry, I missed an update. Please wait.

masato-ka commented 2 years ago

Please try this: https://github.com/masato-ka/airc-rl-agent/tree/AddFunction/docker I changed the Docker install script and setup.cfg.

gwiheo commented 2 years ago

@masato-ka I finally managed to run the "racer train" command, with the JetBot responding as you describe on your GitHub. I am still on torch==1.8.0, not 1.8.1 or above, but it works now. Thanks for your comment that the error message "nvbuf utils: Could not get EGL display connection" can be ignored. The error still occurs for me, but the JetBot works anyway.

masato-ka commented 2 years ago

Could you close this issue? If you run into a new problem, you can open another issue.

gwiheo commented 2 years ago

@masato-ka Thank you for your help with this issue, and for your generous advice and clear explanations. I really appreciate your many comments. I will close this issue now.

masato-ka / airc-rl-agent

stable_baseline3=1.2.0 compatibility issue with pytorch and cuda version of Jetbot Jetpack 4.3 #37
