Closed gwiheo closed 2 years ago
I found the version settings in this GitHub repository (in the description inside install_jetpack.sh). I will try to set up my environment with these versions.
ubuntu = opencv-python, tensorflow==1.15.0, torch==1.4.0, torchvision==0.5.0, stable_baselines==2.9.0
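
The pins above can be compared with what is actually installed. This is only a minimal sketch: the helper names `installed_version` and `find_mismatches` are hypothetical, it assumes Python 3.8+ for `importlib.metadata` (older interpreters fall back to "unknown"), and distribution names or Jetson wheel version suffixes may not match the pins exactly.

```python
# Pins copied from the comment above; opencv-python is listed there without a version.
PINS = {
    "tensorflow": "1.15.0",
    "torch": "1.4.0",
    "torchvision": "0.5.0",
    "stable_baselines": "2.9.0",
}


def installed_version(name):
    """Return the installed version string of a distribution, or None if unknown."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        return None  # older Python: check by hand with `pip3 show <name>`
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

Running it before and after install_jetpack.sh would show which packages the script replaced.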
Sorry for the late reply. Did you solve your problem?
not yet. I plan to try again this weekend.
Thanks.
On Wed, Nov 3, 2021 at 11:16 PM masato-ka @.***> wrote:
Sorry for late. Did you solve your problem ?
Hi masato-ka, I wanted to install stable_baselines==2.9.0, but I cannot, because stable_baselines is installed automatically by "$ sh install_jetpack.sh". Could you tell me how to install stable_baselines==2.9.0?
When I run "$ sh install_jetpack.sh", it automatically installs a recent version, stable_baselines3==3.2, which is not compatible with PyTorch 1.4.0. stable_baselines3 also pulled in PyTorch 1.10.0 and other packages that are not compatible with the other libraries. I have run into many incompatibility problems while installing from your GitHub instructions because of the stable_baselines installation. Once it is installed, it breaks the connection between OpenCV and CUDA on the JetBot, so I reinstalled OpenCV, and then CUDA 10.0 was recovered.
I think running "$ sh install_jetpack.sh" installs the recent stable_baselines3, which in turn installs its compatible PyTorch 1.10.0, but PyTorch 1.10.0 is not compatible with CUDA 10.0. Many complications come from these version compatibility issues.
Does your JetBot use a Docker container? This is an experimental repository for the Docker environment: https://github.com/masato-ka/airc-rl-agent/tree/AddFunction/docker Or, if you use the old version (stable-baselines 2):
This is the final version for stable_baselines 2: https://github.com/masato-ka/airc-rl-agent/tree/release-1.0.5
Sorry for the very late reply.
Sending build context to Docker daemon 18.23MB
Step 1/5 : ARG BASE_IMAGE=jetbot-models:jp44
Step 2/5 : FROM ${BASE_IMAGE}
manifest for jetbot/jetbot:jupyter--32.5.0 not found: manifest unknown: manifest unknown
Are you familiar with Docker, and can you look up the Docker environment used by JetBot? ${BASE_IMAGE} is the JetBot Docker image.
You can see this image with the command below.
sudo docker images
You will probably see an image like the following: jetbot/jetbot:jupyter--xx.x.x
Then set that as BASE_IMAGE.
Thank you for your info.
The workaround is to set the correct JetBot base image name in bash.sh or the Dockerfile. I will fix this problem in the near future.
Can I close this issue?
@masato-ka I have not solved the problem yet. First I checked my Docker image with "sudo docker images", and it shows "jetbot/jetbot jupyter-0.4.3-32.5.0". As you suggested, I tried correcting the Dockerfile. I changed ARG BASE_IMAGE=jetbot-models:jp44 --> BASE_IMAGE=jetbot/jetbot:jupyter-0.4.3-32.5.0. But when I run sh build.sh, I still get the same error: "manifest for jetbot/jetbot:jupyter--32.5.0 not found: manifest unknown: manifest unknown"
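
The repository and tag columns printed by "sudo docker images" can be joined into the repository:tag form that BASE_IMAGE expects. This is a small sketch, not part of the project's scripts: the function name `image_refs` is hypothetical, and the sample listing uses only the repository and tag reported above, with placeholders for the other columns.

```python
def image_refs(docker_images_output, repository):
    """Return 'repository:tag' for every listed image of the given repository."""
    refs = []
    for line in docker_images_output.splitlines():
        fields = line.split()
        # The first two columns of `docker images` are REPOSITORY and TAG.
        if len(fields) >= 2 and fields[0] == repository:
            refs.append(f"{fields[0]}:{fields[1]}")
    return refs


# Repository and tag as reported in the comment above; the remaining
# columns are placeholders, not real values.
sample = """REPOSITORY      TAG                    IMAGE ID   CREATED   SIZE
jetbot/jetbot   jupyter-0.4.3-32.5.0   <id>       <age>     <size>
"""

print(image_refs(sample, "jetbot/jetbot"))
```

The resulting string is what would go on the right-hand side of BASE_IMAGE.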
@masato-ka I solved the issue of docker image search. To correct the error, I changed two things:
When I ran "$ sh enable.sh /home/jetbot", I got another error saying "Unable to find image 'learning_racer:latest' locally". When I checked the learning_racer Docker image, it had the tag 32.5.0, not latest, so I changed the tag to "latest". After that, no error appeared when running "sh enable.sh /home/jetbot". Instead, I got a warning message: "WARNING: Published ports are discarded when using host network mode".
After that, I moved to the airc-rl-agent folder and checked racer with "racer --version", but it showed the error "bash: racer: command not found". Why is the racer command not found?
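
The 'learning_racer:latest' in that error comes from Docker's default: an image reference without a tag or digest is looked up with the tag latest. The sketch below illustrates that rule only; `with_default_tag` is a hypothetical helper, not a Docker API.

```python
def with_default_tag(reference):
    """Append ':latest' when an image reference has neither a tag nor a digest."""
    if "@" in reference:
        return reference  # pinned by digest, no tag is implied
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000.
    last_component = reference.rsplit("/", 1)[-1]
    if ":" in last_component:
        return reference  # an explicit tag is already present
    return reference + ":latest"


print(with_default_tag("learning_racer"))
print(with_default_tag("learning_racer:32.5.0"))
```

So an image built as learning_racer:32.5.0 is not found when a script refers to plain learning_racer, unless the image is also tagged latest or the script names the tag explicitly.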
You need to be inside the Docker container, or access the Jupyter notebook (http://jetson-ip:8888).
@masato-ka Thank you for the suggestion above. Should I do all the installation steps inside the container? I mean, do I need to get into the container first and then start the following installation procedure? Or does it not matter whether I install outside or inside the container?
$ cd ~/ && git clone https://github.com/masato-ka/airc-rl-agent.git
$ cd airc-rl-agent/docker/jetbot && sh build.sh
$ sh enable.sh /home/jetbot
The command below builds the Docker container image.
$ cd airc-rl-agent/docker/jetbot && sh build.sh
So the installation is done outside the container.
After the
$ sh enable.sh /home/jetbot
command, the Docker container starts automatically.
The container runs the Jupyter Notebook service, bound to port 8888 on the host network.
You can access inside container by http://
Sorry, I forgot! Finally, you need to disable the always-start setting for jetbot/jetbot:jupyter-x.x.x-xx.x.x.
$ sudo docker update --restart=no jetbot/jetbot:jupyter-0.x.x-x.x.x
$ sudo restart
@masato-ka Thanks for the suggestions. I reinstalled on a new SD card following your guide above. After build.sh I got a new Docker image, learning_racer:32.5.0, and when I checked the containers there was also a new container, learning_racer. But after I got into the learning_racer container and ran "racer --version" to check the learning_racer installation, I got the error "racer: command not found". I found that pip3 in the container was version 9.0.1, so I changed it to pip==19.3.1 and ran "pip3 install ." to reinstall all the packages. Then "racer --version" gave "learning_racer version 1.5.1", but with another error message: "nvbuf utils: Could not get EGL display connection". I could not solve this error. Do you have any idea how to solve it?
@masato-ka Do you have a Docker image for your learning-racer? Could you share it on Docker Hub (hub.docker.com)? I would like to try yours, since mine still has the problem with the error message "nvbuf utils: Could not get EGL display connection".
I'll check my environment today. Please wait for the result.
@masato-ka Thank you. I tried installing learning-racer on a new JetBot with JetPack 4.5 in the Docker environment. Since it uses stable-baselines3, which requires torch 1.8.1 or above, I changed your setup.cfg, which sets torch==1.4.0 and torchvision==0.5.0 for ubuntu, to torch==1.9.0 and torchvision==1.10.0. But after build.sh, torch==1.7.0 and torchvision==0.8.0 were still installed, unchanged. I ran "sh enable.sh /home/jetbot" to create the container, but I got the error "docker: invalid reference format." The cause seemed to be the new enable.sh, which differed from the one I used last time. So I replaced it with the old one, and then it worked without error.
Then I entered the container and ran "racer --version" in the airc-rl-agent folder, but it gave a "core dump" error. Somehow pip, which I had changed from 9.3.1 to 21.3.1 before the build, was back at 9.3.1. So I reinstalled pip 21.3.1 and ran "pip3 install ." in the airc-rl-agent folder. Now "racer --version" gives "learning_racer version 1.5.1" along with the error message "nvbuf utils: Could not get EGL display connection."
Then I tried to install torch==1.10.0, since stable-baselines3 requires 1.8.1 or above. However, after the installation attempt I only got torch==1.8.0; I could not install torch==1.10.0, so I gave up on installing a newer torch. Instead, I installed torchvision==0.9.0 following the link you suggested on your GitHub. The torchvision==0.9.0 installation took more than 30 minutes. After importing torch, I checked torch, CUDA and cuDNN. All were good.
Finally I tested learning-racer with "racer --version", but it still shows the error "nvbuf utils: Could not get EGL display connection". I still wonder why torch==1.10.0 could not be installed even though I followed the instructions at the link you suggested: "https://forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-10-now-available/72048"
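
The torch, CUDA and cuDNN checks mentioned above can be collected in one place. This is a minimal sketch under the assumption that torch imports cleanly; `torch_report` is a hypothetical helper name, and it simply returns None when torch is not installed.

```python
def torch_report():
    """Collect PyTorch, CUDA and cuDNN details, or return None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        # CUDA version torch was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        # cuDNN version as an integer; None when cuDNN is not usable.
        "cudnn": torch.backends.cudnn.version(),
    }


if __name__ == "__main__":
    print(torch_report())
```

Running it inside the container, rather than on the host, shows the torch build that racer actually uses.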
I updated the docker branch. Please follow the instructions in the README. I'm using the latest JetBot SD card image. About the "nvbuf utils: Could not get EGL display connection." error: I think this message can be ignored. The error shows in my environment too, but the JetBot functions work fine.
In addition, you need to train your own VAE model. With pytorch==1.7.0, please modify the last line of the training cell in VAE_CNN.ipynb.
# False to True.
torch.save(vae.state_dict(), 'vae.torch', _use_new_zipfile_serialization=True)
If you use pytorch >=1.8.1, I think this modification is not needed.
@masato-ka Thanks. Could you share your docker image so that I can try it for test?
I updated the Docker branch. Please refer to the branch.
Sorry, I missed an update. Please wait.
Please use this: https://github.com/masato-ka/airc-rl-agent/tree/AddFunction/docker I changed the Docker install script and setup.cfg.
@masato-ka I finally managed to run the "racer train" command, and the JetBot responds as you describe on your GitHub. I am still on torch==1.8.0, not 1.8.1 or above, but it works now. Thanks for your comment that the error message "nvbuf utils: Could not get EGL display connection" can be ignored. The error still occurs for me, but the JetBot works anyway.
Could you close this issue? If you have a new problem, you can open another issue.
@masato-ka Thank you for looking after this issue and for your generous advice and clear explanations. I appreciate your many comments. I will close this issue now.
When I ran "install_jetpack.sh" to install the code, I found that stable_baselines3==1.2.0 was installed, which requires pytorch>=1.8.1. I am trying to set up with JetBot JetPack 4.3, which contains CUDA 10.0.326. That CUDA version is only compatible with pytorch 1.3. To solve this, should I use stable_baselines3 1.0, which could be compatible with pytorch 1.4?