masato-ka / airc-rl-agent

AI RC Car Agent that uses deep reinforcement learning on Jetson Nano
MIT License

JetRacer is not following track - convergence of RL algorithm for JetRacer #38

Closed abdul-mannan-khan closed 2 years ago

abdul-mannan-khan commented 3 years ago

Thank you for sharing this repository. It is really interesting. I have followed everything as you mentioned in the blog. However, even after 76 iterations, the robot cannot even follow a straight line. Here is my result after 76 iterations.


algorithm_results_20211103

Also, I have tried running JetRacer at different speeds. If I make the speed too low, it cannot move at a steep steering angle. I am also concerned about the frame rate. It takes about one minute between two iterations. Is it because of the image size? I am trying to understand the following message, which appears when I start training: frame_rate

I think this image may be too big to process quickly. Could you please help me? Any input or suggestion is appreciated. Thank you so much.

masato-ka commented 3 years ago

For successful learning, you need to create a VAE model for your course and adjust config.yaml.

  1. Did you create a VAE model for the course?
  2. Adjust the maximum speed and steering values of the Agent in config.yaml.
  3. Stop the episode before the Agent goes off course.

The GST line is a message output by GStreamer, the library that loads the camera images. It has nothing to do with learning.


abdul-mannan-khan commented 3 years ago

Thank you so much @masato-ka for your kind response. I appreciate it.

I had developed a VAE model for the course. I collected about 13,000 images of the course by moving the JetRacer in a zig-zag way. After your suggestion, I reduced the speed and re-ran the training. However, even after 112 episodes, I am still facing the problem and the robot is not able to track the course. Also, each episode takes almost one minute. When you implemented this algorithm, after how many episodes could your robot track the course? Here are the settings in my configuration file config.yml

SAC_SETTING:
  LOG_INTERVAL: 1
  VERBOSE: 1
  LEARNING_RATE: 3e-4
  ENT_COEF: 'auto_0.1'
  TRAIN_FREQ: 1
  BATCH_SIZE: 64
  GRADIENT_STEPS: 600
  LEARNING_STARTS: 300
  BUFFER_SIZE: 30000
  VARIANTS_SIZE: 32
  IMAGE_CHANNELS: 3
  GAMMA: 0.99
  TAU: 0.02
  USER_SDE_AT_WARMUP: true
  USER_SDE: true
  SDE_SAMPLE_FREQ: 64

  #HyperParameter for Reward
REWARD_SETTING:
  REWARD_CRASH: -10
  CRASH_REWARD_WEIGHT: 5
  THROTTLE_REWARD_WEIGHT: 0.1

#AGENT_SETTING:
#  # Agent settings
#  N_COMMAND_HISTORY: 10
#  MIN_STEERING: -1.0
#  MAX_STEERING: 1.0
#  MIN_THROTTLE: 0.3
#  MAX_THROTTLE: 0.5
#  MAX_STEERING_DIFF: 0.15

#s#JetBot Reference
#AGENT_SETTING:
#  N_COMMAND_HISTORY: 10
#  MIN_STEERING: -0.5
#  MAX_STEERING: 0.5
#  MIN_THROTTLE: 0.3
#  MAX_THROTTLE: 0.5
#  MAX_STEERING_DIFF: 0.05

AGENT_SETTING:
  # Agent settings
  N_COMMAND_HISTORY: 20
  MIN_STEERING: -1.0
  MAX_STEERING: 1.0
  MIN_THROTTLE: 0.25 # 0.4
  MAX_THROTTLE: 0.28 # 0.9
  MAX_STEERING_DIFF: 0.75 #0.35

JETRACER_SETTING:
  STEERING_CHANNEL: 0
  THROTTLE_CHANNEL: 1
  STEERING_GAIN: 0.9
  THROTTLE_GAIN: 0.6

Should I change the steering gain? Is it too high? Also, did you use negative gains in JETRACER_SETTING? Was there any reason for that? Any suggestion will be helpful. Thank you.
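As a rough illustration of how the AGENT_SETTING values above could interact, the sketch below assumes that MAX_STEERING_DIFF caps the change in steering between consecutive steps and that MIN_STEERING/MAX_STEERING bound the final command. This is an assumption based on the parameter names, not a copy of the repository's code, and `limit_steering` is a hypothetical helper.

```python
# Hypothetical helper: not taken from airc-rl-agent.
# Assumes MAX_STEERING_DIFF bounds the per-step change in steering.

def limit_steering(previous, requested, max_diff, min_steering=-1.0, max_steering=1.0):
    """Return the steering command after rate limiting and range clipping."""
    # Limit how far the command may move away from the previous one.
    diff = max(-max_diff, min(max_diff, requested - previous))
    # Keep the result inside the allowed steering range.
    return max(min_steering, min(max_steering, previous + diff))


# With MAX_STEERING_DIFF = 0.75 the command can cover most of the range in one step.
wide = limit_steering(previous=0.0, requested=1.0, max_diff=0.75)
# With the JetBot reference value 0.05 the same request moves the steering only slightly.
narrow = limit_steering(previous=0.0, requested=1.0, max_diff=0.05)
print(wide, narrow)
```

If the assumption holds, a large MAX_STEERING_DIFF lets the agent swing across most of the steering range in a single step, which widens what it has to explore. A smaller value restricts each step to a gentle correction.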

gwiheo commented 3 years ago

Dear abdul-mannan-khan, I am having trouble installing on my JetBot following masato-ka's GitHub instructions. Could you share your installation info? I wonder which version of JetPack you use, for example JetBot JetPack 4.2 or 4.3? What about the PyTorch, stable_baselines, and CUDA versions?

abdul-mannan-khan commented 3 years ago

@gwiheo I understand your problem. It took me about ten days to figure this out. I am using Jetson SDK Manager 4.5. Next, I stopped all the installations that are called in the bash file. Next, I installed jetpack. It was installing PyTorch 1.10 or something. I found on the NVIDIA website that they provide a different working version of PyTorch for Jetson Nano, so I installed it from there. Only then could I work with it. I hope you will understand.

gwiheo commented 3 years ago

@abdul-mannan-khan Thank you for the quick response. Does "I have stopped all installations which are being called in bash file." mean that you installed each of the following libraries one by one?

sudo apt install -y liblapack-dev python3-scipy libfreetype6-dev python3-pandas
sudo pip3 install Cython gym git+https://github.com/tawnkramer/gym-donkeycar.git#egg=gym-donkeycar

Does "I installed jetpack." mean running the following?

sudo pip3 install .[jetpack]

abdul-mannan-khan commented 3 years ago

@masato-ka Could you please respond to the problem? Thank you for your help.

masato-ka commented 3 years ago

Sorry, I've been busy at work for the past few weeks.

Could you share a video recording of the JetRacer while it is learning? When the steering range is narrow and the speed is slow, it is easier for the agent to learn a policy, because the exploration space is small.

For the training data of the VAE model, try to use images captured while driving along the course. Zigzag is recommended, but a normal run may also work well. Make sure that the VAE model can generate images well.

abdul-mannan-khan commented 3 years ago

@masato-ka, Thank you so much for your prompt response. I highly appreciate it. After going through your comment, I feel that the problem is in the VAE model. I have made a video of vae_viewer.ipynb. Here is what is being generated by vae.torch

https://user-images.githubusercontent.com/22067958/140876583-11848f13-5bab-4703-99d1-ff99551d7de6.mp4

Could you please share your thoughts about it? Do you think I need to make the model again? Thank you for your help.

masato-ka commented 3 years ago

Yes, you need to make the VAE model again. I think this model cannot reconstruct the racetrack line. This means the model has not acquired a good latent variable for agent training.

abdul-mannan-khan commented 3 years ago

Hello @masato-ka, thank you for your answer. I tried to train the model again, but the results are similar to those shown in the previous post. I have tried changing image sizes, but Colab was running out of memory. Could you give any suggestions for obtaining an improved VAE model? Thank you.

masato-ka commented 3 years ago

Image size is not the problem. In the TensorBoard view in VAE_CNN.ipynb, were the reconstructed images shown?

masato-ka commented 2 years ago

Did you solve this problem?

I think #40 is the same problem.

I will try to solve this. Can you share your dataset and trained model?

Also, please tell me your hardware information (USB camera or CSI camera?).

abdul-mannan-khan commented 2 years ago

Thank you for your comment. I could not solve it yet. It would be really nice if you could spare some of your time and shed some light on this problem. Here is my dataset and here is the model. Also, I am using IMX219-160 Camera and it is CSI.

I really owe you for this one. Thank you @masato-ka

masato-ka commented 2 years ago

I trained the model for 300 epochs. Could you try this model? https://drive.google.com/file/d/1-JC6aPSgpEFXdmtyuHLUqhCAj8dwXchT/view?usp=sharing

These are the reconstructed images at 300 epochs.

Screenshot 2021-11-23 20 37 53

abdul-mannan-khan commented 2 years ago

@masato-ka You have done a great favor to me. I have no words to pay you my gratitude. I got it. I have tested it for my system. However, it is still showing a problem. Here is the video.

https://user-images.githubusercontent.com/22067958/143200905-a4ea42c5-15c7-463d-bcf1-7e6fcdd11aa3.mp4

but I got it. I shall train it for more than 1000 epochs. I think in this way, it will perform better. Thank you so much,

abdul-mannan-khan commented 2 years ago

I tried with about 1000 epochs. It still did not work. I think I need to collect data again and try it again. It should work. I am really grateful to Masato for all of his help and time. I never expected that someone would spare some time for a random person and help him implement the code. I have no words and there is no way I can repay this favor. Thank you so much.

masato-ka commented 2 years ago

I hope you don't mind, and I'm glad you're interested in this project.

  1. The reconstruction of the VAE for 1000 epochs is incomplete, but the agent learning may work.

  2. Your course's contrast is poor. (The color contrast of the white and yellow lines is unclear, I think.) Last week I tried the VAE on a course with clear contrast. https://twitter.com/masato_ka/status/1463099742255075329

  3. Let the agent learn on a straight-line-only course first. Learning is easy if it only has to follow a straight line.

abdul-mannan-khan commented 2 years ago

Thank you @masato-ka for your kind response. I have watched your video on Twitter. Too bad, I don't have an account. Anyway, it is really good and it seems that it works. I tried again today. I collected about 13,000 images today and trained vae.torch for 1000 epochs. It still did not work. I guess I need to improve the code for color contrast. One more thing: how do you know that the reconstruction of the VAE for 1000 epochs is incomplete? Thank you for your response.

abdul-mannan-khan commented 2 years ago

OK. I changed the contrast and tried to obtain a model, and it still did not work. Here is the code to change the contrast

def rgb8_to_jpeg(image):
    brightness = 20
    contrast = 180
    image1 = image * (contrast/127+1) - contrast + brightness
    return bytes(cv2.imencode('.jpg', image1)[1])

The response is still similar to this. It was fluctuating, and it was still not able to reconstruct the line clearly.
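One thing worth checking in the snippet above: with contrast = 180 and brightness = 20 the mapping is value * (180/127 + 1) - 160, which pushes many pixel values outside the 0 to 255 range. The pure-Python sketch below applies the same formula to single values and clips the result, so the effect on dark and bright pixels is visible. `adjust_pixel` is a hypothetical name, no OpenCV is used, and how cv2.imencode treats unclipped float input is not covered here.

```python
def adjust_pixel(value, brightness=20, contrast=180):
    """Apply the thread's contrast formula to one 8-bit value and clip to 0..255."""
    adjusted = value * (contrast / 127 + 1) - contrast + brightness
    return int(max(0, min(255, round(adjusted))))


# Print the mapping for a few sample intensities.
for v in (0, 50, 100, 150, 200, 255):
    print(v, "->", adjust_pixel(v))
```

Under this mapping, inputs below roughly 66 saturate to 0 and inputs above roughly 172 saturate to 255, so a faint line may be sharpened or wiped out depending on its brightness relative to the floor.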

masato-ka commented 2 years ago

Could you test by changing the course physically? If you have white tape and black paper or cardboard, you can make one with that.

It is difficult for me to check the quality of a VAE model. I always check how the image displayed in the viewer changes as I change the latent space (course on a curve and on a straight).

Right now your reconstructed image is not good, but I still recommend trying agent learning.

abdul-mannan-khan commented 2 years ago

@masato-ka I shall try changing the map physically as well. I think the state representation learning is not good. I have tried agent learning, but it does not work. The car just moves randomly here and there. This is only because the state representation does not see any line.

abdul-mannan-khan commented 2 years ago

I tried to run it again, but it did not work, as it is difficult to reconstruct lines with the current VAE model. Since the main problem is in the VAE model, do you have any suggestions for building it? Is there any other repository that you believe would be good? I would try that. Thank you for your help.

masato-ka commented 2 years ago

I don't know of any other repository. I think there is something wrong with your environment, but it is difficult to pinpoint it. In the video you shared before, the lines were reconstructed in the center of the image with no noise. You may want to try correcting the noise in your camera.

https://jonathantse.medium.com/fix-pink-tint-on-jetson-nano-wide-angle-camera-a8ce5fbd797f

abdul-mannan-khan commented 2 years ago

Thank you so much @masato-ka for sharing this solution to fix pink tint for jetson nano. I shall try this one and try again.

Based on your comment, I got some work to do

  1. Fix pink tint,
  2. Read about image reconstruction using TensorBoard tab and check what it shows

I am really grateful to you for your help.

abdul-mannan-khan commented 2 years ago

I have gone through TensorBoard. I checked it. Here is what I did.

I ran Google Colab for 50 epoch as shown here Screenshot from 2021-12-10 18-13-57

Then, I ran TensorBoard tab and got this. Screenshot from 2021-12-10 18-15-38

I have also made a video for reference here.

https://user-images.githubusercontent.com/22067958/145548565-f6630e89-4a46-4b0f-a41b-cee009afad82.mp4

According to TensorBoard, the results seem OK. What is your suggestion? Again, thank you so much for your time and efforts.

masato-ka commented 2 years ago

Could you apply this patch to the VAE loss function?

We have made a change to remove the beta coefficient from the last line. We had included the beta coefficient to increase the independence of each latent variable, but you can try it once without it. Beta VAE is known to blur the reconstructed image.

    def loss_fn(self, images, reconst, mean, logvar):
        KL = -0.5 * torch.sum((1 + logvar - mean.pow(2) - logvar.exp()), dim=0)
        KL = torch.mean(KL)
        reconstruction = F.binary_cross_entropy(reconst.view(-1,38400), images.view(-1, 38400), reduction='sum') 
        return reconstruction + KL
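
To make the role of the beta coefficient concrete, here is a small pure-Python sketch of the same kind of KL term for a single latent vector, with an optional weight. The `beta` argument and the function names are assumptions for illustration; the repository's previous loss is not shown in this thread, and the patch above works on batched tensors rather than lists.

```python
import math


def kl_divergence(mean, logvar):
    """KL(N(mean, exp(logvar)) || N(0, 1)) summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mean, logvar))


def vae_loss(reconstruction, mean, logvar, beta=1.0):
    """Reconstruction term plus a beta-weighted KL term (beta=1 is the plain VAE)."""
    return reconstruction + beta * kl_divergence(mean, logvar)


mean, logvar = [0.5, -0.2], [0.1, -0.3]
plain = vae_loss(100.0, mean, logvar)               # patched form: no extra weight
weighted = vae_loss(100.0, mean, logvar, beta=4.0)  # hypothetical beta > 1
print(plain, weighted)
```

With beta above 1 the KL term pulls the posterior harder toward the prior, trading reconstruction detail for more independent latents, which is the trade-off described in the comment above.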

masato-ka commented 2 years ago

I checked my VAE model and found an error.

Modify the VAE class as in the code below. Remove F.softplus.

    def bottleneck(self, h):
        mu, logvar = self.fc1(h), self.fc2(h)#F.softplus(self.fc2(h))
        z = self.reparameterize(mu, logvar)
        return z, mu, logvar
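
A possible reason the F.softplus wrapper matters (this is an interpretation, not something stated in the thread): softplus is always positive, so a log-variance passed through it can never go below zero, which keeps the encoder's variance above 1. The sketch below checks this numerically with the standard library only.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


for raw in (-5.0, -1.0, 0.0, 2.0):
    constrained = softplus(raw)
    # Columns: raw logvar, variance from raw logvar, variance from softplus(logvar).
    print(raw, math.exp(raw), math.exp(constrained))
```

With the raw logvar the variance can shrink toward 0 for confident encodings. With softplus applied it stays above 1, so sampled latents stay noisy no matter what the encoder outputs.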

abdul-mannan-khan commented 2 years ago

@masato-ka Thank you so much for your reply. I was super busy last week in office work. I am back on Jetracer. I shall follow your instructions and update you as soon as possible now. I am very hopeful that with you, this will be implemented on JetRacer. Thank you for taking interest.

masato-ka commented 2 years ago

The release-1.6.0 update includes a fix for the VAE problem and for vae_viewer.ipynb. I'm not sure whether it will solve your problem exactly, but I recommend using the new version (release-v1.6.0).

In the new install scripts, the version of stable_baselines3 is restricted to 1.1.0 so that torch>=1.4.0 can be used. Thus, you can install the racer command easily without installing torch 1.8.0.

abdul-mannan-khan commented 2 years ago

Dear @masato-ka, I am really grateful to you for your help. I cannot express my gratitude in words. I am super excited to try your updated release. I am already working on it and I hope to finish implementing it by tomorrow. I am currently training on the images using Google Colab.

I am sorry, I could not follow your instructions; I was too busy in submitting year-end research report.

I appreciate your precious time. Thank you so much.

abdul-mannan-khan commented 2 years ago

Dear @masato-ka, I hope you are well. I tried using the updated version and I got the following results. Screenshot from 2022-01-05 11-42-59 Here is my TensorBoard result

Screenshot from 2022-01-05 17-05-25

I am still training my JetRacer and I am not seeing a major difference in tracking. @gwiheo could you please share your results for your map, as we both have the same map? Thank you.

masato-ka commented 2 years ago

It is strange. In your vae_viewer.ipynb result, the black pin is reconstructed completely, but the line is completely ignored. In addition, the loss value has not decreased enough. Did you update VAE_CNN.ipynb? Could you share how you set up the JetRacer software? Did you use the original SD card image from NVIDIA?

masato-ka commented 2 years ago

I will try to debug this issue. Could you share a video of the course recorded from your JetRacer? In addition, please share your latest dataset.

My plan is to train a VAE model on the latest dataset and check the result on Jetson Nano, using the course video as test data. If the VAE or the course has a problem, this test will give the same result as in your environment. However, if I get a correct reconstructed image, the VAE and the course are not the problem.

gwiheo commented 2 years ago

@abdul-mannan-khan Here is my Tensorboard image for Loss. image

masato-ka commented 2 years ago

@abdul-mannan-khan Could you re-sample the dataset by driving your JetRacer around the course under manual (remote) control?

abdul-mannan-khan commented 2 years ago

@masato-ka Thank you so much. Here is my data. I have also tried training for 300 epochs and it gave me better results. Screenshot from 2022-01-06 18-35-45

After training, I used vae_viewer.ipynb and here are my results.

https://user-images.githubusercontent.com/22067958/148363745-8608066d-3e2e-4c22-82e7-026101989a6a.mp4

As you can see, it creates lines. So, after that I tried running the algorithm. I ran it for about 32 episodes but it did not converge. JetRacer was moving randomly. I am trying different options, such as retraining in Google Colab for about 800 epochs. Also, I am thinking of reducing the speed and trying again.

I appreciate your keen interest and prompt response. It helps me a lot. I am very grateful to you.

PS: I used waveshare version as described here. I did not use original SD card by NVIDIA.

masato-ka commented 2 years ago

I'm happy with this result.

Could you try the TensorBoard Projector tab to evaluate your VAE model? Please see "check and evaluation" in the README.

abdul-mannan-khan commented 2 years ago

I am sorry @masato-ka. I could not work on it today. I tried to run my Google Colab but it is always disconnecting. I hope to work on it on Monday. Thank you for your consistent support.

abdul-mannan-khan commented 2 years ago

Just an update. I tried running my VAE_CNN.ipynb for better training, but Google Colab keeps disconnecting. I am thinking of finding some other way. Thank you for your help.

masato-ka commented 2 years ago

I think your Google Colab account has reached the limit of the free tier. You would need to upgrade to a paid Colab plan if you have the budget.

abdul-mannan-khan commented 2 years ago

Thank you @masato-ka for paying close attention. I appreciate it. I shall implement it. I tried paying for Google Colab, but the paid version is only available in the US and Canada. I tried running the code on my machine, but I am getting an error in training that my GPU is running out of memory. It is because I have limited space in my usr directory. I am backing up all my files and then I shall increase the usr space. I hope it will work. Otherwise, I shall access a server computer from someone else.

abdul-mannan-khan commented 2 years ago

Dear @masato-ka, I hope you are doing well. I really appreciate your keen interest in helping me. Here is my short update. I trained VAE_CNN.ipynb for 2000 epochs (the notebook we need to run using Google Colab). Here are my results. (The video was too big, so I am simply attaching a zipped file.) response.zip

I tried vae_viewer.ipynb. Here are the results

https://user-images.githubusercontent.com/22067958/150948970-16249d4e-3094-4faa-9133-79e9c1a2920b.mp4

I trained vae.torch on a very powerful computer. I don't know what is wrong. I am guessing that I am doing something wrong. I am collecting data by moving my car by hand while the camera is taking pictures. Do you think this could be a problem? It is not supposed to be, as it should not matter how I am collecting data. But do you think I should move the car via remote control and take pictures in some other way? I think there is something wrong with my dataset. I don't know what else could be wrong.

I seek your advice. Thank you.

masato-ka commented 2 years ago

When you collect data by moving the car by hand, is the car sitting on the ground? If the camera posture differs between the dataset and the test data, the VAE cannot work well. Thus, I recommend collecting data with a remote controller. Did you lift or tilt the car to get the data?

If so, that would explain why the reconstruction works well on TensorBoard but not on the actual device. Wouldn't the results be different if, with your current model, you put the car in the same posture as during data acquisition?

abdul-mannan-khan commented 2 years ago

I am also getting this idea. I shall try this one. Thank you for your answer @masato-ka

abdul-mannan-khan commented 2 years ago

Dear @masato-ka, It has been long time since we communicated on the topic. I hope you are doing well in your job and life.

Sorry for the late follow-up. I am not going to give up until it works. I got busy with my work and now I am back. I collected data again and I am trying to run your updated version of the Google Colab notebook. Here, I am getting an error in the following section of the code

from sklearn.cluster import KMeans
vae.eval()

latent_spaces = None
for idx,(images, _) in enumerate(dataloader):
    images = images.to(device)
    z, _, _ = vae.encode(images)
    z = z.detach().cpu().numpy()
    if latent_spaces is None:
      latent_spaces = z.copy()
    else:
      latent_spaces = np.append(latent_spaces, z, axis=0)
    if len(latent_spaces) > 5000:
        break

images = vae.decode(torch.Tensor(latent_spaces).to(device))
#import torch.nn.functional as F
images = F.interpolate(images, size=(40, 40), mode='bilinear', align_corners=False)

kmeans_model = KMeans(n_clusters=5, verbose=0, n_init=10)
labels = kmeans_model.fit_predict(latent_spaces)

writer.add_embedding(mat=latent_spaces, metadata=labels, label_img=images)
writer.close()

Here is the error I am receiving.

AttributeError                            Traceback (most recent call last)
Input In [17], in <cell line: 17>()
     14         break
     16 images = vae.decode(torch.Tensor(latent_spaces).to(device))
---> 17 images = F.interpolate(images, size=(40, 40), mode='bilinear', align_corners=False)
     19 kmeans_model = KMeans(n_clusters=5, verbose=0, n_init=10)
     20 labels = kmeans_model.fit_predict(latent_spaces)

File ~/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py:3822, in interpolate(input, size, scale_factor, mode, align_corners, recompute_scale_factor, antialias)
   3819     if align_corners is None:
   3820         align_corners = False
-> 3822 dim = input.dim() - 2  # Number of spatial dimensions.
   3824 # Process size and scale_factor.  Validate that exactly one is set.
   3825 # Validate its length if it is a list, or expand it if it is a scalar.
   3826 # After this block, exactly one of output_size and scale_factors will
   3827 # be non-None, and it will be a list (or tuple).
   3828 if size is not None and scale_factor is not None:

AttributeError: 'tuple' object has no attribute 'dim'

Just for clarity, I am placing a screenshot of the error here as well.

Screenshot from 2022-06-23 18-54-23

I tried googling this error and read a few blogs here and there. It seems that it has something to do with the decode function of the VAE class, particularly the following line

images = vae.decode(torch.Tensor(latent_spaces).to(device))

Could you please check and guide? I appreciate your help. Thank you.

By the way, I am planning a visit to Nagoya, Japan on 28-31 August, 2022. Would you have some time? I wish we could have a meeting.

abdul-mannan-khan commented 2 years ago

I think I have figured this out. Here is the updated code; please correct me if it is wrong. Thank you. We need to change the following line

images = vae.decode(torch.Tensor(latent_spaces).to(device))

to

images, states = vae.decode(torch.Tensor(latent_spaces).to(device))
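
The change above works because, in this version of the notebook, vae.decode apparently returns a tuple rather than a single tensor (the second element is named `states` above; its meaning is not stated in the thread). A defensive pattern, sketched here with a stand-in class instead of the real VAE, is to accept either return shape. `FakeVAE` and `decode_images` are hypothetical names.

```python
class FakeVAE:
    """Stand-in for the real VAE; decode returns (images, extra) like the notebook."""

    def decode(self, z):
        images = [[v * 2 for v in row] for row in z]
        return images, None


def decode_images(model, z):
    """Return only the decoded images, whether decode yields a tuple or not."""
    out = model.decode(z)
    return out[0] if isinstance(out, tuple) else out


images = decode_images(FakeVAE(), [[1, 2], [3, 4]])
print(images)
```

Unpacking explicitly, as in the fix above, is simpler when the return shape is known. The wrapper is only useful if the same cell has to run against both older and newer versions of the VAE class.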

Thank you.

masato-ka commented 2 years ago

Thank you for your comment! I'll fix this.

abdul-mannan-khan commented 2 years ago

Dear @masato-ka, Thank you for your reply. I have an update. I finished training for 500 epochs using VAE_CNN.ipynb. Next, I ran vae_viewer.ipynb

After running the following block of the code from vae_viewer.ipynb

device = torch.device('cuda')
vae = VAE(image_channels=IMAGE_CHANNELS, z_dim=VARIANTS_SIZE)
vae.load_state_dict(torch.load(MODEL_PATH, map_location=torch.device(device)))
vae.to(device).eval()

I got the following error.


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-1f8c07196d46> in <module>
      1 device = torch.device('cuda')
      2 vae = VAE(image_channels=IMAGE_CHANNELS, z_dim=VARIANTS_SIZE)
----> 3 vae.load_state_dict(torch.load(MODEL_PATH, map_location=torch.device(device)))
      4 vae.to(device).eval()

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1043         if len(error_msgs) > 0:
   1044             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1045                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1046         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1047 

RuntimeError: Error(s) in loading state_dict for VAE:
    Missing key(s) in state_dict: "decoder.7.weight", "decoder.7.bias". 
    Unexpected key(s) in state_dict: "out1.0.weight", "out1.0.bias", "out2.0.weight", "out2.0.bias". 

I think that the vae.torch file is not properly generated by VAE_CNN.ipynb. Could you please have a look into VAE_CNN.ipynb? I appreciate your kind help. Thank you very much.

abdul-mannan-khan commented 2 years ago

OK. Never mind again. I solved the problem. Just change the line in vae_viewer.ipynb from

vae.load_state_dict(torch.load(MODEL_PATH, map_location=torch.device(device)))

to

vae.load_state_dict(torch.load(MODEL_PATH, map_location=torch.device(device)),strict=False)

Now, it is working. Thank you.
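
A caution on strict=False: it makes load_state_dict skip keys that do not match instead of raising an error. The layers listed as missing ("decoder.7.weight", "decoder.7.bias") therefore keep their freshly initialized weights, and the checkpoint's "out1.*" and "out2.*" tensors are ignored. That usually indicates that the VAE class in vae_viewer.ipynb differs from the one that produced vae.torch. The standard-library sketch below reproduces the key comparison from the error message. `diff_state_keys` is a hypothetical helper, and the key lists contain only the names quoted in the error plus one made-up shared key.

```python
def diff_state_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, mirroring the load_state_dict report."""
    model, ckpt = set(model_keys), set(checkpoint_keys)
    return sorted(model - ckpt), sorted(ckpt - model)


# "shared.weight" is a made-up placeholder for parameters present on both sides.
model_keys = ["shared.weight", "decoder.7.weight", "decoder.7.bias"]
checkpoint_keys = ["shared.weight", "out1.0.weight", "out1.0.bias",
                   "out2.0.weight", "out2.0.bias"]

missing, unexpected = diff_state_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

In PyTorch, load_state_dict(..., strict=False) returns an object with missing_keys and unexpected_keys fields, so printing that return value is a quick way to see what was skipped before trusting the reconstructions.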

masato-ka commented 2 years ago

@abdul-mannan-khan Thanks for your feedback. Very helpful for me.