isl-org / OpenBot

OpenBot leverages smartphones as brains for low-cost robots. We have designed a small electric vehicle that costs about $50 and serves as a robot body. Our software stack for Android smartphones supports advanced robotics workloads such as person following and real-time autonomous navigation.
https://www.openbot.org
MIT License

Can't get Autopilot to train correctly #31

Closed chilipeppr closed 4 years ago

chilipeppr commented 4 years ago

Thanks again for the great work on this project.

I've spent a couple of days now trying to get the Autopilot to train, and nothing has quite worked for me. After training, post-processing, and recompiling the Android app, all I get when I turn the Network on is the OpenBot driving in a slow straight line and crashing into the wall.

Here's what I've gone through thus far...

  1. To train, I use the default Data Logger of crop_img. I have the Model set to AUTOPILOT_F and I set the Device to GPU. I leave Drive Mode set to Controller and then I turn on Logging from the Xbox controller by hitting the A button. I hear the MP3 file say "Logging started" and then I start driving around my kitchen.

WIN_20200913_18_17_07_Pro

  2. Once I've created about 5 minutes' worth of data from driving around, I turn off Logging by hitting A again on the Xbox controller. I hear the MP3 file play "Logging stopped". This part seems fine.

  3. I download the Zip file of the logging and place it in the policy folder. I'm showing the hierarchy here because your docs say to create a folder called "train" but the Python script looks for "train_data". I also initially didn't realize you had to manually create folders for your sets of log data. I now have that correct, so I get through the Jupyter Notebook process fine rather than failing at Step 10, which is what happens if the folder structure is wrong. image

  4. My images seem to be fine. The resolution is small at 256x96 but I presume that's the correct size for the crop_img default setting.

image image

  5. My sensor_data seems ok too.

image

The ctrlLog.txt seems ok (after I fixed that int problem that I posted earlier, which is now a closed issue). image

My indicatorLog.txt always looks like this. I suppose this could be a problem, as it's quite confusing what the indicatorLog.txt is even for. I realize hitting X, Y, or B sets the vehicleIndicator to -1, 0, or 1, but it's not clear to me why.

image

I realize the indicatorLog.txt gets merged with ctrlLog.txt and rgbFrames.txt into the following combined file, but everything seems fine, assuming a "cmd" of 1 from indicatorLog.txt is the value I want for post-processing. image
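Just to write down my mental model of that merge: I assume each frame timestamp simply gets paired with the most recent control and indicator values logged before it, something like this toy sketch (the column layout and variable names are my guesses, not the actual script):

    # Toy sketch of how I picture the matching (formats are assumptions, not the real script):
    # every frame timestamp gets the latest control / indicator value logged before it.
    ctrl_entries = [(1000, (120, 118)), (1500, (140, 90))]   # (timestamp, (left, right))
    cmd_entries = [(900, 0)]                                 # (timestamp, cmd)
    frame_times = [1100, 1600]

    def latest_before(entries, ts):
        value = entries[0][1]
        for t, v in entries:
            if t > ts:
                break
            value = v
        return value

    matched = [(ts, latest_before(ctrl_entries, ts), latest_before(cmd_entries, ts))
               for ts in frame_times]
    print(matched)  # [(1100, (120, 118), 0), (1600, (140, 90), 0)]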

  6. In the Jupyter Notebook everything seems to run correctly. It opens my manually created folders after I modified the Python code to point at them. It reads in my sample data. It removes the frames where the motors were at 0.

image

I get the correct amount of training frames and test frames.

image

In this part I am confused by the "Clipping input data" errors and by what Cmd means. It seems to relate to indicatorLog.txt, but I'm not sure what a -1, 0, or 1 would indicate in the caption above the images. My guess on the Label is that those are the motor values that would be generated during a Network run on the OpenBot for each image, but I'm not sure, since each one shows the same motor value of 0.23.

image

In Step 31 of the Jupyter Notebook the output seems fine.

image

In Step 33 the epochs all seem to have run correctly. They took quite a while to finish.

image

And in Steps 34 through 37 the graphs seem reasonable, but I'm not really sure what to expect here...

image

image

In Step 41 this seems to be ok, but it makes me think Pred means "prediction", i.e. the motor values. Still not sure what the Cmd and Label are then. image

  7. Once the best.tflite file is generated and placed into the "checkpoints" folder...

image

I then copy it to the "networks" folder for the Android app, rename it to "autopilot_float.tflite" and recompile the Android app.

image

Here is Android Studio recompiling.

image

That's about all I can think of to describe what I'm doing to try to get the training going. I would really love to get this working. Your help is greatly appreciated.

Thanks, John

thias15 commented 4 years ago

Hi John.

Thank you very much for your detailed issue, I really appreciate it! This makes it much easier to help. First the good news: your procedure is correct. Now let me clarify a few things.

1) Cmd: This corresponds to a high-level command such as "turn left/right" or "go straight" at the next intersection. It is encoded as -1: left, 0: straight, 1: right. As you pointed out, this command can be controlled with the X, Y, or B buttons on the game controller. If you have LEDs connected, it will also control the left/right indicator signals of the car. These commands are logged in the indicatorLog.txt file. During training, the network is conditioned on these commands. If you approach an intersection where the car could go left, straight, or right, it is not clear what it should do based on the image alone; this is where these commands come in to resolve the ambiguity. It seems that you just want the car to drive along a path in your house. In this case, I would recommend just keeping this cmd at 0. NOTE: This command should be the same when you test the policy on the car.
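To make "conditioned on these commands" a bit more concrete, here is a toy sketch of the idea (this is not the actual OpenBot network, just an illustration of feeding the cmd value alongside the image):

    # Toy illustration only (not the OpenBot architecture): the cmd value is an extra input
    # next to the image, and the output is the left/right control in [-1, 1].
    import tensorflow as tf

    image_in = tf.keras.Input(shape=(96, 256, 3))            # crop_img resolution, height x width assumed
    cmd_in = tf.keras.Input(shape=(1,))                      # -1: left, 0: straight, 1: right
    x = tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu")(image_in)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Concatenate()([x, cmd_in])
    out = tf.keras.layers.Dense(2, activation="tanh")(x)     # (left, right) controls
    model = tf.keras.Model([image_in, cmd_in], out)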

2) Label, Pred: These are the control signals of the car, mapped from -255,255 to -1,1. The label is obtained by logging the control that was used to drive the car. The prediction is what the network predicts to be the correct value given an image.

3) Clipping for image display: this is due to the data augmentation which results in some image values outside the valid range. You can just ignore this.

Now a few comments that will hopefully help you to get it to work. 1) The same motor value of 0.23 is a problem. This should not happen. Please try to delete the files in the sensordata folder that were generated ("matched..."). When you run the Jupyter notebook again, they will be regenerated. 2) In general the label values seem very low. We have used the "Fast" mode for data collection. I would recommend doing the same. Note that in lines 43-45 of the dataloader.py file, values are normalized into the range -1,1.

    def get_label(self, file_path):
        # Look up the frame's cmd and control values; the controls are scaled from [-255, 255] to [-1, 1].
        index = self.index_table.lookup(file_path)
        return self.cmd_values[index], self.label_values[index]/255

For the "Normal" mode, the maximum is capped at 192. For the "Slow" mode at 128.

3) Depending on the difficulty of the task, you may have to collect significantly more data. Could you describe your data collection process and the driving task in a bit more detail? Also, you may need to train for more epochs.

Hope this helps. Please keep me updated.

chilipeppr commented 4 years ago

That is super helpful. I think my earlier training might have been closer to correct, since I just left the Cmd at 0, but I did train in Normal mode, so all of the speeds being played back were really low. They did seem to move higher or lower as I manually moved the camera around to follow the path, but the values never got high enough to actually get the motors moving. I would say they lingered around 0.1 and maybe reached 0.2 as I moved the camera around. I even wrote code to amplify the speeds afterwards, but that didn't quite work. I think I'll try to just record in Fast mode and/or make those code changes in dataloader.py.

In terms of how I'm training, I'm just steering the car around my kitchen island over and over in a circle about 10 times to get a full logging to analyze. I figured I'd start simple and at least just get it going in a circle in one direction.


chilipeppr commented 4 years ago

Ok, here's a video of how I train. I used Fast mode (vs Normal or Slow). I set the Model to AUTOPILOT_F and used NNAPI.

https://photos.app.goo.gl/o6BtAHunDjtj8fMNA

And then here's a video of playing back that training. It still doesn't quite work, but I do seem to be getting more movement in the robot with training in Fast mode vs Normal.

https://photos.app.goo.gl/kCw4DpRN6vPpbtCcA

parixit commented 4 years ago

@chilipeppr super helpful video! It would be great if you could make a step-by-step video of your build for complete newbies.

chilipeppr commented 4 years ago

I would love to. I figure I might be one of the first to build one of these outside the Intel team since the public posting of the project, as I happened to have every piece needed already sitting in my home workshop, so there was no need to wait for shipping. It's hard getting this going without more YouTube videos!


parixit commented 4 years ago

Agreed! This project is daunting, but I want to do it together with my kids. I'm waiting on the parts and had our local library 3D print the printable ones (even they were interested in the project). I'll look forward to your videos, keep us posted!

chilipeppr commented 4 years ago

Is it possible that with my kitchen island I have to train each turn around the island as a right turn? Meaning set Cmd = 0 on the straight parts and then Cmd = 1 as I turn right 4 times?

thias15 commented 4 years ago

@chilipeppr If you would like to contribute build videos, that would be awesome, and we would be very happy to include them in the README for others! I realize that a lot of people require much more detailed instructions. We are working to provide more comprehensive documentation, but at the moment I have a lot of other engagements as well. For the time lapse video, I did record video of a complete build, but did not get a chance to edit it yet. If you like, I'd be happy to set up a quick call with you to coordinate.

thias15 commented 4 years ago

The predicted control values still seem to be too low. Could you post the figures at the end of training? I'm afraid the model either did not converge properly or overfit. The training and validation loss should both decrease and the direction and angle metrics should both increase.

The task of your choice should be learnable, and keeping the indicator command at 0 should be fine since you are driving along a fixed path. However, I suspect that you need to train the model for more epochs and that you need more training data. I would recommend the following: 1) Collect maybe 10 datasets with 5 loops each around the kitchen block. Start/stop of logging should ideally happen while driving along the trajectory. In the video you recorded, you drive somewhere else at the end before the logging is stopped; this could cause difficulty during training, especially if there is not a lot of data. 2) Take 8 of these datasets for training and the remaining two for validation. By monitoring the validation metrics, you should get a good idea of when the model is starting to work.

Collecting good/clean data is key to machine learning. I know it is not a lot of fun to collect such data, but it is what makes it work in the end! Keep up the great work. Looking forward to your next update (hopefully with the robot driving autonomously).

chilipeppr commented 4 years ago

Ok, I retrained with 10 datasets -- 8 for the training and 2 for the testing. Each run was 5 to 7 loops around the kitchen island. I turned the noise on for 3 of the dataset runs as well.

Here's a video of how I did the training. It's similar to my first post, but I started logging while in motion. I kept the Cmd=0 (default). https://www.youtube.com/watch?v=W7EHo0Jk02A

On the phone these are the zip files that I copied and extracted to the train_data and test_data folders. Notice they're all around 40 MB to 80 MB in size, which feels about right for a training session. Again, I used crop_img. image

Here are the 8 training datasets placed into the policy/dataset folder. image

Here are the 2 test datasets. image

I also drove at Normal speed, but changed the divider in dataloader.py from the default of 255 to 192, since the default assumes Fast mode. image

I also started/stopped logging by hitting the A button on the Xbox controller while the robot was in motion, so I wouldn't log any speeds of 0. You can see that for the 10 datasets I had almost no frames removed for speed 0. I'm actually surprised I ended up with any speed-0 frames in the output, because I don't recall stopping, so that's a bit of a concern.

image

I ended up with the most frames I've ever trained with.

image

I ended up with much higher numbers in the Label here than the 0.23 numbers you were worried about in my original post.

image

Here is the model.fit output. I'd love to understand what the loss, direction_metric, and angle_metric mean so I can tell whether this output seems reasonable or not.

image image

Here is the Evaluation data.

image image

I'm a little worried about these warnings, but maybe they're ok to ignore. image

And then here's the final output with predictions. The motor values in the predictions sure seem better.

image

However, when I go to run the Autopilot with this new model, it still fails. The only progress is that I now have motor movement; before, the motor values were so low that I had no movement at all. Here's a video of the Autopilot running and the robot not staying on the path, just running into chairs.

https://www.youtube.com/watch?v=a0-0lh7_j0E

chilipeppr commented 4 years ago

Hmm. Do I need to edit any of these variables? My crop_img images are 256x96.

image

chilipeppr commented 4 years ago

Well, apparently the crop_imgs must be correct, because I tried doing a training session with preview_img and got these errors when I went to train.

image

thias15 commented 4 years ago

1. crop_imgs is correct.
2. Can you try changing the batch size to a value as high as possible? We trained with a batch size of 128, but this will most likely require a GPU. If you cannot, you need to decrease the learning rate. From the plots it looks like it is too high for the dataset.
3. I'm not sure if rescaling all the values by 192 would work, since the car was never actually driven with the resulting values. Did you mix the "Normal" and "Fast" datasets?
4. In line 29, the fact that the label for all images is the same (0.88, 0.73) is definitely problematic as well. (The reverse label is generated by FLIP_AUG; see the sketch below.) For your task of going around the kitchen block in one direction you should probably set FLIP_AUG to False!
5. If you like, you can upload your dataset somewhere and I'll have a look. That would be much quicker to debug.
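To illustrate point 4: conceptually, FLIP_AUG mirrors each image and swaps the left/right control values (this is a simplified sketch, not the exact augmentation code), which turns your right-turn loop into left-turn data the car never actually drove:

    # Simplified sketch of the FLIP_AUG idea: mirror the image, swap left/right controls.
    import tensorflow as tf

    def flip_sample(image, label):
        flipped_image = tf.image.flip_left_right(image)
        flipped_label = tf.stack([label[1], label[0]])   # (left, right) -> (right, left)
        return flipped_image, flipped_label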

chilipeppr commented 4 years ago

Here is a link to download my dataset. It's the 10 sessions I ran yesterday based on your initial feedback. 8 of the sessions are in train_data as a Zip file. 2 of the sessions are in test_data as a Zip file.

https://drive.google.com/drive/folders/18MchBUtods4sRerSpaA6eTrtC9DPvpbd?usp=sharing

I just tried training the dataset again with your feedback above:

  1. I changed the batch size to 128. I have an Nvidia GeForce GTX on my Surface Book 3, so the GPU needed for that change is no problem.
  2. All of my training was done at the Normal speed, so the 192 divider should be ok. There is no Fast in this dataset.
  3. I turned off FLIP_AUG.

image

The results still didn't do anything for me. The robot still acts the same way. I did train for 20 epochs this time.

image image

The "best fit" was epoch 2 so that was a lot of wasted CPU/GPU going to 20 epochs.

image

thias15 commented 4 years ago

I will download the data and investigate. The fact that it reaches perfect validation metrics after two epochs and then completely fails is very strange. Did you also try to deploy the last.tflite model or run it on some test images to see if the predictions make sense?
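If you want to sanity-check a model offline first, something along these lines should work (the file name is a placeholder and the preprocessing/scaling is an assumption, so adjust it to match the notebook):

    # Rough sketch: run best.tflite on one logged crop image and print the predicted controls.
    import numpy as np
    import tensorflow as tf
    from PIL import Image

    interpreter = tf.lite.Interpreter(model_path="best.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    img = np.asarray(Image.open("some_frame.jpeg"), dtype=np.float32) / 255.0  # 96x256x3 assumed
    interpreter.set_tensor(inp["index"], img[np.newaxis, ...])
    interpreter.invoke()
    print(interpreter.get_tensor(out["index"]))  # predicted left/right control values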

thias15 commented 4 years ago

When I visualize your data, I do see variation in the labels as expected. Do you still see all the same labels?

Screenshot 2020-09-15 at 19 29 46

chilipeppr commented 4 years ago

Yeah, in my training my labels are still all the same. So this does seem messed up.

image

chilipeppr commented 4 years ago

On your question "Did you also try to deploy the last.tflite model": I did, and it was the same failure. It just kept showing a motor value around 0.75 on both left and right motors, sometimes jumping to 0.8, and it would just drive right into chairs/walls.

thias15 commented 4 years ago

This is definitely a problem. In the best case the network will learn this constant label. Did you make any changes to the code? I'm using the exact code from the GitHub repo with no changes (except FLIP_AUG = false in cell 21). In case you made changes, could you stash them or clone a fresh copy of the repo? Then put the same data you uploaded into the corresponding folders and see if you can reproduce what I showed in the comment above.

chilipeppr commented 4 years ago

I haven't changed any of the code. I did try that last run with the batch size changed and FLIP_AUG = false. I also tried epoch=20. I did change dataloader.py to divide by 192. Other than that the code is totally the same. I can try to re-check out the repo, but I don't think that's going to change much.

One thing I'm trying right now is to create a new conda environment with tensorflow instead of tensorflow-gpu as the library.

chilipeppr commented 4 years ago

Why do I get clipping errors and you don't for utils.show_train_batch?

thias15 commented 4 years ago

I also get the clipping errors; I just scrolled down so more images with labels are visible. I'm currently running TensorFlow on the CPU of my laptop without a GPU, so it will take some time, but it should not make any difference. For the paper, all experiments were performed on a workstation with a GPU. One difference is that I have only used Mac and Linux. Maybe there is a problem on Windows with the way the labels are looked up? From the screenshots it seems you're on Windows.

thias15 commented 4 years ago

One thing you could try is running everything in the Linux subsystem of Windows.

chilipeppr commented 4 years ago

Yes, I'm on Windows. Surface Book 3 with Nvidia GPU.

thias15 commented 4 years ago

I'll update you in about 30-60 minutes regarding training progress. But it seems that your issue is the label mapping. I suspect at this point it is related to Windows. As I mentioned, you could try to run the code in the Linux subsystem in Windows. I will also see if I can run it in a VM or set up a Windows environment for testing.

chilipeppr commented 4 years ago

I'm wondering, if you get a final best.tflite file out of your run if you could send that to me to try out on the robot.

I hear you on the label mapping. Could this possibly be something as dumb as Windows doing CR/LF and Mac/Linux using LF?
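A quick way for me to check that would be something like this (pointed at one of the log files in my sensor_data folders):

    # Print whether the log file contains Windows-style CR/LF line endings.
    with open("ctrlLog.txt", "rb") as f:
        print(b"\r\n" in f.read())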

thias15 commented 4 years ago

Hello. It finished training for 10 epochs now. The plots look reasonable, so why don't you give it a try? To achieve good performance, some hyperparameter tuning, more data, and more training time are usually needed. But let's see. best.tflite.zip

thias15 commented 4 years ago

notebook.html.zip This is the complete output of my Jupyter notebook to give you some idea of how the output should look. When I get a chance, I will explore the issue you encounter in a Windows environment. It could be something like CR/LF vs LF, but since the code relies on the os library, these types of things should be taken care of. I don't know, but will let you know what I discover. Thanks for your patience. I really want you to be able to train your own models and will try my best to figure out the issue you are encountering.

thias15 commented 4 years ago

Note that both files need to be unpacked. I had to zip them in order to upload them here.

chilipeppr commented 4 years ago

I just tried running your best.tflite and it does not work any better. The robot still runs into walls.

thias15 commented 4 years ago

Does it have a tendency to turn to the right as expected by the test images?

chilipeppr commented 4 years ago

Yes, it does seem to tend slightly to the right as it's driving.

thias15 commented 4 years ago

The predicted values are not quite large enough. More data, training time, and some hyperparameter tuning should greatly improve things. We also have not trained our models at the "Normal" speed before, so I'm not sure whether this could also have an effect.

thias15 commented 4 years ago

Can you check line 11 of your notebook? How do you define the base_dir?

chilipeppr commented 4 years ago

What if I try the Cmd = -1, 0, 1 approach by hitting X/Y/B at the appropriate times? I suppose I'd just be hitting B to turn right and Y to go straight, since I only take right turns. Should I make some training data that way and see how it goes?

Recall that I did make a few datasets earlier using Fast mode and still didn't see good results.

thias15 commented 4 years ago

@chilipeppr since your label values are messed up, I expect that none of your previous training runs produced a model that actually learned something useful. Adding these cmd values will likely not help, and it would also require you to apply those cmds at test time.

chilipeppr commented 4 years ago

base_dir = "./dataset"

thias15 commented 4 years ago

In Unix environments the current directory is defined as ./ whereas in Windows it is defined as .\

The current directory is denoted by "." in both cases, but the point is that the slashes used to define subsequent directories are different.
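One way to sidestep the separator question entirely (just a sketch; the notebook may organize its paths differently) is to let os.path.join build them:

    # Build the paths with os.path.join so the separator is correct on every OS.
    import os

    base_dir = os.path.join(".", "dataset")           # "./dataset" on Unix, ".\dataset" on Windows
    train_dir = os.path.join(base_dir, "train_data")
    print(base_dir, train_dir)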

thias15 commented 4 years ago

Can you try to change it, run it again and see if you get reasonable labels?

chilipeppr commented 4 years ago

Ok, trying right now...

chilipeppr commented 4 years ago

Nope. Same problem. Here's my line 11: image

image

chilipeppr commented 4 years ago

Are you sure my train_ds isn't just getting indexed such that the left/right pairs are next to each other in memory, and your next() and iter() are just returning things in an ordered way? But maybe on other computers the index is in a different order?

thias15 commented 4 years ago

I believe the issue is in the data loader. Basically, I build a dictionary that uses the frame paths as keys for the labels. This is not really the best way of doing it, but it worked fine. I suspect that using the path as a key leads to a problem on Windows.
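If that turns out to be the cause, one possible workaround (toy example only, not the repo code) is to normalize the separators on both sides of the lookup:

    # Toy example: use forward slashes for the dictionary keys and for every lookup.
    def normalize_key(path):
        return path.replace("\\", "/")

    raw_labels = {"dataset/run1/images/1.jpeg": (120, 118)}   # keys as written by the logs
    frame_path = "dataset\\run1\\images\\1.jpeg"              # path as Windows may return it
    print(raw_labels[normalize_key(frame_path)])              # -> (120, 118)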

thias15 commented 4 years ago

It's already late here. I'll see if I can get a Windows setup tomorrow to figure out the issue.

chilipeppr commented 4 years ago

Ok, looks like you were right. The left/right data is identical in each line that you read. See this debug output. I just threw in a debug statement at the end of that loop. image

image

chilipeppr commented 4 years ago

Wait, no, that's not true. The data is fine. The way I drive to collect data is in Video Game mode, where I use RT to drive straight without touching any other joysticks. Then, when I go to turn, I keep RT held while I nudge the joystick. So much of the data has matching left/right values, but not all of it; perhaps this is really just a coincidence. As you can see, when you move further into the data the left/right values differ.

image

thias15 commented 4 years ago

Yes, but the samples for visualization are randomly drawn. It is very unlikely that all labels are the same. And this does not happen in my case. There seems to be something wrong with the dictionary. From the debug output, could it be that base_dir needs to be .\\dataset or simply dataset if that works in Windows?

thias15 commented 4 years ago

One more idea. Can you replace line 24 in dataloader.py with this line and see if that fixes the labels?

    lines = data.replace("\\", "/").replace(",", " ").replace("\t", " ").split("\n")

chilipeppr commented 4 years ago

Hmm. I did that and now have this output, which looks like it matches Mac/Linux, but my images still all have the same labels.

Here's line 24. I also added a "\r" to be replaced with nothing, just in case that was messing things up. If there isn't a "\r", it gracefully moves on. image

Here's debug output. image

Sadly, the images have the same labels still.

image

thias15 commented 4 years ago

Hello. I'm working on the Windows setup. In the meantime, I have trained two models on my workstation with a GPU for 100 epochs. Can you check whether these models work better? bz16_lr1e-4.zip bz128_lr3e-4.zip