joshi-bharat / deep_underwater_localization

Source Code for "DeepURL: Deep Pose Estimation Framework for Underwater Relative Localization", IROS 2020
GNU General Public License v3.0

About how to make labels for my own AUV model #4

Closed julingers closed 2 years ago

julingers commented 3 years ago

@joshi-bharat Sorry, I'm here again. Now I'm going to apply DeepURL to my own AUV, and I have some questions about making labels.

The aqua_glass_removed.ply file is only used during testing, isn't it? I don't see this file being used in train.py, so what does it do? I have my own SolidWorks model of my AUV, so do I just need to export it to PLY format and replace your PLY file in the code? I'm a little confused.

Another problem: if I get the pose from the simulation and the 3D coordinates of the 8 keypoints, do I still need to use the camera's intrinsic matrix to obtain the 2D projections? Or do I get the 2D projections of the 8 keypoints directly from the simulation? I'm a little confused.

One more question: the YOLOv3 checkpoint is also used in train.py. If I apply this to my own AUV, do I need to find a TensorFlow implementation of YOLOv3 and train it for object detection to obtain weights that recognize my own AUV?

I am looking forward to your reply. Thanks!

joshi-bharat commented 3 years ago

The 3D model is used to generate training data through simulation.

To generate the training data, we used an Unreal Engine simulator. Specifically, we load the 3D model and take pictures of the AUV from different viewpoints in simulation, so we know exactly the pose of the AUV (viewpoint) with respect to the camera. Then, projecting the 3D model with this pose gives the labels, as explained in https://github.com/joshi-bharat/deep_underwater_localization/blob/master/label_file_creation.md
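
For illustration, here is a minimal sketch (not code from this repo; the model extents, pose, and intrinsics below are placeholder values) of how known 3D keypoints can be projected into the image with OpenCV once the pose relative to the camera is known:

```python
# Minimal sketch with placeholder values: project the 8 corners of the model's
# 3D bounding box into the image given the AUV pose relative to the camera.
import numpy as np
import cv2

# 3D bounding-box corners in the model frame (placeholder half-extents, meters)
corners_3d = np.array([[x, y, z] for x in (-0.3, 0.3)
                                 for y in (-0.2, 0.2)
                                 for z in (-0.1, 0.1)], dtype=np.float64)

# Pose of the model in the camera frame (Rodrigues rotation vector + translation)
rvec = np.array([0.1, -0.2, 0.05])
tvec = np.array([0.0, 0.0, 2.0])

# Camera intrinsic matrix (placeholder focal lengths and principal point)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# The projected 2D pixel coordinates are what go into the label file as keypoints
corners_2d, _ = cv2.projectPoints(corners_3d, rvec, tvec, K, None)
print(corners_2d.reshape(-1, 2))
```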

The simulator gives us a pose and the corresponding image of the AUV projected onto random background images. The problem is that this software is proprietary to Independent Robotics and is not open source. In the simulator, you can specify how snapshots are taken and fix the camera matrix of the observing camera.

Regarding YOLOv3, the object detection code is already included here, so you just need AUV images with the correct pose.

julingers commented 3 years ago

> The 3D model is used to generate training data through simulation.
>
> To generate the training data, we used an Unreal Engine simulator. Specifically, we load the 3D model and take pictures of the AUV from different viewpoints in simulation, so we know exactly the pose of the AUV (viewpoint) with respect to the camera. Then, projecting the 3D model with this pose gives the labels, as explained in https://github.com/joshi-bharat/deep_underwater_localization/blob/master/label_file_creation.md
>
> The simulator gives us a pose and the corresponding image of the AUV projected onto random background images. The problem is that this software is proprietary to Independent Robotics and is not open source. In the simulator, you can specify how snapshots are taken and fix the camera matrix of the observing camera.
>
> Regarding YOLOv3, the object detection code is already included here, so you just need AUV images with the correct pose.

@joshi-bharat I really appreciate your guidance.

I found that the test results after training with your synthetic dataset are very bad, and I think there is something wrong with the line https://github.com/joshi-bharat/deep_underwater_localization/blob/master/model.py#L103: it should be feature_map_23 = slim.conv2d(feature_map_23, ...), while yours is feature_map_23 = slim.conv2d(feature_map_3, ...). Is that right?

And another question I want to ask is how to continue training from a checkpoint I have already trained. As I understand it, training currently starts from yolov3.ckpt. If I replace yolov3.ckpt with my own trained checkpoint, some errors occur, so what should I do to continue training from the results of the previous run?

joshi-bharat commented 3 years ago

I am not sure what you mean by the results being bad. Regarding https://github.com/joshi-bharat/deep_underwater_localization/blob/master/model.py#L103, can you specify what you think is wrong with this line?

You can load your own checkpoint from https://github.com/joshi-bharat/deep_underwater_localization/blob/e55c2738f16b2ef50697aa20152ac84b7f6c9637/args.py#L12
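
For reference, resuming from your own weights amounts to pointing that option at your checkpoint; a hedged illustration (the path below is hypothetical, and the exact option names and defaults live in args.py):

```python
# Illustrative only -- check args.py for the real option names and defaults.
# restore_path originally points at the pretrained YOLOv3 weights; to resume
# from your own training run, point it at your saved checkpoint prefix instead.
restore_path = './checkpoint/my_auv_run/model-epoch_120'  # hypothetical path
```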

julingers commented 3 years ago

> I am not sure what you mean by the results being bad. Regarding https://github.com/joshi-bharat/deep_underwater_localization/blob/master/model.py#L103, can you specify what you think is wrong with this line?

@joshi-bharat When I read the code in model.py, I felt a little confused about the network structure. Regarding the network structure of DeepURL, it is divided into YOLO detection and pose regression, right? But at https://github.com/joshi-bharat/deep_underwater_localization/blob/master/model.py#L103, the code shows that the output feature_map_23 is obtained by convolving feature_map_3, while feature_map_21 and feature_map_22 are produced differently. So I suspect it should be feature_map_23 = slim.conv2d(feature_map_23, ...), while your code has feature_map_23 = slim.conv2d(feature_map_3, ...). Is that right? I'm a little confused.

joshi-bharat commented 3 years ago

The network does not regress to the pose directly. The pose regression is somewhat similar to https://arxiv.org/abs/1711.08848. For example, feature_map_21 predicts the locations of the 2D keypoints (projections of the 3D model), along with the confidence of each prediction and the object class. That's why the size is 3 (x, y, conf) * no_of_vertices (8) + class (1). feature_map_22 and feature_map_23 give the same thing at different scales.
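
As a quick sanity check of that size (illustrative arithmetic only, mirroring the description above):

```python
# Per-prediction output size implied by the description above (illustrative only).
per_vertex = 3        # x, y, conf for each projected keypoint
num_vertices = 8      # corners of the 3D bounding box
num_classes = 1       # single AUV class
size = per_vertex * num_vertices + num_classes
print(size)           # 25 values per prediction, at each of the three scales
```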

How these predictions are used to calculate the loss is explained in https://github.com/joshi-bharat/deep_underwater_localization/blob/master/pose_loss.py.

julingers commented 3 years ago

> The network does not regress to the pose directly. The pose regression is somewhat similar to https://arxiv.org/abs/1711.08848. For example, feature_map_21 predicts the locations of the 2D keypoints (projections of the 3D model), along with the confidence of each prediction and the object class. That's why the size is 3 (x, y, conf) * no_of_vertices (8) + class (1). feature_map_22 and feature_map_23 give the same thing at different scales.
>
> How these predictions are used to calculate the loss is explained in https://github.com/joshi-bharat/deep_underwater_localization/blob/master/pose_loss.py.

Yeah, from the paper I know the net predicts the locations of the 2D keypoints and then recovers the pose with PnP. I'm just confused that feature_map_23 is obtained from feature_map_3 through a single conv2d, while feature_map_21 and feature_map_22 are obtained through a yolo_block and then reshaped to 3 (x, y, conf) * no_of_vertices (8) + class (1). That's what I mean. Forgive me if my explanation is not clear enough.

julingers commented 3 years ago

@joshi-bharat Thank you for your kind help. I will mention you in my graduation thesis in the future, if I graduate smoothly, hahaha.

joshi-bharat commented 3 years ago

At https://github.com/joshi-bharat/deep_underwater_localization/blob/e55c2738f16b2ef50697aa20152ac84b7f6c9637/model.py#L82, feature_map_21 is obtained from route_3. So these feature maps are obtained by branching from the Darknet-53 backbone.

julingers commented 3 years ago

> At https://github.com/joshi-bharat/deep_underwater_localization/blob/e55c2738f16b2ef50697aa20152ac84b7f6c9637/model.py#L82, feature_map_21 is obtained from route_3. So these feature maps are obtained by branching from the Darknet-53 backbone.

Yeah, and feature_map_22 is likewise obtained from route_2, but feature_map_23 in the code is obtained from feature_map_3 through a single conv2d, which is a little strange, right?

joshi-bharat commented 3 years ago

Yup, looks like you are correct. I might have somehow pushed this with the bug; it should be feature_map_23. I will update that, thanks.
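
For anyone following along, the fix amounts to something like the sketch below (not copied from model.py: the channel count, tensor shapes, and conv keyword arguments are placeholders; the point is only that the third prediction scale should convolve the intermediate tensor from its yolo_block rather than the raw backbone feature_map_3):

```python
import tensorflow as tf
slim = tf.contrib.slim

out_channels = 25  # placeholder head depth; use whatever model.py actually computes

# Stand-ins for the tensors involved (shapes are illustrative only):
feature_map_3 = tf.placeholder(tf.float32, [None, 52, 52, 256])   # backbone output
feature_map_23 = tf.placeholder(tf.float32, [None, 52, 52, 128])  # yolo_block output

# The buggy line convolved feature_map_3; the intended version convolves feature_map_23,
# matching how feature_map_21 and feature_map_22 are produced at the other two scales.
feature_map_23 = slim.conv2d(feature_map_23, out_channels, 1, stride=1,
                             normalizer_fn=None, activation_fn=None,
                             biases_initializer=tf.zeros_initializer())
```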

joshi-bharat commented 3 years ago

@julingers let me know how your training went

julingers commented 3 years ago

> @julingers let me know how your training went

I trained three times from start to finish. First, I did not change any of the default parameters, including feature_map_3, but when I test, the results from my trained weights are not good. I don't know why; maybe the loss is too large? Your provided weights give good results in all three tests (single image, image list, video), so I don't know where the problem is. (feature_map_3, batch size = 4, total epochs = 125, starting from yolov3.ckpt.)

Second, I changed feature_map_3 to feature_map_23 and trained again. (feature_map_23, batch size = 16, total epochs = 101, starting from yolov3.ckpt.)

Third, I changed the restore path and set restore_include and restore_exclude to None. Then I found the loss does not converge. (feature_map_3, batch size = 16, total epochs = 101, starting from the first run's epoch-120 checkpoint.)

Maybe I need to restore the global_step and change learning_rate_init, but that does not seem to work well. I found that when I restore the last training weights and learning_rate_init, the regression loss oscillates no matter how much I lower learning_rate_init, while at the same time the YOLO loss is very small. Can you help me? I need your help!

joshi-bharat commented 3 years ago

Did you train with synthetic or rendered images?

How many images were there in the synthetic dataset?


joshi-bharat commented 3 years ago

I already changed the line to feature_map_23. There is no restore path other than the YOLOv3 one, as there was no pre-training for the other layers.


julingers commented 3 years ago

Yup, I trained with the synthetic images in the synthetic folder I downloaded. After deleting some tags, I trained on 26,902 images from the synthetic folder.

joshi-bharat commented 3 years ago

There should actually be more. I will check tomorrow whether there should be more images.

Did you try just loading the checkpoint and doing inference?


julingers commented 3 years ago

> I already changed the line to feature_map_23. There is no restore path other than the YOLOv3 one, as there was no pre-training for the other layers.

Yup, I changed feature_map_3 to feature_map_23, trained starting from yolov3.ckpt, and then restored this new checkpoint.

julingers commented 3 years ago

> Did you try just loading the checkpoint and doing inference?

Yes, I ran inference using test_single_image, test_list_image, and test_video, but I just didn't get good results.

julingers commented 3 years ago

@joshi-bharat Hi, I want to ask how many images from the synthetic folder you used to train the checkpoint you provided, the one that gives good results. I wonder whether the number of training images influences the learning rate because of the piecewise constant decay schedule.

joshi-bharat commented 3 years ago

@julingers I was using 37,000 images to train. It looks like the zip file somehow ended up incomplete while it was being created or uploaded. I will try to upload the complete images tomorrow and let you know.

The piecewise learning rate schedule is based on the number of epochs: https://github.com/joshi-bharat/deep_underwater_localization/blob/d1b2bcb0b139b8937fc751419c82e4df33d40927/args.py#L43
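
For reference, an epoch-keyed piecewise-constant schedule of that kind can be expressed like this in TF 1.x (all numbers below are placeholders, not the values in args.py):

```python
import tensorflow as tf

# Placeholder values -- the real boundaries and rates are defined in args.py.
num_train_images = 37000
batch_size = 16
steps_per_epoch = num_train_images // batch_size

pw_boundaries_epochs = [60, 90]     # epochs at which the learning rate drops
pw_values = [1e-4, 3e-5, 1e-5]      # learning rate within each interval

global_step = tf.train.get_or_create_global_step()
boundaries = [e * steps_per_epoch for e in pw_boundaries_epochs]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, pw_values)
```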

julingers commented 3 years ago

@joshi-bharat Thanks! I will continue to try it.

julingers commented 3 years ago

@joshi-bharat About the provided checkpoint that gives good results, I'd like to ask what your final loss was. I wonder if my loss is too large after training 125 epochs. How small should the loss be so that I can get good inference results? Right now my loss is approximately 800, and it doesn't give good test results.

joshi-bharat commented 3 years ago

I actually don't remember what my final loss was. But I guess it was a single digit. I am not exactly sure.

joshi-bharat commented 3 years ago

Most likely I will put the complete data on Google Drive today. With the complete data, the list of training images should match.

julingers commented 3 years ago

Okay, my loss is still too big. I will try to train again. I hope I can get good results.

joshi-bharat commented 3 years ago

> Okay, my loss is still too big. I will try to train again. I hope I can get good results.

Uploaded the complete dataset. Now there should be 37,000 images.

julingers commented 3 years ago

> Uploaded the complete dataset. Now there should be 37,000 images.

@joshi-bharat I still want to ask a question. When I continue training from the latest checkpoint, my loss just doesn't converge.

When you were training before, did you try restoring a previous checkpoint whose loss was still relatively large and continuing training from it? Or did you start from yolov3.ckpt every time you trained? I found that the loss converged every time I trained from yolov3.ckpt, while training resumed from my latest checkpoint did not converge.

So I want to ask: if I restore a checkpoint, apart from restoring the global step and the initial learning rate, do any other parameters need to be changed to continue training from a pre-trained checkpoint?

joshi-bharat commented 3 years ago

If I remember correctly, I always trained from the YOLO checkpoint. I suggest trying to run for 200 epochs so that you won't need to restore a checkpoint.

julingers commented 3 years ago

In that case, each training session takes a very long time, because the dataset is so large. If I adjust one parameter, I need to wait a week to get the result.

joshi-bharat commented 3 years ago

I was running 125 epochs in a day on an RTX 2080, so it should not be that difficult. You can first try 125 epochs or so.

julingers commented 3 years ago

> I was running 125 epochs in a day on an RTX 2080, so it should not be that difficult. You can first try 125 epochs or so.

I vaguely remember that running 125 epochs took me more than 2 days on a Tesla P100 GPU (16 GB) with 26,902 synthetic images, and the final loss was approximately 800. Why is that? My batch size was 4, which I didn't change. Now, if I use 37,000 images and 200 epochs, it will take even more time.

joshi-bharat commented 3 years ago

Increase the batch size to 16 or even 32. Since the network uses batch normalization, it is always better to set the batch size to the maximum value your GPU memory can support.

This might also be the reason for the non-convergence, as a smaller batch size increases the variance of the training process. Keep increasing the batch size until you run out of memory, or check whether memory is full using the nvidia-smi tool.

julingers commented 3 years ago

Yup, I had already tried increasing the batch size to 16, which is the maximum my GPU can support. The situation now is that training only fails to converge when I resume from my pre-trained checkpoint, even if I also change the global step and the initial learning rate. Training from yolov3.ckpt gives a converging loss, but it is still a bit large, so I don't get good inference results.

joshi-bharat commented 3 years ago

> Yup, I had already tried increasing the batch size to 16, which is the maximum my GPU can support. The situation now is that training only fails to converge when I resume from my pre-trained checkpoint, even if I also change the global step and the initial learning rate. Training from yolov3.ckpt gives a converging loss, but it is still a bit large, so I don't get good inference results.

I am not sure why you are not getting good results. Looks like I was using a batch size of 8. I had pretty fast convergence as well. I think you should try to run for 100 epochs with all the data.

I do not have the setup to test things now as I am not working on this project currently.

julingers commented 3 years ago

@joshi-bharat Okay, thank you very much for your help. You have given me a lot of encouragement. Thanks!

julingers commented 3 years ago

@joshi-bharat Hi, I found a problem: the 3D bounding box center point is the first point on each line of your label files. I verified this by manually drawing the 3D bounding box with your draw_demo_img_corners method.

But in the training code, you take the first 8 points as the 8 corners to compute the loss. Refer to https://github.com/joshi-bharat/deep_underwater_localization/blob/master/pose_loss.py#L189-L190. So I think something is wrong.

Actually, the center point should be placed at the end of every line, while it is currently in first place.
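
If it helps, here is a hedged sketch (not the repo's tooling) of shifting the center point to the end of each label line; it assumes each line starts with NUM_PREFIX fields before the nine (x, y) keypoint pairs, and the file names are hypothetical, so adjust both to the actual label format:

```python
# Hedged sketch, not repo code: move the first (center) keypoint of every label
# line behind the 8 corner keypoints, so the first 8 points that pose_loss.py
# reads are the bounding-box corners. Adjust NUM_PREFIX and the file names to
# the real label format before using.
NUM_PREFIX = 1  # placeholder: number of fields before the keypoints on each line

def move_center_last(line):
    parts = line.split()
    prefix = parts[:NUM_PREFIX]
    center = parts[NUM_PREFIX:NUM_PREFIX + 2]        # (xc, yc), currently first
    corners = parts[NUM_PREFIX + 2:NUM_PREFIX + 18]  # 8 corner (x, y) pairs
    rest = parts[NUM_PREFIX + 18:]                   # anything after the keypoints
    return ' '.join(prefix + corners + center + rest)

with open('train_labels.txt') as src, open('train_labels_fixed.txt', 'w') as dst:
    for line in src:
        dst.write(move_center_last(line.strip()) + '\n')
```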

joshi-bharat commented 3 years ago

Looking at the code, the center should be at the end or can be removed entirely. I only put it there for reference.

I could not test this as I don't have the setup, but you can try testing it. I will change the labels if this works.

Thanks again.

julingers commented 2 years ago

@joshi-bharat It must have been the wrong location of the center point in the label files that caused the poor inference results. This saved result is from your checkpoint, using test_image_list.py.

And this result is from the checkpoint trained with 26,902 synthetic images.

The results of my trained checkpoint now seem fairly close to yours.

julingers commented 2 years ago

I have a question: can the final loss actually be reduced to single digits? Right now my batch_size is 16 and the loss is around 3000 at the end. How can I reduce my loss further, do you have any suggestions? Maybe I'm stuck in a local minimum, or perhaps the loss can only reach this value? I'm not sure.

This is the loss trend for the test results shown above this comment. I hope you can help me.

joshi-bharat commented 2 years ago

I am not sure whether I even got a single-digit loss. I believe the better way would be to just try it on the test dataset. I did not have to do anything else.