train parameters size mismatch

kebijuelun commented 2 years ago

I followed the testing steps but hit the following error. It seems that the saved model parameters do not match the model definition.

/data/github_code/ai-imu-dr/src/main_kitti.py in launch(args)
     29 
     30     if args.test_filter:
---> 31         test_filter(args, dataset)
     32 
     33     if args.results_filter:

/data/github_code/ai-imu-dr/src/main_kitti.py in test_filter(args, dataset)
    427     from IPython import embed; embed()
    428 
--> 429     torch_iekf.load(args, dataset)
    430     iekf.set_learned_covariance(torch_iekf)
    431 

/data/github_code/ai-imu-dr/src/utils_torch_filter.py in load(self, args, dataset)
    461         if os.path.isfile(path_iekf):
    462             mondict = torch.load(path_iekf)
--> 463             self.load_state_dict(mondict)
    464             cprint("IEKF nets loaded", 'green')
    465         else:

~/miniconda3/envs/dfvo/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    775         if len(error_msgs) > 0:
    776             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 777                                self.__class__.__name__, "\n\t".join(error_msgs)))
    778         return _IncompatibleKeys(missing_keys, unexpected_keys)
    779 

RuntimeError: Error(s) in loading state_dict for TORCHIEKF:
        Unexpected key(s) in state_dict: "mes_net.cov_net.8.weight", "mes_net.cov_net.8.bias", "mes_net.cov_net.12.weight", "mes_net.cov_net.12.bias", "mes_net.cov_net.16.weight", "mes_net.cov_net.16.bias". 
        size mismatch for mes_net.cov_net.4.weight: copying a param with shape torch.Size([64, 32, 5]) from checkpoint, the shape in current model is torch.Size([32, 32, 5]).
        size mismatch for mes_net.cov_net.4.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).

torch version

torch                              1.1.0     
torchvision                        0.3.0

kimms74 commented 2 years ago

Set continue_training to False in class KITTIArgs() of main_kitti.py.

hmf21 commented 2 years ago

@kimms74 I still get the same parameter mismatch error after setting continue_training to False in class KITTIArgs() of main_kitti.py. Do you have any other ideas about this problem?

lumyus commented 2 years ago

Same issue here

scott81321 commented 2 years ago

I also get something very similar:

RuntimeError: Error(s) in loading state_dict for TORCHIEKF:
        Unexpected key(s) in state_dict: "mes_net.cov_net.8.weight", "mes_net.cov_net.8.bias", "mes_net.cov_net.12.weight", "mes_net.cov_net.12.bias", "mes_net.cov_net.16.weight", "mes_net.cov_net.16.bias".
        size mismatch for mes_net.cov_net.4.weight: copying a param with shape torch.Size([64, 32, 5]) from checkpoint, the shape in current model is torch.Size([32, 32, 5]).
        size mismatch for mes_net.cov_net.4.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).

Part of the problem goes away if you adjust the layer sizes in MesNet, but so far I have not found the size adjustments that make it go away completely.

=> This happens when path_iekf points to an existing file, ../temp/iekfnets.p. If the file is not there, the program carries on and I still get the beautiful plot shown on GitHub, namely the route segment of 2011_09_30_drive_0028_extract.

lumyus commented 2 years ago

Same here. If you set "train_filter = 1" and then go back to "train_filter = 0", the IEKF nets are loaded, but that is not the trained model specified at the URL in the README. Changing the network parameters (shapes) did not work for me either. Any idea what is going on here, @mbrossar?

scott81321 commented 2 years ago

I am still working with the original defaults, train_filter = 0 and test_filter = 1. I cannot complain because the curves obtained are BEAUTIFUL (Merci Martin), but I realize that some training must have been used to get those curves. The code reads ../temp/normalize_factors.p and needs it.

I suppose an adjacent question is: how to get the curves with pure IEKF and no help from the AI or CNN part?

hmf21 commented 2 years ago

Following @scott81321's suggestion, I deleted ../temp/iekfnets.p and also got some curves, which seem to be generated by MesNet with randomly initialized parameters. Also, according to the paper, MesNet is composed of two convolutional layers, but iekfnets.p contains a model with five layers. Is there something wrong with the implementation?

scott81321 commented 2 years ago

Hello @hmf17, can you read the contents of iekfnets.p? What little I know is that it contains the CNN (mes_net). I can get a picture of it using netron, but can you give me the Python instructions to read the contents?

[Image: CNN_model]

hmf21 commented 2 years ago

Hi @scott81321, I used torch.load() to read the contents of iekfnets.p, and the result is shown in the picture below. Although the picture is not intuitive, the structure seems to differ from your picture, which contains only two conv layers. How did you get this diagram? It is very beautiful.

scott81321 commented 2 years ago

Hello @hmf17, thank you. To get that picture of the CNN, I use a relatively new piece of software called netron. You can use it online at https://netron.app/ or download it from GitHub at https://github.com/lutzroeder/netron. You have to create a .pt file inside __init__ in class TORCHIEKF. After the line self.mes_net = MesNet(), save the CNN model with:

    PATH = "...../CNN_model.pt"
    torch.save(self.mes_net, PATH)

Once you have that file, load it into netron.

I do see something weird in the picture you just showed me: dimension indices as high as 128? Your picture is beautiful too. I used torch.load() followed by a print statement, which gives too many details. How did you get the tensor dimensions upfront?

hmf21 commented 2 years ago

Hi @scott81321, thank you for pointing me to this powerful software. I simply use PyCharm to inspect iekfnets.p; you can see the parameter states in the Variables toolbar. The maximum output feature dimension is 128 in this model, which is quite different from the description in the paper. I still have made no progress in running this program. Do you have any good ideas?

scott81321 commented 2 years ago

Oh! just use the code as originally loaded and remove iekfnets.p from the temp sub-directory [just put iekfnets.p elsewhere]. If it cannot find the file, it gives a print statement [look for cprint("IEKF nets NOT loaded", 'yellow') in utils_torch_filter.py] but carries on nonetheless. The original version that you can download only uses normalize_factors.p [make sure train_filter=0]. I got the code working on the test files producing 10 ensembles of graphs. What I would like to know is how to get the results without the training i.e. pure IEKF because ironically, even though I am clearly NOT loading iekfnets.p, the picture I get for 2011_09_30_drive_0028_extract i.e. file position_xy.png looks like the result enhanced with AI (CNN) not the raw IEKF result.
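For anyone who would rather not move the file by hand, here is a minimal sketch of the "put iekfnets.p elsewhere" step. The helper name and the backup directory are my own assumptions, not part of the repository; the only thing taken from the thread is that the loader skips the file when it is missing.

```python
import os
import shutil


def set_aside_checkpoint(checkpoint_path, backup_dir):
    """Move a checkpoint out of the way; return its new path, or None if absent."""
    if not os.path.isfile(checkpoint_path):
        return None  # nothing to move, the loader will already skip it
    os.makedirs(backup_dir, exist_ok=True)
    destination = os.path.join(backup_dir, os.path.basename(checkpoint_path))
    shutil.move(checkpoint_path, destination)
    return destination


# Example call; both paths are illustrative.
# set_aside_checkpoint("../temp/iekfnets.p", "../temp_backup")
```

Keeping a copy rather than deleting the file makes it easy to put it back later and compare runs.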

Please, can you give me the specific Python command(s) to print out the contents of iekfnets.p ??

hmf21 commented 2 years ago

Hi @scott81321, I just use a few simple commands:

    path_iekf = './temp/iekfnets.p'
    mondict = torch.load(path_iekf)

Then I can see the contents of the loaded model in the Variables toolbar on the right.
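If you want the contents printed instead of browsed in a debugger, a small sketch like the following may help. It assumes the file holds a plain state dict (the traceback above passes it straight to load_state_dict, so that seems likely). The helper names are my own; the comparison only mirrors the three kinds of complaint that load_state_dict reports.

```python
def shapes_of(state_dict):
    """Map each parameter name to its shape as a plain tuple."""
    return {name: tuple(value.shape) for name, value in state_dict.items()}


def compare_shapes(checkpoint, model):
    """Compare two {name: shape} mappings the way load_state_dict reports them."""
    unexpected = [name for name in checkpoint if name not in model]
    missing = [name for name in model if name not in checkpoint]
    mismatched = {
        name: (checkpoint[name], model[name])
        for name in checkpoint
        if name in model and checkpoint[name] != model[name]
    }
    return unexpected, missing, mismatched


def load_and_compare(checkpoint_path, module):
    """Load a checkpoint on CPU and compare it with a torch.nn.Module."""
    import torch  # imported here so the helpers above work without PyTorch

    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    return compare_shapes(shapes_of(checkpoint), shapes_of(module.state_dict()))
```

Calling load_and_compare("../temp/iekfnets.p", torch_iekf) just before the failing load should list the same unexpected keys and size mismatches as the RuntimeError, but as data you can loop over and print.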

scott81321 commented 2 years ago

Thx. Here is what netron gives for iekfnets.pt (note as a pt file)

hmf21 commented 2 years ago

great! @scott81321

lumyus commented 2 years ago

So did anyone get it to work? I mean actually use your own data to get results? The plots seem to be generated no matter what model is used..

scott81321 commented 2 years ago

I got it to work for the datasets downloaded from GitHub, but not on my own data yet. I need to understand his code better, e.g. how to switch the neural network on or off, i.e. run pure IEKF.

lumyus commented 2 years ago

Nice! What did you change? Running the model which is provided by the author does not work..

Hazeline2018 commented 2 years ago

@scott81321 @hmf17 Hi, I wonder how you guys got the program working with training (train_filter = 1), even with the KITTI datasets that Martin originally used? When I read in the datasets, and start training, I got the following error that I have no clue about:

Sequence name : 2011_09_30_drive_0028_sync
Sequence name : 2011_09_30_drive_0033_sync
Dataset is too short (15.94 s)
Sequence name : 2011_09_30_drive_0034_sync
Dataset is too short (12.24 s)
Sequence name : 2011_09_30_drive_0072_sync
Dataset is too short (0.05 s)
Total dataset duration : 825.41 s
IEKF nets NOT loaded
Traceback (most recent call last):
  File "main_kitti.py", line 484, in <module>
    launch(KITTIArgs)
  File "main_kitti.py", line 28, in launch
    train_filter(args, dataset)
  File "/home/terryl/projects/AI-IMU-DR/ai-imu-dr/src/train_torch_filter.py", line 61, in train_filter
    prepare_loss_data(args, dataset)
  File "/home/terryl/projects/AI-IMU-DR/ai-imu-dr/src/train_torch_filter.py", line 108, in prepare_loss_data
    Rot_gt = torch.zeros(Ns[1], 3, 3)
TypeError: zeros() received an invalid combination of arguments - got (NoneType, int, int), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)

Hopefully you guys can give me some advice on how to get past this error and, most importantly, get the training program working first. I intend to tailor the program to my application by training the model on my own datasets, if at all possible.

I'm using PyTorch 1.0.0 with GPU version

Thanks in advance! Terry

saltrack commented 2 years ago

@kebijuelun @scott81321 @lumyus @hmf17 did you guys make any progress on resolving this problem?

The plots seem pretty good even with randomly initialized parameters.

I've modified the layer sizes of MesNet, which resolved some of the errors, but this error persists:

"RuntimeError: mat1 and mat2 shapes cannot be multiplied (47945x64 and 32x2)"

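Reading that last message on its own terms: the left operand has 64 columns and the right one has 32 rows, so whatever feeds the final matrix product now emits 64 features while the layer after it still expects 32. A rough way to catch this kind of leftover before running the filter is to check that every layer's output width matches the next layer's input width. The layer names and the list below are hypothetical, chosen only to reproduce the 64 versus 32 clash; they are not read from the repository.

```python
def find_width_mismatches(layers):
    """layers: list of (name, in_features, out_features) in forward order.

    Returns (producer, consumer, produced, expected) for each broken link.
    """
    problems = []
    for (prev_name, _, prev_out), (name, expected_in, _) in zip(layers, layers[1:]):
        if prev_out != expected_in:
            problems.append((prev_name, name, prev_out, expected_in))
    return problems


# Hypothetical layer list, loosely modelled on the shapes in the error message.
layers = [
    ("conv_a", 6, 32),
    ("conv_b", 32, 64),  # widened to match the checkpoint
    ("linear", 32, 2),   # still sized for 32 input features
]
print(find_width_mismatches(layers))
```

In other words, widening one convolution to fit the checkpoint is not enough on its own; every layer downstream of it has to be resized consistently as well.
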
nothing371442 commented 2 years ago

Hi, I also ran into the MesNet size mismatch problem. When I deleted iekfnets.p and ran the code without the CNN, the result looked good. How can I run the code with the CNN? Meanwhile, why is the result without the CNN adapter already so good? Thanks a lot :)

nothing371442 commented 2 years ago

Update: the mismatch problem can be solved by turning on the train option (set train_filter to 1). Training generates a new iekfnets.p, which can then be used for the test filter.

Rajat-Arora commented 2 years ago

@nothing371442 Didn't you get any errors while training, as mentioned in #72?

Did you make any changes to get the train option (set to 1) working on the existing dataset provided by the author? Could you help me out with it?

nothing371442 commented 2 years ago

@Rajat-Arora Did you delete the iekfnets.p file first? I deleted iekfnets.p first and then ran with the train option set to 1, which generates a new .p file.

Rajat-Arora commented 2 years ago

@nothing371442 Yes, I deleted this file and set the train option to 1, but it gives me an error similar to #72.

scott81321 commented 2 years ago

Hi guys. As far as I can tell, there is a format mismatch between the file iekfnets.p and what the CNN in the code expects. Notice that Brossard's default is test mode, not train mode. I saw discrepancies between the noise covariance values in his thesis and the ones he encoded for the OXTS data files of his test data. This suggests to me that he hardwired these numbers to get the best results for his test cases and, pragmatically, more or less relinquished the training aspect. These noise covariances are the initial ones in main_kitti.py and, less importantly, in utils_numpy_filter.py. I had to modify the ones in main_kitti.py to get the best results for the data given to me.

So I would like to ask all of you: what does iekfnets.p contain? Is it only noise covariances? If so, which ones?

nothing371442 commented 2 years ago

@Rajat-Arora Hi, did you download the provided delta_p.p file first?

nothing371442 commented 2 years ago

@scott81321 Regarding what iekfnets.p contains: I think it contains the network parameters, as in the picture below.

Rajat-Arora commented 2 years ago

@nothing371442 I was able to figure it out and train the model. There were some issues with the version of PyTorch that I was using.

Rajat-Arora commented 2 years ago

Hi @scott81321, could you describe in more detail what modifications you made in main_kitti.py to get the best results? Also, you mentioned data given to you: is that the dataset provided by the author or your own dataset?

scott81321 commented 2 years ago

@Rajat-Arora The data is proprietary and I cannot tell you where it came from. It's not OXTS data; that much I can tell you. The IMU sensor is not as high quality. As I said, to get the best results I had to change the noise covariances, i.e. the variables starting with cov_ in the Python files I mentioned. I cannot and will not tell you what settings I used, only point out that I had to increase them. To find the best results, I tried many simulations on the same data until I found a range that worked well.
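The "try many simulations until a range works" procedure can be organised as a plain grid sweep over scale factors. This is only a sketch: the evaluate callable stands in for whatever you use to run the filter and score a trajectory, and the parameter names in the example call are placeholders for the cov_* variables, not recommended values.

```python
from itertools import product


def sweep_covariances(baseline, scales, evaluate):
    """Try every combination of scale factors; return (best_error, best_settings).

    baseline: {parameter_name: initial_value}
    scales:   multipliers applied to each baseline value
    evaluate: user-supplied callable mapping a settings dict to an error (lower is better)
    """
    names = sorted(baseline)
    best = None
    for combo in product(scales, repeat=len(names)):
        candidate = {name: baseline[name] * s for name, s in zip(names, combo)}
        error = evaluate(candidate)
        if best is None or error < best[0]:
            best = (error, candidate)
    return best


# Example call with placeholder names and a user-written scoring function:
# sweep_covariances({"cov_acc": 1e-3, "cov_omega": 2e-4}, [1, 3, 10, 30], my_error)
```

The number of runs grows as len(scales) to the power of the number of parameters, so it pays to sweep only two or three covariances at a time.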

ajay1606 commented 1 year ago

@scott81321 Thank you for your input on most of the queries posted here. Every single comment you posted is useful for understanding this work. With your support, I was able to get the following result on a custom dataset.

Still, a few parameters need tuning to get a better result. Has anyone come across a similar situation? I would appreciate any response.

I am also trying to port it to ROS so that we can test with real-time sensor input. I will share it once I have completed that.

[Images: XY plot; aligned XY plot]

scott81321 commented 1 year ago

@ajay1606 What are you asking for? How to improve your results? With all due respect, the aligned picture looks pretty good in terms of agreement. What sensor are you using? Is it high quality? Also, what is the resolution of your lat-longs, i.e. position? If it's GPS, the accuracy is limited by the number of digits: e.g. 5 decimal digits of lat-long give about 1.1 meters of resolution, and 4 digits give only about 11.1 meters. It seems to me this result is pretty good. The only thing I can think of to improve it would be a slight adjustment of the initial noise covariances (variables cov_*) in main_kitti.py. There is also the issue of the INITIAL CONDITIONS, i.e. initial velocity and especially initial RPY. This program is VERY sensitive to initial RPY. E.g. if you're driving a vehicle on a horizontal flat surface, you have to worry about the initial yaw; roll and pitch should be about zero in this case.
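The digit figures above follow from one degree of latitude being roughly 111 km on the ground. A short sketch of that arithmetic, using a rounded constant that is my own approximation:

```python
import math

# Approximate ground distance of one degree of latitude, in meters.
METERS_PER_DEGREE = 111_320.0


def latlon_resolution_m(decimals, latitude_deg=0.0):
    """Return (north_south, east_west) size in meters of the last stored digit."""
    step_deg = 10.0 ** (-decimals)
    north_south = step_deg * METERS_PER_DEGREE
    # Meridians converge towards the poles, so longitude steps shrink with latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for digits in (4, 5, 6, 7):
    ns, ew = latlon_resolution_m(digits, latitude_deg=49.0)
    print(f"{digits} decimals: {ns:.3f} m north-south, {ew:.3f} m east-west")
```

With 5 decimals the north-south step comes out at about 1.1 m and with 4 decimals at about 11.1 m, which matches the numbers quoted above. Ground truth logged with too few decimals will therefore look jagged no matter how well the filter is tuned.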

ajay1606 commented 1 year ago

@scott81321 Thank you so much for your quick response. I am currently testing with a NOVATEL RTK GNSS and an Epson G320N MEMS IMU. Thank you for the confirmation; I will try tuning the initial noise covariances as you suggested. I agree with you completely that the program is very sensitive to initial RPY.

Thank you so much.

kartikeya13 commented 1 year ago

Hello, Apologies for the newbie question but can anyone tell me what is the difference between XY plot and the aligned XY plot? Thanks

scott81321 commented 1 year ago

As far as I know, the aligned plot is one which tries to align the IEKF computed solution from IMU data with the ground truth (usually GPS values). The XY plot is the plot without that alignment. This alignment is made in utils_plot.py
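To make the idea concrete, here is a minimal sketch of one common way to align a 2D estimate with ground truth: a least-squares rigid fit (rotation plus translation, no scaling). I have not checked that utils_plot.py does exactly this, so treat it as an illustration of the concept, not a copy of the repository's method.

```python
import math


def align_2d(estimate, reference):
    """Rigidly align `estimate` onto `reference` (equal-length lists of (x, y))."""
    n = len(estimate)
    ex = sum(x for x, _ in estimate) / n
    ey = sum(y for _, y in estimate) / n
    rx = sum(x for x, _ in reference) / n
    ry = sum(y for _, y in reference) / n

    # Least-squares rotation angle from the centred point pairs.
    dot = cross = 0.0
    for (x, y), (u, v) in zip(estimate, reference):
        ax, ay, bx, by = x - ex, y - ey, u - rx, v - ry
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)

    # Rotate about the estimate's centroid, then move onto the reference centroid.
    return [
        (c * (x - ex) - s * (y - ey) + rx, s * (x - ex) + c * (y - ey) + ry)
        for x, y in estimate
    ]
```

The unaligned plot shows how far the raw dead-reckoning output drifts in the world frame, whereas the aligned plot removes a global rotation and offset so that you can judge the shape of the trajectory. That is also why an error in the initial yaw hurts the plain XY plot much more than the aligned one.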

Akudavale commented 10 months ago

@Rajat-Arora Hey, you wrote that you figured out the training error (the one similar to #72) and that it came down to the PyTorch version you were using. Can you explain what you did to solve it?

mbrossar / ai-imu-dr

train parameters size mismatch #69