jstmn / ikflow

Open source implementation of the paper "IKFlow: Generating Diverse Inverse Kinematics Solutions"
https://sites.google.com/view/ikflow/home

I have a question about the negative log-likelihood loss during training. #6

Open · llyg777 opened this issue 7 months ago

llyg777 commented 7 months ago

I have a question about the negative log-likelihood loss during training. I found that the negative log-likelihood loss takes negative values during training.

Is this caused by me not normalizing the dataset? I did not modify the ikflow code.

When I looked at the code, I found that neg_log_likeli is defined by this code:

```python
output, jac = self.nn_model.forward(x, c=conditional, jac=True)  # latent output and log|det J|
zz = torch.sum(output**2, dim=1)   # squared norm of the latent vector
neg_log_likeli = 0.5 * zz - jac
loss = torch.mean(neg_log_likeli)
```

I don't see a logarithm applied to zz or jac here, or anywhere in the GraphINN, nor do I see a softmax layer at the final output of the model. This has puzzled me for a long time, and I would really appreciate an explanation from the author.

jstmn commented 7 months ago

Hi @llyg777,

  1. Negative values for the loss are fine (see the sketch below this list for why).

  2. I'm not sure what the softmax layer you're referring to is. There's tanh scaling of the scaling and translation terms in the GlowCoupling layer, if that's what you're referring to? (https://github.com/vislearn/FrEIA/blob/master/FrEIA/modules/coupling_layers.py#L264)

  3. Here's where the log det of the jacobian is calculated:
    https://github.com/vislearn/FrEIA/blob/master/FrEIA/modules/coupling_layers.py#L304C1-L305C1 - I'll get back to you on why that code is correct.
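For intuition on point 1, here's a minimal sketch with a toy 1-D affine flow and a standard normal base distribution (not the ikflow model), showing that `0.5 * zz - jac` is the negative log-density up to a dropped constant, and that it can legitimately go negative because densities of continuous variables can exceed 1:

```python
import math
import torch

s, t = 0.1, 2.0                       # scale and shift of the toy flow x = s * z + t
x = torch.tensor([1.95, 2.0, 2.05])   # data samples near the mode

z = (x - t) / s                       # inverse pass: data -> latent
log_det_jac = math.log(1.0 / s)       # log|dz/dx| for this map (the analog of `jac`)

neg_log_likeli = 0.5 * z**2 - log_det_jac                  # what the training code computes
exact_nll = neg_log_likeli + 0.5 * math.log(2 * math.pi)   # add back the dropped constant

print(neg_log_likeli)                 # all negative: the density exceeds 1 near the mode
print(torch.allclose(exact_nll, -torch.distributions.Normal(t, s).log_prob(x)))  # True
```

This is also why no explicit `log` shows up in the training code: `0.5 * zz` is already the negative log of the Gaussian base density (up to a constant), and `jac` is already the log-determinant returned by FrEIA.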

    • jeremy
llyg777 commented 7 months ago

Today I found a small error in the code and hope pointing it out is helpful. In ikflow/training/lt_model.py, the latent should be `latent = draw_latent("Gaussian", 1, shape, None)` instead of `latent = draw_latent(None, "Gaussian", 1, shape)`.

And thank you very much for your prompt reply. I will read the code you pointed to in detail, then sort out my thoughts and clarify the problem.

llyg777 commented 7 months ago

I also forgot to mention a very important question about training my own robot model. For training Panda on a 2080Ti with the 25 million samples as in the IKFlow paper, did you train with the max_epochs=5000 from the earlier code, without terminating training early? May I ask how long one training run took you? I trained 1000 epochs myself on a 1080Ti with a small dataset of 250,000 samples, and the position error was nearly 10 cm; it also took a long time. The weight file you provided works fine for me though, its error is normal.

jstmn commented 7 months ago
  1. "Today I found a small error in the code and hope it can be helpful" can you submit a PR to fix this?

  2. To clarify - it sounds like you're trying to recreate the exact results from the IKFlow paper, is that the case? If so, I'll need to dig up the exact hyperparameters.

otherwise, in order to get the best possible ikflow model, run this:

```bash
scripts/train.py --robot_name=panda --nb_nodes=12 --coeff_fn_internal_size=1024 --coeff_fn_config=3 --dim_latent_space=7 --batch_size=512 --learning_rate=0.00005 --gradient_clip_val=1 --dataset_tags non-self-colliding
```

^ these were the run arguments used to get the final panda model used here: https://github.com/jstmn/ikflow/blob/master/ikflow/model_descriptions.yaml#L10

  3. Here's what you can expect the batch-number vs. positional-error curve to look like. I recommend thinking in terms of the number of batches rather than the number of epochs. Epochs don't matter in this context because the data is completely 'uniform' - i.e. there's no finite set of images to iterate over, but rather an infinite source of new data. [Screenshot: batch number vs. positional error curve]
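To make the "infinite source of new data" point concrete, here's a toy sketch (a stand-in, not the ikflow data pipeline): every batch of joint configurations can be drawn fresh from the joint limits, so there is no fixed dataset to finish and an "epoch" is just an arbitrary grouping of batches.

```python
import torch

n_dofs, batch_size = 7, 512
lower = -torch.pi * torch.ones(n_dofs)   # toy joint limits, not the Panda's
upper = torch.pi * torch.ones(n_dofs)

def sample_batch(n: int) -> torch.Tensor:
    """Draw a brand-new batch of joint configurations, uniform within the limits."""
    return lower + (upper - lower) * torch.rand(n, n_dofs)

q1 = sample_batch(batch_size)  # batch 1 (each batch would be paired with its FK poses)
q2 = sample_batch(batch_size)  # batch 2: never the same data twice
```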
llyg777 commented 6 months ago

My native language is not English, and I hope my phrasing doesn't cause any misunderstandings.

Answers:

  1. I will submit a PR to fix that small issue. There are still bigger problems after that; if I can solve them I will upload the fixes, and if I can't I will ask questions here. If you have time, could you please answer them? If not, I will work out the changes myself; I think this is also improving my weak coding skills.

  2. Yes, I am a graduate student studying your paper. I want to complete the entire training process and hopefully reach the accuracy reported in the paper, so that I can use your paper as a baseline for comparison. In fact, I ran a full training with ikflow 0.0.8 six months ago, but the accuracy was always poor, with an L2 error of nearly 10 centimeters. However, as you said, refining with TRAC-IK does achieve good accuracy. I now understand that I did not train for enough steps, so the model had not yet found good weights.

  3. I understand your answer now: setting a larger number of epochs is not necessarily better; what matters is the best weights the model reaches during training.

Finally, thank you again for your answer, which has helped me a lot!

jstmn commented 6 months ago

Hi,

  1. Thanks for the PR! What issues are you having?

  2. If you're going to run an IK refinement step using the robot's jacobian (i.e. TRAC-IK), I recommend using inverse_kinematics_single_step_levenburg_marquardt(), which is part of the Robot class (source is here). The robot variable here has this method. It will be much faster than TRAC-IK when running on multiple solutions (see the sketch after this list for what one such step does).

  3. Yes, it's the number of updates that matters. The number of epochs is vague because it depends on the epoch's size.
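To illustrate what a single jacobian-based refinement step does, here's a generic damped least-squares (Levenberg-Marquardt style) sketch on a toy planar 2-link arm. This is not the implementation behind inverse_kinematics_single_step_levenburg_marquardt(); check that method's source for its actual signature and behaviour.

```python
import torch

LINK1, LINK2 = 0.5, 0.4  # toy link lengths

def fk(q: torch.Tensor) -> torch.Tensor:
    """End-effector (x, y) of a planar 2-link arm for joint angles q = (q1, q2)."""
    x = LINK1 * torch.cos(q[0]) + LINK2 * torch.cos(q[0] + q[1])
    y = LINK1 * torch.sin(q[0]) + LINK2 * torch.sin(q[0] + q[1])
    return torch.stack([x, y])

def lm_step(q: torch.Tensor, target: torch.Tensor, damping: float = 1e-3) -> torch.Tensor:
    """One damped-least-squares update: q <- q + (J^T J + lambda*I)^-1 J^T error."""
    J = torch.autograd.functional.jacobian(fk, q)   # (2, 2) task-space jacobian
    error = target - fk(q)                          # (2,) position error
    dq = torch.linalg.solve(J.T @ J + damping * torch.eye(q.numel()), J.T @ error)
    return q + dq

target = fk(torch.tensor([0.7, -0.4]))   # a reachable target pose
q = torch.tensor([0.6, -0.3])            # e.g. an approximate solution to refine
for _ in range(5):
    q = lm_step(q, target)
print(torch.norm(fk(q) - target))        # position error shrinks toward zero
```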

In general, training a good model is a bit of an art. The most important hyperparameters in my experience are the batch size and learning rate. Increasing nb_nodes can help improve performance as well, at the cost of a longer training time. dim_latent_space can also make a large impact (this is the width of the network); I haven't been able to deduce a trend for how to set it for an arbitrary model / learning rate. Regardless of hyperparameters, you can expect to need about 1 million batches to get below 1.5 cm positional error.

hope that helps.

llyg777 commented 6 months ago

The problem is that when I use the provided weight file for testing, loading it reports a mismatch in the linear transform layer modules. [screenshot of the error]

jstmn commented 6 months ago

It looks like you're mixing torch.float and torch.double tensors (something like this). Try setting all your tensors to the torch.float32 dtype and running again.

If you're still having trouble, please put together a minimal script that demonstrates the error.

jstmn commented 6 months ago

In general, if the error is `RuntimeError: x and y must have the same dtype`, it means you're trying to perform operations on tensors that have different dtypes.
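For example (a generic illustration, not the specific ikflow code path that's failing): a matmul between a float32 parameter and a float64 input raises this kind of error, and casting everything to the same dtype fixes it.

```python
import torch

weights = torch.randn(3, 3, dtype=torch.float32)  # e.g. parameters loaded from a checkpoint
x = torch.randn(3, dtype=torch.float64)           # e.g. an input created as a double (numpy's default)

try:
    weights @ x                                   # mixing float32 and float64 raises a RuntimeError
except RuntimeError as e:
    print(e)

x = x.to(torch.float32)                           # cast everything to the same dtype ...
print(weights @ x)                                # ... and the same operation succeeds
```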