jstmn / ikflow

Open source implementation of the paper "IKFlow: Generating Diverse Inverse Kinematics Solutions"
https://sites.google.com/view/ikflow/home

Inquire about the training and the installation #16

Open YupuLu opened 1 month ago

YupuLu commented 1 month ago

Hi Jeremy,

Thank you for providing the official implementation code! A few points confused me while applying the code, and I am wondering if you can help me.

  1. I am new to poetry and am using Ubuntu 20.04. When resolving the dependencies, it always tries to install klampt = "0.9.1.post6" and returns an error, since no such version exists. I have no idea why this happens, because the version seems to be pinned to "0.9.1.post5" in the jrl project.
  2. How long will the training take for the panda robot, for example with "python scripts/train.py --robot_name=panda --nb_nodes=12 --batch_size=128 --learning_rate=0.0005"? The paper says training was done with an NVIDIA GeForce RTX 2080 Ti graphics card. For me, one epoch took several hours, but the max epoch is set to 5000. How is it possible to finish a training run? Is there anything I missed?

I would greatly appreciate it if I could hear from you soon.

Best regards, Yupu

jstmn commented 4 weeks ago

Hi Yupu,

  1. I'm not sure what's causing that. I just bumped klampt to 0.9.2 in jrl; can you try a fresh install?
  2. The max epoch won't be reached! Training is designed to be stopped manually. For the final trained models I let training run for several days. Check out this comment (and the whole thread) to see what an expected time vs. pose error plot should look like: https://github.com/jstmn/ikflow/issues/6#issuecomment-1983985747

Also if you're training a panda robot, use these parameters: --robot_name=panda --nb_nodes=12 --coeff_fn_internal_size=1024 --coeff_fn_config=3 --dim_latent_space=7 --batch_size=512 --learning_rate=0.00005 --gradient_clip_val=1 --dataset_tags non-self-colliding
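Put together as a single copy-pasteable command, those flags are:

```shell
# Recommended panda training invocation, assembled from the flags above
python scripts/train.py \
  --robot_name=panda \
  --nb_nodes=12 \
  --coeff_fn_internal_size=1024 \
  --coeff_fn_config=3 \
  --dim_latent_space=7 \
  --batch_size=512 \
  --learning_rate=0.00005 \
  --gradient_clip_val=1 \
  --dataset_tags non-self-colliding
```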

Good luck! Let me know if you have any other issues.

jstmn commented 4 weeks ago

And just to be clear, there are pretrained models you can use, for example: python scripts/evaluate.py --testset_size=500 --model_name=panda__full__lp191_5.25m. All models are listed here: https://github.com/jstmn/ikflow/blob/master/ikflow/model_descriptions.yaml

YupuLu commented 3 weeks ago

Thank you for your reply! It really helps clarify my confusion. I will try the install and test whether everything works.

Still, I am wondering if the installation requirements can be loosened, such as the Python version (only 3.8) and PyTorch (2.3). Will it work with PyTorch 2.0 or Python 3.9? If so, it would be easier to integrate with many other projects.

YupuLu commented 3 weeks ago

Also, I am somewhat new to robotic manipulators. If I want to use a learned model with jrl in another application (for instance, with pybullet) for the same robot, like the Franka Panda, is there anything I should be aware of? Thanks in advance!

YupuLu commented 3 weeks ago

I managed to make it work with Python 3.9 and PyTorch 2.0.1. I'm still not sure whether anything will break, and will report back if I find anything worth sharing.

jstmn commented 3 weeks ago

Hi @YupuLu ,

Great, sounds like you got it working. Right now only Python 3.8 is allowed because it would be extra work to ensure the code works on other Python versions. I would guess it should work fine on later versions too. I think PyTorch just needs to be >= 2.0, because that's when setting the default dtype and device was introduced.
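As an illustration of that constraint (the helper name here is made up, not part of ikflow or jrl): torch.set_default_device first appeared in PyTorch 2.0, so a version gate only needs to check the major version.

```python
# Illustrative helper (not an ikflow/jrl function): gate features that need
# torch.set_default_device, which first appeared in PyTorch 2.0.
def supports_default_device(torch_version: str) -> bool:
    major = int(torch_version.split(".")[0])
    return major >= 2

print(supports_default_device("2.0.1"))   # True
print(supports_default_device("1.13.1"))  # False
```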

Did you do it by editing pyproject.toml? If so, can you post it in this thread so others can see?

"If I want to utilize a learning model with jrl to other application (for instance, using pybullet) for the same robot like franka panda, is there anything that I should be aware of?"

The thing you need to check is whether the URDFs are the same. To ensure they are, you can use the URDF used by IKFlow, which is stored at ~/.cache/jrl/temp_urdfs/. Otherwise, you'll need to verify that the pybullet and ikflow URDFs are identical.
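One simple way to check that two URDF files match is to compare file hashes (a generic sketch; the paths in the comment are placeholders, not exact filenames):

```python
import hashlib
from pathlib import Path

def urdf_checksum(path) -> str:
    """SHA-256 digest of a URDF file's raw bytes."""
    return hashlib.sha256(Path(path).expanduser().read_bytes()).hexdigest()

# Placeholder paths -- substitute the files your two applications actually load:
# same = urdf_checksum("~/.cache/jrl/temp_urdfs/<robot>.urdf") == \
#        urdf_checksum("<path/to/pybullet>/<robot>.urdf")
```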

Once you get ikflow working with pybullet, can you share the steps required in this thread? I'm curious to hear myself, and it will be helpful for others.

YupuLu commented 3 weeks ago

Hi Jeremy @jstmn,

Edit 1: I checked the package versions and the version of torch is still 2.4.0...
Edit 2: After testing multiple times I found a roundabout way to install torch==2.0.1. I have no idea why '--no-update' did not work: when I ran poetry lock --no-update, poetry kept updating torch to 2.4.0, so I just commented out all the lines related to torch.

I am still quite unfamiliar with poetry, so I am not sure exactly what I did or why it worked. But here are my installation steps:

  1. Create a new conda environment with python 3.9.
  2. Clone the jrl project, delete the file poetry.lock, and then modify pyproject.toml (change the Python version requirement to python = "^3.8.0" and comment out the line torch = "2.3").
  3. Run (maybe poetry lock --no-update first) poetry install to install jrl.
  4. It seems that poetry sometimes doesn't work well with torch 2.0.1 (if you have torch installed, an error is raised when you import jrl after the installation). I reinstalled torch locally with pip to fix this, since I had already downloaded the package.
  5. Clone this ikflow project, delete the file poetry.lock, and then modify pyproject.toml (change the Python version requirement to python = "^3.8.0" and comment out the lines FrEIA = "0.2", jrl = ..., and pytorch-lightning = "1.8.6").
  6. Run (maybe poetry lock --no-update first) poetry install to install ikflow.
  7. Install FrEIA, pytorch-lightning, and pytorch using pip: pip install FrEIA==0.2 pytorch-lightning==1.8.6 torch==2.0.1+cu117
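The steps above, condensed into shell commands (the pyproject.toml edits are done by hand; note that the +cu117 wheel may require pointing pip at the PyTorch wheel index):

```shell
# Steps 1-7 above, condensed; pyproject.toml edits are manual
conda create -n ikflow python=3.9 -y
conda activate ikflow

git clone https://github.com/jstmn/jrl && cd jrl
rm poetry.lock
# edit pyproject.toml: python = "^3.8.0", comment out torch = "2.3"
poetry lock --no-update  # may or may not be needed
poetry install
cd ..

git clone https://github.com/jstmn/ikflow && cd ikflow
rm poetry.lock
# edit pyproject.toml: python = "^3.8.0", comment out FrEIA, jrl, pytorch-lightning
poetry lock --no-update  # may or may not be needed
poetry install

pip install FrEIA==0.2 pytorch-lightning==1.8.6 torch==2.0.1+cu117
```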

    Did you do it by editing pyproject.toml? If so can you post it in this thread so others can see.

I am developing my project and will test to see if everything works fine or not.

Thank you for your suggestions. I haven't tried such things before and it may take time for me to finish the verification. Wish me good luck :)

    The thing you need to check is whether the urdfs are the same. To ensure they are the same, you can use the urdf used by IKFlow, which will be stored at ~/.cache/jrl/temp_urdfs/. Otherwise, you'll need to verify the pybullet, and ikflow urdfs are identical.

YupuLu commented 4 days ago

Hi Jeremy @jstmn , I noticed that the data loading is not entirely consistent. During training, some resources related to the robot model are always loaded onto "cuda:0". This can be reproduced by calling get_robot('panda') with DEVICE='cuda:3' in jrl/config.py. The relevant nvidia-smi rows:

    | 0 N/A N/A 2527243 C python 510MiB  |
    | 0 N/A N/A 2527450 C python 510MiB  |
    | 3 N/A N/A 2527243 C python 3456MiB |
    | 3 N/A N/A 2527450 C python 3456MiB |

jstmn commented 4 days ago

Sounds like DEVICE from jrl/config.py isn't being used everywhere. Which variables specifically have the wrong cuda device?
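One pattern that produces exactly this symptom (purely illustrative, not jrl's actual code): a module-level device constant that gets captured in a default argument at definition time, so later reassignments never propagate.

```python
# Illustrative only -- not jrl's actual code.
DEVICE = "cuda:0"  # module-level constant, set when the module is imported

class RobotModel:
    # The default value is evaluated once, when the class body executes,
    # so reassigning DEVICE afterwards has no effect on it.
    def __init__(self, device: str = DEVICE):
        self.device = device

DEVICE = "cuda:3"           # too late: the default was already bound
print(RobotModel().device)  # still "cuda:0"
```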

YupuLu commented 4 days ago

Well, I did a simple test just now; here is the script I used, with device='cuda:3':

from jrl.robots import get_robot  # this import alone triggers the allocation on cuda:0
import time

if __name__ == "__main__":
    time.sleep(1000)  # keep the process alive so GPU usage can be inspected with nvidia-smi

As long as the first line was present, the problem occurred. Even when I commented out everything in jrl/robots.py except the function get_robot(), the 510 MiB on cuda:0 was still occupied. So I suppose the fault is not related to the variables in the jrl project, but to something in the installation?

YupuLu commented 4 days ago

By the way, would you mind providing the negative log-likelihood curve during training for reference, like in the post you mentioned before?

[screenshot attached]

jstmn commented 3 days ago

What's the actual error you're getting? Can you include the stack trace?

jstmn commented 3 days ago

Sure, here's the curve: [image attached]

YupuLu commented 2 days ago

    What's the actual error you're getting? Can you include the stack trace?

Actually, there was no error. To reproduce it more simply: I started Python in a terminal, ran import jrl, and monitored with nvidia-smi in another session; there was 510 MiB of usage on GPU 0. But I am confused about why the import alone leads to GPU usage.

I noticed output like "Warp 0.10.1 initialized ... CUDA Toolkit: ... Devices: ... Kernel cache: ..." when importing jrl. It seems to me that this step takes up the GPU memory, so I suppose it has nothing to do with the package itself?

jstmn commented 1 day ago

It could be from the forward-kinematics cache operation done here: https://github.com/jstmn/jrl/blob/master/jrl/robot.py#L236
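If that cache operation is the culprit, one common fix is lazy initialization, which defers the expensive (GPU-allocating) work from import time to first use. A generic sketch, not jrl's actual code:

```python
# Generic lazy-initialization sketch (not jrl's actual code).
_fk_cache = None

def _build_fk_cache():
    # stands in for the expensive / GPU-allocating cache construction
    return {"built": True}

def get_fk_cache():
    """Build the cache on first use instead of at import time."""
    global _fk_cache
    if _fk_cache is None:
        _fk_cache = _build_fk_cache()
    return _fk_cache

print(_fk_cache is None)        # True: nothing built at import
print(get_fk_cache()["built"])  # True: built on first call
```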

The "Warp 0.10.1 initialized ... CUDA Toolkit: ... Devices: ... Kernel cache: ..." message happens whenever you call import warp, so that's probably not it.