vye16 / slahmr


Unable to run on Cloud GPU #30

Closed smandava98 closed 1 year ago

smandava98 commented 1 year ago

Hi. I'm trying to run the preprocessing command (on a random test video I have) on an A10 cloud GPU, but I keep getting the error below (I already verified that everything works on Colab):

```
Traceback (most recent call last):
  File "/home/ubuntu/slahmr/slahmr/preproc/track.py", line 101, in main
    phalp_tracker = PHALP_Prime_HMR2(cfg)
  File "/home/ubuntu/slahmr/slahmr/preproc/track.py", line 53, in __init__
    super().__init__(cfg)
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/phalp/trackers/PHALP.py", line 52, in __init__
    self.setup_hmr()
  File "/home/ubuntu/slahmr/slahmr/preproc/track.py", line 58, in setup_hmr
    self.HMAR = HMR2Predictor(self.cfg)
  File "/home/ubuntu/slahmr/slahmr/preproc/track.py", line 30, in __init__
    model, _ = load_hmr2()
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/hmr2/models/__init__.py", line 36, in load_hmr2
    model = HMR2.load_from_checkpoint(checkpoint_path, strict=False, cfg=model_cfg)
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1520, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/pytorch_lightning/core/saving.py", line 90, in _load_from_checkpoint
    model = _load_state(cls, checkpoint, strict=strict, **kwargs)
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/pytorch_lightning/core/saving.py", line 143, in _load_state
    obj = cls(**_cls_kwargs)
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/hmr2/models/hmr2.py", line 59, in __init__
    self.mesh_renderer = MeshRenderer(self.cfg, faces=self.smpl.faces)
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/hmr2/utils/mesh_renderer.py", line 49, in __init__
    self.renderer = pyrender.OffscreenRenderer(viewport_width=self.img_res,
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/pyrender/offscreen.py", line 31, in __init__
    self._create()
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/pyrender/offscreen.py", line 137, in _create
    egl_device = egl.get_device_by_index(device_id)
  File "/home/ubuntu/anaconda3/envs/slahmr/lib/python3.10/site-packages/pyrender/platforms/egl.py", line 83, in get_device_by_index
    raise ValueError('Invalid device ID ({})'.format(device_id, len(devices)))
ValueError: Invalid device ID (0)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['data=video', 'data.seq=test', 'data.root=/home/ubuntu/slahmr/demo', 'run_opt=False', 'run_vis=False']
Traceback (most recent call last):
  File "/home/ubuntu/slahmr/slahmr/run_opt.py", line 175, in main
    dataset = get_dataset_from_cfg(cfg)
  File "/home/ubuntu/slahmr/slahmr/data/dataset.py", line 41, in get_dataset_from_cfg
    check_data_sources(args)
  File "/home/ubuntu/slahmr/slahmr/data/dataset.py", line 70, in check_data_sources
    preprocess_tracks(args.sources.images, args.sources.tracks, args.sources.shots)
  File "/home/ubuntu/slahmr/slahmr/data/vidproc.py", line 40, in preprocess_tracks
    phalp.process_seq(
  File "/home/ubuntu/slahmr/slahmr/preproc/launch_phalp.py", line 54, in process_seq
    os.rename(f"{res_dir}/results/demo_{seq}.pkl", res_path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/slahmr/demo/slahmr/phalp_out/results/demo_test.pkl' -> '/home/ubuntu/slahmr/demo/slahmr/phalp_out/results/test.pkl'
```

I'm confused about how to solve this. I already verified that my GPU works.

geopavlakos commented 1 year ago

The error seems to be related to pyrender. You can check whether the installation of pyrender is correct on your end.

You can potentially continue without pyrender, but then you won't be able to visualize any results. If you want to do that, you can disable rendering for phalp (add an extra line "render.enable=False" after the line here), and then run the main slahmr command with run_vis=False.
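
For reference, using the demo layout from the traceback above, the rendering-free run might look like this (a sketch; the exact overrides depend on your config):

```bash
# run_vis=False skips the slahmr visualization step, so pyrender is not needed there.
python slahmr/run_opt.py data=video data.seq=test data.root=/home/ubuntu/slahmr/demo run_vis=False
```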

smandava98 commented 1 year ago

I tried that, but I'm running into a lot of issues with this codebase on a remote server (via Lambda Labs), and I still get the same error. The issue now seems to be some rendering code in the HMR 2.0 repo, so I will dig through there or open an issue there if things get unwieldy. Thanks for the help again!

smandava98 commented 1 year ago

Update:

Seems like there are many other issues.

Some of the code mixes nvcc and Python builds, which is not ideal because it requires a specific CUDA version. To get around this, you either need a fully functional Anaconda setup, or you need to install the CUDA toolkit in an alternate location and set PATH and LD_LIBRARY_PATH accordingly.
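
A minimal sketch of the second option (the install prefix below is an example, not prescriptive):

```bash
# Point the build at a CUDA toolkit installed outside the system default.
export CUDA_HOME=/opt/cuda-11.7                            # example prefix
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
nvcc --version   # should now report the toolkit you intend to build against
```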

On HPC systems, versioned C/nvcc toolchains have always been handled via 'Environment Modules', but Lambda Labs does not do versioning for Lambda Stack, CUDA, etc.

But it still errors out in the DROID-SLAM build of src/droid_kernels.cu. It seemed to have issues with:

```
/usr/include/Eigen/src/Core/ArrayWrapper.h(145): warning #20012-D: __device__ annotation is ignored on a function("MatrixWrapper") that is explicitly defaulted on its first declaration
src/droid_kernels.cu(1121): error: qualified name is not allowed
src/droid_kernels.cu(1121): error: this declaration has no storage class or type specifier
src/droid_kernels.cu(1121): error: expected a ";"
src/droid_kernels.cu(1128): error: namespace "Eigen" has no member "VectorX"
...
```

Is there any way to get over this on a custom cloud environment?

smandava98 commented 1 year ago

Conda simply breaks a lot of things. I have no idea how it works on Google Colab but not on a remote server.

smandava98 commented 1 year ago

Is there any way I can run this code on a cloud server without Anaconda?

Please let me know.

smandava98 commented 1 year ago

The issue seems to be that this repo is hardcoding CUDA 11.7. @geopavlakos

geopavlakos commented 1 year ago

We provide a working configuration in install.sh that we were able to test on our end. Could you try modifying it so that it works for your environment (CUDA version, etc.)?

smandava98 commented 1 year ago

Thanks Georgios. I finally got it working; it required activating xvfb. Running on a cloud GPU was tough.
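
(For anyone hitting the same offscreen-rendering failure on a headless server, the xvfb workaround looks roughly like this; the script and overrides are just the demo ones from the traceback above:)

```bash
sudo apt-get install -y xvfb
# Wrap the run in a virtual framebuffer so the renderer has a display to attach to.
xvfb-run -a python slahmr/run_opt.py data=video data.seq=test data.root=/home/ubuntu/slahmr/demo
```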

So I fed in a 30-second video, but it seems to only predict for the first few seconds. How can I make it predict on the whole video?

I took screenshots at various points in the video and ran HMR2 on them; it predicts the body in all of the frames, so I'm confident the model does predict, but for some reason it stops early. I'm having trouble stepping through the codebase to debug this.

geopavlakos commented 1 year ago

For longer videos, we recommend breaking them down into smaller sequences (e.g., up to 200-300 frames) and running slahmr on each one of them. See also #17. You can specify the beginning and end frame of the part you want to process with the arguments in these two lines.
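
One way to produce such chunks up front (a sketch, not part of the repo; filenames are placeholders) is to split the video with ffmpeg before preprocessing, e.g. ~8-second segments for a 30 fps video to stay near 250 frames:

```bash
# -c copy cuts at keyframes, so segment lengths are only approximate.
ffmpeg -i input.mp4 -c copy -f segment -segment_time 8 -reset_timestamps 1 chunk_%03d.mp4
```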

smandava98 commented 1 year ago

Ah, got it. In that other issue, you mentioned that each subsequence would have a different scale.

That would cause some inconsistencies in describing how someone moves over a longer video. Is there a way to normalize the scale across subsequences?

geopavlakos commented 1 year ago

Unfortunately, this is not something we support out of the box. You would probably get some form of inconsistency (e.g., jittery pose or location transitions), but you can check how severe it is for your use case.

smandava98 commented 1 year ago

Ah, got it. Thanks again for your quick and thorough responses, I really appreciate it. A couple more questions:

geopavlakos commented 1 year ago

The slahmr optimization is agnostic to the particular 2D pose model used. You can see how we call the ViTPose model here and assess whether it's possible to integrate the model of your choice. We only detect the 17 COCO body keypoints, but we store them in the order of the OpenPose BODY_25 format, hence the reordering you see here. If your keypoints are a subset of BODY_25, then the transition to a different 2D pose model should be straightforward; otherwise things might be tricky, since you will need to dig a bit further into the code.
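
As a rough illustration of that reordering (a hypothetical sketch, not the repo's actual mapping code):

```python
import numpy as np

# Place 17 COCO keypoints (x, y, confidence) into a 25 x 3 array laid out in
# OpenPose BODY_25 order. BODY_25 joints with no COCO counterpart (neck,
# mid-hip, feet) stay at zero confidence, so the optimization ignores them.
COCO_TO_BODY25 = {          # body25_index: coco_index
    0: 0,                   # nose
    2: 6, 3: 8, 4: 10,      # right shoulder, elbow, wrist
    5: 5, 6: 7, 7: 9,       # left shoulder, elbow, wrist
    9: 12, 10: 14, 11: 16,  # right hip, knee, ankle
    12: 11, 13: 13, 14: 15, # left hip, knee, ankle
    15: 2, 16: 1,           # right eye, left eye
    17: 4, 18: 3,           # right ear, left ear
}

def coco17_to_body25(kps_coco: np.ndarray) -> np.ndarray:
    """kps_coco: (17, 3) array of (x, y, confidence) in COCO order."""
    kps_body25 = np.zeros((25, 3), dtype=kps_coco.dtype)
    for b25_idx, coco_idx in COCO_TO_BODY25.items():
        kps_body25[b25_idx] = kps_coco[coco_idx]
    return kps_body25
```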

Currently, the code uses the SMPL-H model by default, but only optimizes the body pose parameters. Optimizing the hand pose is not implemented and requires non-trivial changes to the code.

smandava98 commented 1 year ago

Okay - hands might be a little tough. However, I see that the toes and heels are part of the OpenPose BODY_25 format, yet the outputs don't get the feet orientations and positions right. Is there a way to fix this?

I thought modifying the run_additional_models section would work, but running the run_opt.py script doesn't appear to reach that section of the code, so I'm a bit confused about where to make this change.

geopavlakos commented 1 year ago

We use a ViTPose model that only detects the 17 COCO body joints (these do not include the feet keypoints), but for convenience the code stores them in the order of the OpenPose BODY_25 format. Since the feet keypoints stay empty, there is no constraint on the toes during the optimization, hence the discrepancy with the image evidence. That said, if you actually have these keypoints, results are expected to improve.

The run_additional_models() function does in fact run as part of the pre-processing in the run_opt.py script (it might be hard to follow all the function calls). This line in particular calls the 2D keypoint detection model; that is where you could augment the keypoint detections with feet keypoints if you have them.
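
If your detector emits COCO-WholeBody keypoints, that augmentation could look something like this (a hypothetical sketch; it assumes the standard COCO-WholeBody and BODY_25 index conventions):

```python
import numpy as np

# Copy the six COCO-WholeBody foot keypoints (indices 17-22: L big toe,
# L small toe, L heel, R big toe, R small toe, R heel) into the
# otherwise-empty BODY_25 foot slots (indices 19-24). The two orderings
# happen to line up pairwise, so a single slice assignment works.
def add_feet_to_body25(kps_body25: np.ndarray, kps_wholebody: np.ndarray) -> np.ndarray:
    out = kps_body25.copy()
    out[19:25] = kps_wholebody[17:23]
    return out
```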

smandava98 commented 1 year ago

Thanks for the above assistance - I was able to get the foot part working with my pose model (which, by the way, is HRNet; it's comparable to ViTPose in performance and provides whole-body keypoints). It might be better to switch to it so that all body keypoints can be optimized. HRNet is quantitatively a bit weaker (though looking at the visuals it's barely noticeable), but the hand and feet keypoints might make a big enough difference in getting a full representation of the body, so I believe this is worth looking into.

Btw, this work is truly amazing and the codebase is great, but it's a bit hard to understand when attempting to make changes. Can you point me to the files where I would need to make changes to include and optimize the hands? The pose model I'm using already has keypoints for the hands and individual fingers; it doesn't fit the OpenPose BODY_25 format, but I do see that OpenPose has a format for hands.

Furthermore, I do see the smpl_to_openpose function that translates between the OpenPose hand keypoints and the SMPL (presumably SMPL-H) format. I also assume I might need to make some changes in 4D Humans, but just sifting through the code is not yielding much context.

geopavlakos commented 1 year ago

I'm glad the foot keypoint fitting worked for you! We will consider this update in the future, but we would need to test it first.

The hand fitting would require non-trivial updates to the codebase. Among others, one would need to:

- make sure that hand keypoints are read and considered (slahmr/data/tools.py),
- reflect this update in the body model (slahmr/body_model/body_model.py, slahmr/body_model/utils.py),
- add a variable for the hand pose and thread it through all the files responsible for the optimization (e.g., slahmr/optim/base_scene.py, slahmr/optim/moving_scene.py, slahmr/optim/optimizers.py),
- add a pose prior and a smoothness prior for the hands (slahmr/optim/losses.py), and
- update the visualization functions (slahmr/vis/output.py).