tianhaowuhz / human-assisting-dex-grasp

MIT License

segmentation fault when running rl_eval.sh #8

Open AII6 opened 1 month ago

AII6 commented 1 month ago

My Python version is 3.8 and the Isaac Gym examples run successfully, but running rl_eval.sh fails with an error. I set rl_device: 'cpu' in config.yaml. The full output is below. Could you please help me with this?

(graspgf) galen@galen-ThinkPad:~/human-assisting-dex-grasp-main$ bash rl_eval.sh
bash: /home/galen/anaconda3/envs/graspgf/lib/libtinfo.so.6: no version information available (required by bash)
Importing module 'gym_38' (/home/galen/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/galen/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
/home/galen/anaconda3/envs/graspgf/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, "__version__") or LooseVersion(
/home/galen/anaconda3/envs/graspgf/lib/python3.8/site-packages/torch/utils/cpp_extension.py:25: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  from pkg_resources import packaging # type: ignore[attr-defined]
/home/galen/anaconda3/envs/graspgf/lib/python3.8/site-packages/pkg_resources/__init__.py:3154: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('google'). Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
/home/galen/anaconda3/envs/graspgf/lib/python3.8/site-packages/pkg_resources/__init__.py:3154: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('mpl_toolkits'). Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
PyTorch version 1.12.1
Device count 1
/home/galen/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /home/galen/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /home/galen/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
close dis:0.1
load object from dataset !!!!!!!!!!!!!!!!!!!!!!!!!!!
unseencategory
Obs type: gf
meta data generated
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
JointSpec type free not yet supported!
create env in:0.07950949668884277
Subscene 0 has 1 articulations
rl_eval.sh: line 19: 7916 Segmentation fault python ./Runners/EvalGFPPO.py --constrained --num_envs=1 --dataset_type='unseencategory' --score_model_path="Ckpt/gf" --t0=0.005 --run_device_id=0 --mode='eval' --eval_times=1 --seed=0 --exp_name="ours" --eval_name="ours_unseen" --model_dir="Ckpt/gfppo.pt"

AII6 commented 1 month ago

In the run above I set pipeline="gpu", and the GPU may not have enough memory. When I set pipeline="cpu" instead, the viewer window appears but crashes after a very short time, and the terminal output stays stuck at the point shown below.

PyTorch version 1.12.1

Device count 1

/home/galen/isaacgym/python/isaacgym/_bindings/src/gymtorch

Using /home/galen/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...

Emitting ninja build file /home/galen/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...

Building extension module gymtorch...

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)

ninja: no work to do.

Loading extension module gymtorch...

close dis:0.1

load object from dataset !!!!!!!!!!!!!!!!!!!!!!!!!!!

unseencategory

Obs type: gf

meta data generated

Not connected to PVD

Physics Engine: PhysX

Physics Device: cpu

GPU Pipeline: disabled

JointSpec type free not yet supported!

create env in:0.09011673927307129

Subscene 0 has 1 articulations

AII6 commented 1 month ago

I have identified where it gets stuck. It happens while running the _reset_simulator() function in the ShadowHandCon class. Specifically, the line self.gym.destroy_sim(self.sim) keeps running but never completes. Perhaps it has encountered a deadlock. Could you provide some assistance?
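One way to confirm where a hang like this sits is Python's standard faulthandler module, which can dump the stack of every thread if a call has not returned after a timeout. The sketch below is only an illustration: run_with_hang_watchdog is a hypothetical helper, not part of this repository or of Isaac Gym, and the 30-second default is an arbitrary assumption.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, timeout_s=30.0, file=sys.stderr):
    """Call fn(); if it is still running after timeout_s seconds,
    dump the Python stack of every thread to `file` and keep waiting.

    Hypothetical helper for illustration only.
    """
    # The dump is written by a watchdog thread that does not need the GIL,
    # so it still fires when the main thread is blocked inside native code.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        return fn()
    finally:
        # Cancel the pending dump once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


# Possible use inside _reset_simulator() (assumed call site):
#     run_with_hang_watchdog(lambda: self.gym.destroy_sim(self.sim))
```

The dump only shows Python-level frames, so it can confirm that destroy_sim is the blocking call but not what happens inside the native library. For the segmentation fault itself, launching the script with python -X faulthandler prints the Python traceback at the moment of the crash.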

tianhaowuhz commented 1 month ago

I have only tested on the GPU and never run on the CPU, so I am not sure whether running on the CPU causes other problems. As for the segmentation fault, it may be the same problem as issue #2.

AII6 commented 1 month ago

Actually, I have set the number of environments to 5 and it still gets stuck at the line self.gym.destroy_sim(self.sim). Could you please test it on the CPU? It may be a common problem.

tianhaowuhz commented 1 month ago

I tested on the CPU and hit the same issue, though I'm really not sure why it gets stuck. I still recommend using the GPU: Isaac Gym performs differently on the CPU and the GPU, and if I recall correctly, some APIs also differ between the two. You can try running with num_envs=1, which may require around 6 GB of GPU memory. Also, since the current code re-creates the env for every test, it is normal to see the viewer crash and open again.
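Before launching on the GPU, free memory can be compared against that rough 6 GB figure. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the threshold is only the estimate given above, and it reads free memory from nvidia-smi, whose GPU ordering may not match the CUDA device index passed as --run_device_id.

```python
import subprocess


def parse_free_mib(nvidia_smi_output):
    """Parse one free-memory value in MiB per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_free_mib():
    """Ask nvidia-smi for the free memory of every GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_free_mib(out)


def has_enough_memory(free_mib, device_id=0, required_mib=6 * 1024):
    """Return True if the chosen GPU has at least required_mib MiB free.

    required_mib defaults to the rough 6 GB estimate, not a measured value.
    """
    return free_mib[device_id] >= required_mib


# Example use on a machine with an NVIDIA driver installed:
#     if not has_enough_memory(query_free_mib(), device_id=0):
#         print("GPU 0 may not have enough free memory for num_envs=1")
```

Keeping the parsing separate from the nvidia-smi call makes the check easy to try on sample output without a GPU.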

AII6 commented 1 month ago

OK, thank you for testing. I have now switched to another computer and it runs successfully on the GPU. Thanks.