Open Southyang opened 1 year ago
When running this command line, can you also run watch -n 0.5 nvidia-smi to check whether the processes are running on the GPU? You should see GPU power utilization go up for the GPU you are running on.
Unfortunately, I am not able to reproduce this issue on my end. Could you also post the exact conda environment that you are using (conda env export > environment.yml), so that I can investigate further?
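The GPU check suggested above can also be scripted. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names (parse_gpu_stats, query_gpu_stats) are hypothetical and are not part of the cow repository. It reads utilization and power draw through nvidia-smi's CSV query mode, so the numbers can be logged while pasture_runner.py runs.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY_FIELDS = "index,utilization.gpu,power.draw"


def _to_float(text):
    """Return a float, or None for values such as "[N/A]"."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, power = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "utilization_pct": _to_float(util),
            "power_w": _to_float(power),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return one dict per GPU (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling query_gpu_stats() every half second in a loop gives roughly the same information as watch -n 0.5 nvidia-smi, in a form that can be written to a log file.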
After waiting one hour, I ran watch -n 0.5 nvidia-smi together with
python pasture_runner.py -a src.models.agent_fbe_owl -n 4 --arch B32 --center
I then ran conda env export > environment.yml:
name: cow
channels:
- aihabitat
- pytorch
- defaults
- conda-forge
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro/
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- attrs=21.4.0=pyhd3eb1b0_0
- brotli=1.0.9=he6710b0_2
- bzip2=1.0.8=h7b6447c_0
- c-ares=1.18.1=h7f8727e_0
- ca-certificates=2022.6.15=ha878542_0
- certifi=2022.6.15=py37h89c1867_0
- cmake=3.14.0=h52cb24c_0
- cycler=0.11.0=pyhd3eb1b0_0
- dbus=1.13.18=hb2f20db_0
- expat=2.4.4=h295c915_0
- ffmpeg=4.3=hf484d3e_0
- fontconfig=2.13.1=h6c09931_0
- fonttools=4.25.0=pyhd3eb1b0_0
- freetype=2.11.0=h70c0345_0
- giflib=5.2.1=h7b6447c_0
- gitdb=4.0.9=pyhd8ed1ab_0
- gitpython=3.1.27=pyhd8ed1ab_0
- glib=2.69.1=h4ff587b_1
- gmp=6.2.1=h295c915_3
- gnutls=3.6.15=he1e5248_0
- gst-plugins-base=1.14.0=hbbd80ab_1
- gstreamer=1.14.0=h28cd5cc_2
- habitat-sim-mutex=1.0=headless_nobullet
- headless=2.0=0
- icu=58.2=he6710b0_3
- imageio=2.19.3=pyhcf75d05_0
- imageio-ffmpeg=0.4.7=pyhd8ed1ab_0
- jbig=2.1=h7f98852_2003
- jpeg=9e=h166bdaf_1
- kiwisolver=1.4.2=py37h7cecad7_1
- krb5=1.19.2=hac12032_0
- lame=3.100=h7f98852_1001
- lcms2=2.12=h3be6417_0
- ld_impl_linux-64=2.38=h1181459_1
- lerc=2.2.1=h2531618_0
- libblas=3.9.0=15_linux64_openblas
- libcblas=3.9.0=15_linux64_openblas
- libcurl=7.82.0=h0b77cf5_0
- libdeflate=1.7=h27cfd23_5
- libedit=3.1.20210910=h7f8727e_0
- libev=4.33=h7f8727e_1
- libffi=3.3=he6710b0_2
- libgcc-ng=11.2.0=h1234567_1
- libgfortran-ng=12.1.0=h69a702a_16
- libgfortran5=12.1.0=hdcd56e2_16
- libgomp=11.2.0=h1234567_1
- libiconv=1.16=h7f8727e_2
- libidn2=2.3.2=h7f8727e_0
- liblapack=3.9.0=15_linux64_openblas
- libllvm11=11.1.0=h3826bc1_1
- libnghttp2=1.46.0=hce63b2e_0
- libopenblas=0.3.20=h043d6bf_1
- libpng=1.6.37=h21135ba_2
- libssh2=1.10.0=h8f2d780_0
- libstdcxx-ng=11.2.0=h1234567_1
- libtasn1=4.16.0=h27cfd23_0
- libtiff=4.3.0=hf544144_1
- libunistring=0.9.10=h27cfd23_0
- libuuid=1.0.3=h7f8727e_2
- libwebp=1.2.2=h55f646e_0
- libwebp-base=1.2.2=h7f98852_1
- libxcb=1.13=h7f98852_1004
- libxml2=2.9.14=h74e7548_0
- llvmlite=0.38.0=py37h4ff587b_0
- lz4-c=1.9.3=h9c3ff4c_1
- matplotlib=3.5.1=py37h06a4308_1
- matplotlib-base=3.5.1=py37ha18d171_1
- munkres=1.1.4=py_0
- ncurses=6.3=h7f8727e_2
- nettle=3.7.3=hbbd107a_1
- numba=0.55.1=py37h51133e4_0
- numpy=1.21.6=py37h976b520_0
- olefile=0.46=pyh9f0ad1d_1
- openh264=2.1.1=h4ff587b_0
- openjpeg=2.4.0=hb52868f_1
- openssl=1.1.1o=h7f8727e_0
- packaging=21.3=pyhd3eb1b0_0
- pcre=8.45=h295c915_0
- pip=21.2.2=py37h06a4308_0
- pthread-stubs=0.4=h36c2ea0_1001
- pyparsing=3.0.9=pyhd8ed1ab_0
- pyqt=5.9.2=py37h05f1152_2
- python=3.7.13=h12debd9_0
- python-dateutil=2.8.2=pyhd3eb1b0_0
- python_abi=3.7=2_cp37m
- qt=5.9.7=h5867ecd_1
- quaternion=2022.4.1=py37h540881e_0
- readline=8.1.2=h7f8727e_1
- rhash=1.4.1=h3c74f83_1
- scipy=1.7.3=py37hf2a6cf1_0
- setuptools=61.2.0=py37h06a4308_0
- sip=4.19.8=py37hf484d3e_0
- six=1.16.0=pyhd3eb1b0_1
- smmap=3.0.5=pyhd3eb1b0_0
- sqlite=3.38.5=hc218d9a_0
- tbb=2021.5.0=hd09550d_0
- tk=8.6.12=h1ccaba5_0
- tornado=6.1=py37h540881e_3
- tqdm=4.64.0=py37h06a4308_0
- typing-extensions=4.2.0=hd8ed1ab_1
- typing_extensions=4.2.0=pyha770c72_1
- wheel=0.37.1=pyhd3eb1b0_0
- xorg-fixesproto=5.0=h7f98852_1002
- xorg-inputproto=2.3.2=h7f98852_1002
- xorg-kbproto=1.0.7=h7f98852_1002
- xorg-libx11=1.7.2=h7f98852_0
- xorg-libxau=1.0.9=h7f98852_0
- xorg-libxcursor=1.2.0=h7f98852_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xorg-libxext=1.3.4=h7f98852_1
- xorg-libxfixes=5.0.3=h7f98852_1004
- xorg-libxi=1.7.10=h7f98852_0
- xorg-libxinerama=1.1.4=h9c3ff4c_1001
- xorg-libxrandr=1.5.2=h7f98852_1
- xorg-libxrender=0.9.10=h7f98852_1003
- xorg-randrproto=1.5.0=h7f98852_1001
- xorg-renderproto=0.11.1=h7f98852_1002
- xorg-xextproto=7.3.0=h7f98852_1002
- xorg-xproto=7.0.31=h27cfd23_1007
- xz=5.2.5=h7f8727e_1
- zlib=1.2.12=h7f8727e_2
- zstd=1.5.2=ha4553b6_0
- pip:
  - absl-py==1.1.0
  - ai2thor==4.3.0
  - aiohttp==3.8.1
  - aiosignal==1.2.0
  - allenact==0.5.1
  - allenact-plugins==0.5.1
  - astunparse==1.6.3
  - async-timeout==4.0.2
  - asynctest==0.13.0
  - aws-requests-auth==0.4.3
  - botocore==1.27.18
  - box2d-py==2.3.8
  - cachetools==5.2.0
  - charset-normalizer==2.0.12
  - click==8.1.3
  - cloudpickle==1.6.0
  - colour==0.1.5
  - datasets==2.3.2
  - decorator==4.4.2
  - dill==0.3.5.1
  - docker-pycreds==0.4.0
  - filelock==3.7.1
  - flask==2.1.2
  - flatbuffers==1.12
  - frozenlist==1.3.0
  - fsspec==2022.5.0
  - ftfy==6.1.1
  - gast==0.4.0
  - google-auth==2.8.0
  - google-auth-oauthlib==0.4.6
  - google-pasta==0.2.0
  - grpcio==1.47.0
  - gym==0.19.0
  - gym-minigrid==1.0.3
  - gym-notices==0.0.8
  - h5py==3.7.0
  - habitat-sim==0.2.1
  - huggingface-hub==0.8.1
  - idna==3.3
  - importlib-metadata==4.12.0
  - itsdangerous==2.1.2
  - jinja2==3.1.2
  - jmespath==1.0.1
  - joblib==1.1.0
  - keras==2.9.0
  - keras-preprocessing==1.1.2
  - libclang==14.0.1
  - markdown==3.3.7
  - markupsafe==2.1.1
  - moviepy==1.0.3
  - msgpack==1.0.4
  - multidict==6.0.2
  - multiprocess==0.70.13
  - networkx==2.6.3
  - oauthlib==3.2.0
  - opencv-python==4.6.0.66
  - opt-einsum==3.3.0
  - pandas==1.3.5
  - pathtools==0.1.2
  - patsy==0.5.2
  - pickle5==0.0.12
  - pillow==8.4.0
  - proglog==0.1.10
  - progressbar2==4.0.0
  - promise==2.3
  - protobuf==3.19.4
  - psutil==5.9.1
  - pyarrow==8.0.0
  - pyasn1==0.4.8
  - pyasn1-modules==0.2.8
  - pyglet==1.5.26
  - pyquaternion==0.9.9
  - python-utils==3.3.3
  - python-xlib==0.31
  - pytz==2022.1
  - pyyaml==6.0
  - regex==2022.6.2
  - requests==2.28.0
  - requests-oauthlib==1.3.1
  - responses==0.18.0
  - rsa==4.8
  - scikit-learn==1.0.2
  - sentry-sdk==1.9.0
  - setproctitle==1.2.3
  - shortuuid==1.0.9
  - tensorboard==2.9.1
  - tensorboard-data-server==0.6.1
  - tensorboard-plugin-wit==1.8.1
  - tensorboardx==2.5.1
  - tensorflow==2.9.1
  - tensorflow-estimator==2.9.0
  - tensorflow-io-gcs-filesystem==0.26.0
  - termcolor==1.1.0
  - threadpoolctl==3.1.0
  - timm==0.6.7
  - tokenizers==0.12.1
  - torch==1.11.0
  - torchaudio==0.11.0
  - torchvision==0.12.0
  - transformers==4.21.1
  - trimesh==3.14.0
  - urllib3==1.26.9
  - wandb==0.13.2
  - wcwidth==0.2.5
  - werkzeug==2.1.2
  - wrapt==1.14.1
  - xxhash==3.0.0
  - yacs==0.1.8
  - yarl==1.7.2
  - zipp==3.8.0
prefix: /home/pi/anaconda3/envs/cow
I am deploying this project on a GPU. After I ran this command:
python pasture_runner.py -a src.models.agent_fbe_owl -n 1 --arch B32 --center
this error appeared:
Traceback (most recent call last):
  File "pasture_runner.py", line 278, in <module>
    main()
  File "pasture_runner.py", line 273, in main
    test=False
  File "/home/southyang/southyang/code/cow/robothor_challenge.py", line 470, in inference
    timeout=1000)
  File "/home/southyang/anaconda3/envs/cow/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty
I read the corresponding part of the code but could not find where the problem is. How can I solve it?
I'm having the same issue
The program gets stuck here. Is it a problem with ai2thor?
robothor_challenge.py, line 267
As I continued to debug, I found that it gets stuck here. Does this file exist? ai2thor/fifo_server.py
@Southyang are you still running into issues?
Yeah, I still have this issue. I also ran into another problem: I want to run the Grad-CAM localization strategy on its own, and wrote the following code:
import matplotlib.pyplot as plt
import numpy as np
import torch
from PIL import Image

# ClipGrad, EnvTypes, ClassTypes and get_env_class_vars come from the cow
# repository; their import lines were not included in the original post.


def main():
    prompts_path = "./prompt_templates/simple_template.json"
    env_type = EnvTypes.ROBOTHOR
    class_type = ClassTypes.REGULAR
    classes, classes_clip, agent_height, floor_tolerance, negate_action, templates = get_env_class_vars(prompts_path, env_type, class_type)

    clip_model_name = "ViT-B/32"
    threshold = 0.625  # clip weight
    device_number = 0
    device = torch.device("cpu")
    if torch.cuda.is_available():
        device = torch.device("cuda:{0}".format(device_number))
    center_only = False

    print(clip_model_name, classes, classes_clip, templates, threshold, device, center_only)

    Gard_model = ClipGrad(clip_model_name, classes, classes_clip,
                          templates, threshold, device,
                          center_only=center_only)
    # print(Gard_model.class_to_language_feature['HousePlant'])

    # Preprocess the image and compute the relevance map for one class.
    pic = Image.open("./scene2.png")
    image = Gard_model.preprocess(pic).unsqueeze(0).to(device)
    image_relevance = Gard_model.forward(image, 'HousePlant')
    print(image_relevance.shape)

    bg_img = plt.imread('./scene2.png')

    # normalize
    adjusted_tensor = np.resize(image_relevance, (bg_img.shape[1], bg_img.shape[0]))
    denominator = np.max(adjusted_tensor) - np.min(adjusted_tensor)
    if denominator != 0:
        normalized = (adjusted_tensor - np.min(adjusted_tensor)) / denominator
    else:
        normalized = adjusted_tensor
    # print(normalized)

    # Overlay the normalized map on the original image.
    plt.imshow(bg_img)
    plt.imshow(normalized, alpha=0.2, cmap='hot')
    plt.title('Grad-CAM Blended')
    plt.show()


if __name__ == '__main__':
    main()
But the output looks like this:
logits_per_image: tensor([[19.7161]], device='cuda:0', grad_fn=<MmBackward0>)
image_relevance:
tensor([[8.9781e-04, 8.9781e-04, 8.9781e-04, ..., 2.7663e-04, 2.7663e-04,
2.7663e-04],
[8.9781e-04, 8.9781e-04, 8.9781e-04, ..., 2.7663e-04, 2.7663e-04,
2.7663e-04],
[8.9781e-04, 8.9781e-04, 8.9781e-04, ..., 2.7663e-04, 2.7663e-04,
2.7663e-04],
...,
[1.4318e-03, 1.4318e-03, 1.4318e-03, ..., 9.6540e-05, 9.6540e-05,
9.6540e-05],
[1.4318e-03, 1.4318e-03, 1.4318e-03, ..., 9.6540e-05, 9.6540e-05,
9.6540e-05],
[1.4318e-03, 1.4318e-03, 1.4318e-03, ..., 9.6540e-05, 9.6540e-05,
9.6540e-05]], device='cuda:0')
image_relevance * self.gradient_scalar > self.threshold:
tensor([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]], device='cuda:0')
torch.Size([224, 224])
After entering the interpret_vit function (clip_grad.py, line 84), the values become very small, so none of them passes the threshold and I am unable to draw the heat map.
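One thing that may affect the plot independently of the small values: np.resize does not interpolate. It flattens the array and repeats or truncates its elements to fill the new shape, and the shape it expects is (height, width), i.e. bg_img.shape[:2]. The sketch below is only an illustration under those assumptions, not the repository's method. It uses hypothetical helpers (resize_nearest, min_max_normalize) and assumes the relevance map has already been converted to a NumPy array, for example with image_relevance.detach().cpu().numpy().

```python
import numpy as np


def resize_nearest(relevance, out_h, out_w):
    """Nearest-neighbour resize of a 2-D map to (out_h, out_w).

    Unlike np.resize, each output pixel is taken from the spatially
    corresponding input pixel, so the layout of the map is preserved.
    """
    in_h, in_w = relevance.shape
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return relevance[rows[:, None], cols[None, :]]


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant map becomes all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span
```

With these helpers, the overlay step would become normalized = min_max_normalize(resize_nearest(relevance_np, bg_img.shape[0], bg_img.shape[1])), where relevance_np is the NumPy copy of the map. Min-max scaling makes even very small relevance values visible in the plot, although it does not change whether they pass the threshold inside ClipGrad.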
When I run this command:
python scripts/startx.py
it always shows:
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
(EE) Fatal server error:
(EE) Cannot establish any listening sockets - Make sure an X server isn't already running(EE)
(EE) Please consult the The X.Org Foundation support at http://wiki.x.org for help.
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.
Is this the cause of the queue.Empty problem? Has anyone solved the queue.Empty problem?
I'm having the same issue too. Has anyone found a solution to it yet?
@CatLiZi @Southyang I have encountered the same issue. May I ask whether you have solved this problem?
I think the problem is at line 475. The receive_queue.get(timeout=1000) call at line 470 does not raise a TimeoutError exception; it raises a queue.Empty exception. So I think lines 475 and 484 just need to be changed to except queue.Empty: to catch the Empty exception, as in line 274. The author may have found the issue and fixed it at line 274 but forgotten the other two.
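The behaviour described in the comment above can be checked in isolation. This is a minimal sketch using only the standard library; get_or_none and timeout_exception_name are hypothetical helpers, not functions from robothor_challenge.py. It shows that Queue.get with a timeout raises queue.Empty, so an except TimeoutError: clause would not catch it.

```python
import multiprocessing
import queue


def get_or_none(q, timeout):
    """Return the next item, or None if nothing arrives within `timeout` seconds."""
    try:
        return q.get(timeout=timeout)
    except queue.Empty:  # this is what a timed-out get() raises
        return None


def timeout_exception_name(q, timeout):
    """Report which exception class a timed-out get() actually raises."""
    try:
        q.get(timeout=timeout)
    except TimeoutError:
        return "TimeoutError"
    except queue.Empty:
        return "queue.Empty"
    return "no exception"
```

Note that catching queue.Empty only changes how the timeout is reported. If the worker never puts a result on the queue, the underlying hang still has to be found.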
@LinqingZhong @Southyang As I debugged further, I found that the reason the queue is empty matches the question raised earlier in this thread: execution gets stuck at this line of code and the thread is blocked, so the subsequent inference code never runs.
Has anyone found a solution to this? I cannot dig any deeper into it myself.
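When a call blocks like this, one way to see exactly where it is stuck is the standard library's faulthandler module, which can dump the tracebacks of all threads after a delay. The following is a minimal sketch; hang_watchdog is a hypothetical helper, and wrapping the Controller initialization in it is an assumption about where the hang occurs, based on the comments in this thread.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(seconds, file=None):
    """Dump all thread tracebacks if the wrapped block runs longer than `seconds`.

    The program is not terminated (exit=False); the dump only shows where
    each thread is currently blocked.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(seconds, exit=False, file=target)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes in time.
        faulthandler.cancel_dump_traceback_later()
```

For example, placing the suspect call inside with hang_watchdog(60): prints the blocked stack to stderr after one minute, which narrows down whether the hang is inside ai2thor or in the calling code.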
For me, the problem was the x_display setting defined at line 439. I run the project over ssh on a server. First, (possibly) because no graphical desktop session was logged in on the server, initializing Controller at line 267 raised an error. Second, Controller only initialized successfully with x_display set to :2; values such as :0.0, :1.0 and :2.1 all failed. cow/issues/7 should help with finding which values can be set. On a headless server (one with no display), CloudRendering might be needed.
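To find candidate values for x_display, one option is to list the X servers that are actually running on the machine. This is a rough sketch, assuming a Linux host where X servers create their Unix sockets in the conventional /tmp/.X11-unix directory; the helper names are hypothetical and nothing here comes from the cow repository.

```python
import os
import re

# Conventional location of X server sockets on Linux (assumption).
X11_SOCKET_DIR = "/tmp/.X11-unix"


def displays_from_socket_names(names):
    """Turn socket file names such as "X2" into display strings such as ":2"."""
    numbers = []
    for name in names:
        match = re.fullmatch(r"X(\d+)", name)
        if match:
            numbers.append(int(match.group(1)))
    return [f":{n}" for n in sorted(numbers)]


def list_local_displays(socket_dir=X11_SOCKET_DIR):
    """Return display strings for the X servers with a socket in `socket_dir`."""
    if not os.path.isdir(socket_dir):
        return []
    return displays_from_socket_names(os.listdir(socket_dir))
```

If list_local_displays() returns an empty list, no X server is running, which is consistent with the headless case mentioned above. If it returns, say, [":2"], that is the value worth trying for x_display.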