hendrycks / anomaly-seg

The Combined Anomalous Object Segmentation (CAOS) Benchmark

Single GPU #3

Closed giannifranchi closed 4 years ago

giannifranchi commented 4 years ago

Does the code work with a single GPU? I managed to make it work with multiple GPUs, but each time I try with just one GPU I get the following error:

    self.run()
      File "/usr/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/home/user/venv_pytorch1/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 21, in _pin_memory_loop
        r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
        return _ForkingPickler.loads(res)
      File "/home/user/venv_pytorch1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 284, in rebuild_storage_fd
        fd = df.detach()
    segm_downsampling_rate: 8
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/home/user/venv_pytorch1/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 21, in _pin_memory_loop
        r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
        return _ForkingPickler.loads(res)
      File "/home/user/venv_pytorch1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 284, in rebuild_storage_fd
        fd = df.detach()
      File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
        with _resource_sharer.get_connection(self._id) as conn:
      File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
        c = Client(address, authkey=process.current_process().authkey)
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
        answer_challenge(c, authkey)
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
        message = connection.recv_bytes(256)         # reject large message
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
        buf = self._recv_bytes(maxlength)
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
        buf = self._recv(4)
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
        chunk = read(handle, remaining)
    ConnectionResetError: [Errno 104] Connection reset by peer

xksteven commented 4 years ago

Thank you for pointing this out.

In the train.py file, you can add the following lines to make it work for a single GPU. I believe it will, however, break multi-GPU training.

    # With a single GPU the loader yields a one-element list: unwrap it and
    # move the tensors to the GPU manually before calling the module.
    if type(batch_data) == type([]):
        batch_data = batch_data[0]
        batch_data["img_data"] = batch_data["img_data"].cuda()
        batch_data["seg_label"] = batch_data["seg_label"].cuda()
    loss, acc = segmentation_module(batch_data)

Let me know if you still have any issues after making the change.
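
If you also need the multi-GPU path to keep working, a rough variant along these lines might help. This is an untested sketch, not the code in the repo: it assumes a `num_gpus` count is available in the training loop (e.g. derived from the script's GPU argument) and that the data-parallel wrapper still expects the list of per-GPU batches.

    # Sketch only: gate the single-GPU workaround on the GPU count so the
    # multi-GPU path is left untouched. `num_gpus` is an assumed name taken
    # from the script's arguments; adjust it to your train.py.
    if isinstance(batch_data, list) and num_gpus == 1:
        # Single GPU: unwrap the one-element list and move tensors ourselves,
        # since there is no data-parallel wrapper to scatter them.
        batch_data = batch_data[0]
        batch_data["img_data"] = batch_data["img_data"].cuda()
        batch_data["seg_label"] = batch_data["seg_label"].cuda()
    # Multi GPU: leave the list as-is for the data-parallel wrapper to scatter.
    loss, acc = segmentation_module(batch_data)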

To run with a single GPU, make sure you run with the flag --gpu 0; the default is to run on 4 GPUs.
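
For reference, the invocation would look roughly like this; treat it as an illustration, since the script name and the exact flag spelling vary between versions of the training script:

    python3 train.py --gpu 0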

Update: I have also updated the README to hopefully make the data formatting process clearer.

LT1st commented 2 years ago

I wonder which train.py you are using? My train.py is cloned from https://github.com/CSAILVision/semantic-segmentation-pytorch/tree/5c2e9f6f3a231ae9ea150a0019d161fe2896efcf, and batch_data does not exist in that train.py.

xksteven commented 2 years ago

The comment and debugging for this issue probably happened around that point in the other repo's history. We have since updated our code to work with the newer refactored codebase, which includes both model and runtime improvements. Since we did not plan to keep tracking the other repo, we made it a submodule pinned at the specific version we used at the time of publication.

xksteven commented 2 years ago

Suggested solution to try out: https://github.com/CSAILVision/semantic-segmentation-pytorch/issues/58