mhamilton723 / STEGO

Unsupervised Semantic Segmentation by Distilling Feature Correspondences

eval gets stuck indefinitely #16

Open kaushikb258 opened 2 years ago

kaushikb258 commented 2 years ago

eval_segmentation.py gets stuck on the Potsdam data. The issue is in batched_crf(), at the following line:

outputs = pool.map(_apply_crf, zip(img_tensor.detach().cpu(), prob_tensor.detach().cpu()))

The code never proceeds further; one process appears to be waiting on the others indefinitely. Any suggestions?
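For context, the surrounding helper follows roughly this pattern (a minimal sketch; the exact pool construction in eval_segmentation.py may differ, and dense_crf is the repository's per-image CRF helper, also used in the serial rewrite further down):

import torch
from multiprocessing import Pool

def _apply_crf(tup):
    # Unpack one (image, class-probability) pair and run the dense CRF on it
    return dense_crf(tup[0], tup[1])

def batched_crf(pool, img_tensor, prob_tensor):
    # Fan the batch out across the worker pool, then re-stack the per-image results
    outputs = pool.map(_apply_crf, zip(img_tensor.detach().cpu(), prob_tensor.detach().cpu()))
    return torch.cat([torch.from_numpy(arr).unsqueeze(0) for arr in outputs], dim=0)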

mhamilton723 commented 2 years ago

Hey @kaushikb258, how long did you wait? The CRF for the Potsdam slices can take a few minutes to complete.

kaushikb258 commented 2 years ago

I ran the eval code on Potsdam for 4-5 hours and there is still no result (the code is still running). Even training didn't take this long.

mhamilton723 commented 2 years ago

Yes, that definitely sounds like it's stuck; appreciate the context here. Perhaps set the num workers in this line

https://github.com/mhamilton723/STEGO/blob/d1341b9bac32f27039db1c924eb8c4b4e6b9298a/src/eval_segmentation.py#L118

to something small and see if that stops you from getting stuck. If that helps, it's probably due to starvation or something similar.
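For example (assuming the linked line constructs the pool roughly like the snippet below; the actual construction may differ), hard-coding a small worker count would look like:

from multiprocessing import Pool

# Use a small fixed pool size instead of deriving it from cfg.num_workers
with Pool(2) as pool:
    ...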

kaushikb258 commented 2 years ago

I decreased the num workers, but there was no progress. So I wrote a serial version of the CRF code, and it works now. Attaching it below in case it helps others... (GitHub is screwing up the indentation!)

def batched_crf(img_tensor, prob_tensor):
    # Serial CRF: run dense_crf on each image in the batch one at a time,
    # avoiding the multiprocessing pool entirely.
    batch_size = list(img_tensor.size())[0]
    img_tensor_cpu = img_tensor.detach().cpu()
    prob_tensor_cpu = prob_tensor.detach().cpu()
    outs = []
    for i in range(batch_size):
        out = dense_crf(img_tensor_cpu[i], prob_tensor_cpu[i])
        outs.append(out)
    return torch.cat([torch.from_numpy(arr).unsqueeze(0) for arr in outs], dim=0)

Supgb commented 2 years ago

It can be avoided by simply replacing https://github.com/mhamilton723/STEGO/blob/d1341b9bac32f27039db1c924eb8c4b4e6b9298a/src/eval_segmentation.py#L118 with

from multiprocessing import get_context

with get_context('spawn').Pool(cfg.num_workers + 5) as pool:
    ...
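For anyone who wants to check the pattern in isolation, here is a self-contained sketch of a spawn-based pool (the worker function and data are placeholders, not the repository's code):

from multiprocessing import get_context

def _square(x):
    # Stand-in for the per-image CRF work
    return x * x

if __name__ == "__main__":
    # 'spawn' starts clean worker interpreters instead of forking the parent,
    # so workers do not inherit locks or CUDA state from the main process.
    with get_context("spawn").Pool(4) as pool:
        print(pool.map(_square, range(8)))

The likely reason this helps: the default start method on Linux is fork, and forking a process that already holds threads or an initialized CUDA context (as a PyTorch eval script typically does) can leave workers blocked on inherited locks. Spawn avoids inheriting that state, at the cost of re-importing the module in each worker.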