kaushikb258 opened this issue 2 years ago
Hey @kaushikb258, how long did you wait? The CRF for the Potsdam slices can take a few minutes to complete.
I ran the eval code on Potsdam for 4-5 hours and there is still no result (the code is still running). Even training didn't take this long.
Yes, that definitely sounds like it's stuck; appreciate the context here. Perhaps set the num_workers in this line to something small and see if that stops you from getting stuck. If that's the case, it's probably due to worker starvation or something similar.
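For reference, a minimal before/after sketch of that change; the original line is reconstructed from the replacement quoted later in this thread, so treat the exact form as an assumption:

# before (the pool creation at eval_segmentation.py line 118, reconstructed):
with Pool(cfg.num_workers + 5) as pool:
    ...
# after, for debugging -- a small fixed worker count to rule out starvation:
with Pool(2) as pool:
    ...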
I decreased num_workers, but no progress. So I wrote a serial version of the CRF code, and that works now. Attaching it below in case it helps others... (GitHub is screwing up the indentation!)
def batched_crf(img_tensor, prob_tensor):
    # Serial replacement for the pool-based CRF: run dense_crf on each image in
    # the batch one at a time on the CPU (assumes torch and dense_crf are already
    # imported, as in eval_segmentation.py).
    batch_size = list(img_tensor.size())[0]
    img_tensor_cpu = img_tensor.detach().cpu()
    prob_tensor_cpu = prob_tensor.detach().cpu()
    outputs = []
    for i in range(batch_size):
        out = dense_crf(img_tensor_cpu[i], prob_tensor_cpu[i])
        outputs.append(out)
    return torch.cat([torch.from_numpy(arr).unsqueeze(0) for arr in outputs], dim=0)
The hang can be avoided by simply replacing https://github.com/mhamilton723/STEGO/blob/d1341b9bac32f27039db1c924eb8c4b4e6b9298a/src/eval_segmentation.py#L118 with
from multiprocessing import get_context
with get_context('spawn').Pool(cfg.num_workers + 5) as pool:
...
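The usual explanation for this kind of hang (an assumption on my part, but consistent with the symptoms) is that the default 'fork' start method copies the parent process, including CUDA and threading state that the forked workers cannot safely use, so they block inside pool.map forever; 'spawn' starts clean interpreter processes instead. A tiny self-contained illustration of the pattern, unrelated to STEGO itself:

from multiprocessing import get_context

def _square(x):
    # Stand-in for the per-image CRF work each pool worker does.
    return x * x

if __name__ == "__main__":
    # 'spawn' launches fresh Python processes rather than forking the parent,
    # so the workers do not inherit CUDA handles or locks held by the parent.
    with get_context("spawn").Pool(4) as pool:
        print(pool.map(_square, range(8)))

One caveat with 'spawn': the function passed to pool.map must be importable at module level, as _apply_crf evidently is, given that this fix works.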
eval_segmentation.py gets stuck on the Potsdam data. The issue is in batched_crf(), in the following line:
outputs = pool.map(_apply_crf, zip(img_tensor.detach().cpu(), prob_tensor.detach().cpu()))
The code never proceeds further; one process seems to be waiting on the others indefinitely. Any suggestions?
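In case it helps to confirm where things are blocked (a general debugging suggestion, not something from this thread): the standard-library faulthandler can dump the stack of the hung parent process on demand, for example:

import faulthandler
import signal

# Register once near the top of eval_segmentation.py (hypothetical placement, Unix only);
# then `kill -USR1 <pid>` prints every thread's stack trace in that process, which
# should show the main thread blocked inside pool.map.
faulthandler.register(signal.SIGUSR1, all_threads=True)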