mikarubi / voluseg

pipeline for volumetric cell segmentation
MIT License

mask volumes with 4D data: OOM error #41

Open luiztauffer opened 1 month ago

luiztauffer commented 1 month ago

Short Java error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Short Python error traceback:

Cell In[8], line 1
----> 1 voluseg.step3_mask_volumes(parameters)

File /mnt/shared_storage/Github/voluseg/voluseg/_steps/step3.py:159, in mask_volumes(parameters)
    156     volume_accum.add(volume)
    158 if p.parallel_volume:
--> 159     evenly_parallelize(p.volume_names[timepoints]).foreach(add_volume)
    160 else:
    161     for name_volume in p.volume_names[timepoints]:


luiztauffer commented 1 month ago

Weirdly, this error stopped happening after I restarted the Spark local cluster. But it's good to have it here for reference, in case it happens again.

luiztauffer commented 1 month ago

Reopening because this error is happening consistently for the 4D dataset, both on my local machine and on remote machines running with Docker.

Spark keeps running into memory issues at that point in the code; we should probably improve that operation.

log_file.log
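
For reference, one possible direction (a sketch only, not the current voluseg implementation): instead of accumulating each volume into a Spark accumulator inside `foreach`, the sum could be computed with a tree reduction, which keeps partial sums on the executors and only ships the final array back to the driver. The `load_volume` helper and the volume paths below are hypothetical placeholders.

```python
import numpy as np
from pyspark.sql import SparkSession

def load_volume(path):
    # hypothetical loader: returns the volume as a float64 numpy array
    return np.load(path).astype("float64")

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

volume_paths = ["volume_0000.npy", "volume_0001.npy"]  # placeholder list

# treeReduce aggregates partial sums on the executors (depth=2),
# so the driver receives a single final array instead of one add() per volume
volume_sum = (
    sc.parallelize(volume_paths, numSlices=len(volume_paths))
      .map(load_volume)
      .treeReduce(lambda a, b: a + b, depth=2)
)
volume_mean = volume_sum / len(volume_paths)
```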

luiztauffer commented 1 month ago

Setting parallel_volume=False seems to avoid the problem... but this might be inefficient?

luiztauffer commented 1 month ago

maybe related: [attached image]

luiztauffer commented 1 month ago

A similar error happens at step 5 (clean_cells). Again, the error is avoided by setting parallel_clean=False.

Should we consider changing the default values of parallel_volume and parallel_clean to False? @mikarubi
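For anyone hitting this in the meantime, the workaround is just flipping those two flags in the parameters dictionary before running the pipeline. A minimal sketch, assuming the usual parameter_dictionary() helper (adjust to however you build your parameters):

```python
import voluseg

# assuming the default parameter-dictionary helper; adapt to your own setup
parameters = voluseg.parameter_dictionary()

# workaround discussed above: run the memory-hungry steps serially
parameters['parallel_volume'] = False   # step 3 (mask_volumes)
parameters['parallel_clean'] = False    # step 5 (clean_cells)

# ... earlier steps ...
voluseg.step3_mask_volumes(parameters)
```

This trades speed for memory: the volumes are processed one at a time instead of in parallel across the Spark executors.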

mikarubi commented 1 month ago

So, just to clarify -- this is an out-of-memory error, correct? In general, we expect people to start with a lot of RAM for these analyses, so I am inclined to keep these on (so that the jobs run faster without people needing to manually turn them on). Is it possible, at all, to catch this error and return a more meaningful error message to the user? That would probably be ideal.
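One way to do that (a sketch only, written as a drop-in for the parallel branch of mask_volumes shown in the traceback above): PySpark surfaces JVM failures as py4j.protocol.Py4JJavaError, so the OOM could be caught there and re-raised with a message pointing users to the parallel_volume workaround or the Spark memory settings.

```python
from py4j.protocol import Py4JJavaError

try:
    evenly_parallelize(p.volume_names[timepoints]).foreach(add_volume)
except Py4JJavaError as err:
    # java_exception carries the original JVM error, e.g.
    # "java.lang.OutOfMemoryError: GC overhead limit exceeded"
    if "OutOfMemoryError" in str(err.java_exception):
        raise MemoryError(
            "Spark ran out of memory while masking volumes. "
            "Try setting parallel_volume=False, or increase the Spark "
            "driver/executor memory (see spark_properties.conf)."
        ) from err
    raise
```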

luiztauffer commented 1 month ago

The error is possibly due to the memory limit allocated to the worker subprocesses being exceeded. One possible solution would be to configure Spark to increase this limit.
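For example (a sketch; the exact values depend on the machine): in local mode the relevant knob is spark.driver.memory, since the driver JVM also hosts the executors, and "GC overhead limit exceeded" usually just means the heap is too small for the working set. The settings have to be applied before the SparkContext/JVM is created.

```python
from pyspark.sql import SparkSession

# Memory settings must be applied before the JVM is launched,
# i.e. before the first SparkSession/SparkContext is created.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "16g")        # heap for the local driver/executors
    .config("spark.driver.maxResultSize", "8g")  # cap on results collected to the driver
    .getOrCreate()
)
```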

mikarubi commented 1 month ago

Ok, looking at this again.

luiztauffer commented 3 weeks ago

https://github.com/mikarubi/janelia_voluseg/blob/master/spark_properties.conf