mikarubi / voluseg

pipeline for volumetric cell segmentation
MIT License

mask volumes with 4D data: OOM error #41

Open luiztauffer opened 1 month ago

luiztauffer commented 1 month ago

Short Java error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Short Python error traceback:

Cell In[8], line 1
----> 1 voluseg.step3_mask_volumes(parameters)

File /mnt/shared_storage/Github/voluseg/voluseg/_steps/step3.py:159, in mask_volumes(parameters)
    156     volume_accum.add(volume)
    158 if p.parallel_volume:
--> 159     evenly_parallelize(p.volume_names[timepoints]).foreach(add_volume)
    160 else:
    161     for name_volume in p.volume_names[timepoints]:


luiztauffer commented 1 month ago

Weirdly, this error stopped happening after I restarted the Spark local cluster. But it's good to have it here for reference, in case it happens again.

luiztauffer commented 1 month ago

Reopening because this error is happening consistently for the 4D dataset, both on my local machine and on remote machines running with Docker.

Spark keeps running into memory issues at that point in the code; we should probably improve that operation.

log_file.log
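
For reference, one possible direction (a sketch only, not the current voluseg implementation): instead of accumulating each volume into a Spark accumulator inside `foreach`, the sum could be computed with a tree reduction, which keeps partial sums on the executors and only ships the final array back to the driver. The `load_volume` helper and the volume paths below are hypothetical placeholders.

```python
import numpy as np
from pyspark.sql import SparkSession

def load_volume(path):
    # hypothetical loader: returns the volume as a float64 numpy array
    return np.load(path).astype("float64")

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

volume_paths = ["volume_0000.npy", "volume_0001.npy"]  # placeholder list

# treeReduce aggregates partial sums on the executors (depth=2),
# so the driver receives a single final array instead of one add() per volume
volume_sum = (
    sc.parallelize(volume_paths, numSlices=len(volume_paths))
      .map(load_volume)
      .treeReduce(lambda a, b: a + b, depth=2)
)
volume_mean = volume_sum / len(volume_paths)
```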

luiztauffer commented 1 month ago

Setting parallel_volume=False seems to avoid the problem... but this might be inefficient?

luiztauffer commented 1 month ago

maybe related: [attached image]

luiztauffer commented 1 month ago

A similar error happens at step 5 (clean_cells). Again, the error is avoided by setting parallel_clean=False.

Should we consider changing the default values of parallel_volume and parallel_clean to False? @mikarubi
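For anyone hitting this in the meantime, the workaround is just flipping those two flags in the parameters dictionary before running the pipeline. A minimal sketch, assuming the usual parameter_dictionary() helper (adjust to however you build your parameters):

```python
import voluseg

# assuming the default parameter-dictionary helper; adapt to your own setup
parameters = voluseg.parameter_dictionary()

# workaround discussed above: run the memory-hungry steps serially
parameters['parallel_volume'] = False   # step 3 (mask_volumes)
parameters['parallel_clean'] = False    # step 5 (clean_cells)

# ... earlier steps ...
voluseg.step3_mask_volumes(parameters)
```

This trades speed for memory: the volumes are processed one at a time instead of in parallel across the Spark executors.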

mikarubi commented 1 month ago

So, just to clarify -- this is an out-of-memory error, correct? In general, we expect people to start with a lot of RAM for these analyses, so I am inclined to keep these on (so that the jobs run faster without people needing to manually turn them on). Is it possible, at all, to catch this error and return a more meaningful error message to the user? That would probably be ideal.
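One way to do that (a sketch only, written as a drop-in for the parallel branch of mask_volumes shown in the traceback above): PySpark surfaces JVM failures as py4j.protocol.Py4JJavaError, so the OOM could be caught there and re-raised with a message pointing users to the parallel_volume workaround or the Spark memory settings.

```python
from py4j.protocol import Py4JJavaError

try:
    evenly_parallelize(p.volume_names[timepoints]).foreach(add_volume)
except Py4JJavaError as err:
    # java_exception carries the original JVM error, e.g.
    # "java.lang.OutOfMemoryError: GC overhead limit exceeded"
    if "OutOfMemoryError" in str(err.java_exception):
        raise MemoryError(
            "Spark ran out of memory while masking volumes. "
            "Try setting parallel_volume=False, or increase the Spark "
            "driver/executor memory (see spark_properties.conf)."
        ) from err
    raise
```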

luiztauffer commented 1 month ago

The error is possibly due to the memory limit allocated to the worker subprocesses being exceeded. One possible solution would be to configure Spark to increase this limit.
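For example (a sketch; the exact values depend on the machine): in local mode the relevant knob is spark.driver.memory, since the driver JVM also hosts the executors, and "GC overhead limit exceeded" usually just means the heap is too small for the working set. The settings have to be applied before the SparkContext/JVM is created.

```python
from pyspark.sql import SparkSession

# Memory settings must be applied before the JVM is launched,
# i.e. before the first SparkSession/SparkContext is created.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "16g")        # heap for the local driver/executors
    .config("spark.driver.maxResultSize", "8g")  # cap on results collected to the driver
    .getOrCreate()
)
```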

mikarubi commented 1 month ago

Ok, looking at this again.

luiztauffer commented 3 weeks ago

https://github.com/mikarubi/janelia_voluseg/blob/master/spark_properties.conf