Update:
I removed 'tk' and reinstalled it in the conda environment. For some reason this also meant I had to reinstall Python 3.9... It now seems to be stable; will report back on this
Currently running with 1 worker and 16 threads...
Further update: The same error occurred after:
(INFO) (reconstruct.py) (18-Jun-24 18:41:21) # [Train Epoch: 0/108] [19904/57795 particles]
Even when setting the number of workers and threads to 1, I still see two pt_main_thread processes running
We are seeing other users run into this issue as well (see #4). We have not encountered this particular error ourselves, but based on what I'm seeing in the error messages across the two threads, it looks related to these problems, both of which were solved by using the Agg backend to matplotlib:
https://stackoverflow.com/questions/14694408/runtimeerror-main-thread-is-not-in-main-loop
https://stackoverflow.com/questions/27147300/matplotlib-tcl-asyncdelete-async-handler-deleted-by-the-wrong-thread
This matches the behaviour we are seeing from both of your cases, as matplotlib is used during reconstruction when making summaries of epochs, and that is indeed when drgnai train seems to be hanging!
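For reference, the fix in both of those threads boils down to forcing matplotlib's non-interactive Agg backend before pyplot is ever imported, so figures are never tied to a Tk event loop. A minimal sketch of the idea (the exact placement in the drgnai code may differ):

```python
# Select the non-interactive Agg backend before pyplot is imported anywhere;
# figures are then rasterized in memory and never attached to a Tk GUI loop.
import matplotlib
matplotlib.use("Agg")

import matplotlib.pyplot as plt

# Summaries can still be drawn and written to disk, even from worker threads.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0.5, 0.2, 0.9])
fig.savefig("epoch_summary.png")
plt.close(fig)  # release the figure without ever opening a window
```

Setting the environment variable MPLBACKEND=Agg before launching the job should have the same effect if you want to test this without reinstalling.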
I've created a patch branch agg-backend to try this fix; you can pull it down from the remote repo and reinstall using pip install ., or create a new folder where this branch is checked out and install from there:
git clone git@github.com:ml-struct-bio/drgnai.git --branch agg-backend drgnai-pub_agg-folder/ --single-branch
cd drgnai-pub_agg-folder/
pip install .
Please let us know if this helps, and thank you both for taking the time to write detailed bug reports!
Hi,
My problem was solved with the new version of the program. Thank you!
Best regards, Gabor
Same here.. running now. Hopefully it can work some magic! Thanks for the help
Just encountered a CUDA out of memory error, which I guess is likely a data/particle-number and system-setup related issue. Assuming I can play with the number of workers/threads? For reference, I had 60k particles with a box size of 230, and was using 4 x 24GB GPU cards
I would recommend using the lazy parameter for less memory-intensive (but slightly slower) data loading, as well as playing around with the batch sizes in configs.yaml:
lazy: true
batch_size_known_poses: 32
batch_size_hps: 8
batch_size_sgd: 256
Shown above are the default batch sizes currently used in DRGN-AI; try making them smaller to see if that helps!
To chime in: I have a large dataset of 1.1 million particles at a box size of 128, around 78 GB, and I have a memory issue as well. I tried to run DRGN-AI on it using a server with 2 A100 GPUs with 80GB GPU memory and 512 GB RAM. It gets killed, apparently by overloading the memory:

/software/miniconda3/envs/drgnai-env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Killed

Can the lazy method solve this? Is there a way to avoid loading too many images into memory?
Yup, that is definitely a dataset for which lazy is necessary in most cases, and indeed precisely the strategy it employs is loading images as they are needed, rather than trying to load the entire dataset all at once at the beginning!
More details can be found in the cryodrgn.source.ImageSource class, in which the .data attribute stores the loaded dataset (or does not store it if using lazy), and the _images() method retrieves the images from .data or from file as needed!
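As a rough illustration of the lazy pattern (a simplified sketch only, not the actual cryodrgn.source.ImageSource code; the class and method names here are made up for the example):

```python
import numpy as np

class LazySource:
    """Hypothetical, simplified stand-in for a lazy particle image source."""

    def __init__(self, path, n_images, box_size, lazy=True):
        self.path = path
        self.n_images = n_images
        self.box_size = box_size
        # Eager mode keeps the whole stack in RAM; lazy mode keeps nothing.
        self.data = None if lazy else self._read(np.arange(n_images))

    def _read(self, indices):
        # Stand-in for seeking into the particle stack on disk and reading
        # only the requested images; here we just return zeroed arrays.
        return np.zeros((len(indices), self.box_size, self.box_size),
                        dtype=np.float32)

    def images(self, indices):
        # Serve from memory when the full stack was preloaded, otherwise
        # read just this batch from file.
        if self.data is not None:
            return self.data[indices]
        return self._read(np.asarray(indices))

# With lazy=True, only each requested batch (here 256 images of 128x128,
# roughly 16 MB as float32) is ever held in memory, not the full ~78 GB stack.
src = LazySource("particles.mrcs", n_images=1_100_000, box_size=128, lazy=True)
batch = src.images(np.arange(256))
```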
I've played about with the number of workers/threads (including 1 worker and 1 thread) but I'm not really sure what to try next. Attached is the run log:
(INFO) (reconstruct.py) (18-Jun-24 12:55:11) # =====> SGD Epoch: -1 finished in 0:03:03.966651; total loss = 1.087761
(INFO) (analysis.py) (18-Jun-24 12:55:15) Explained variance ratio:
(INFO) (analysis.py) (18-Jun-24 12:55:15) [0.26263918 0.25768008 0.24865601 0.23102474]
(INFO) (reconstruct.py) (18-Jun-24 12:55:21) Will use pose search on 57795 particles
(INFO) (reconstruct.py) (18-Jun-24 12:55:21) Will make a full summary at the end of this epoch
Exception ignored in: <function Image.__del__ at 0x7f70a700aee0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 4017, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)
This then automatically runs, before everything hangs:
/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Any suggestions welcome and of course if any more information required please let me know! Thanks!