ml-struct-bio / drgnai


Failure to run train related to tkinter #2

Closed: mpm116 closed this issue 3 months ago

mpm116 commented 5 months ago

I've played about with the number of workers/threads (including 1 worker and 1 thread) but am not really sure what to try. Attached is the run log:

(INFO) (reconstruct.py) (18-Jun-24 12:55:11) # =====> SGD Epoch: -1 finished in 0:03:03.966651; total loss = 1.087761
(INFO) (analysis.py) (18-Jun-24 12:55:15) Explained variance ratio:
(INFO) (analysis.py) (18-Jun-24 12:55:15) [0.26263918 0.25768008 0.24865601 0.23102474]
(INFO) (reconstruct.py) (18-Jun-24 12:55:21) Will use pose search on 57795 particles
(INFO) (reconstruct.py) (18-Jun-24 12:55:21) Will make a full summary at the end of this epoch
Exception ignored in: <function Image.__del__ at 0x7f70a700aee0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 4017, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f70a6ff25e0>
Traceback (most recent call last):
  File "/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/tkinter/__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

The following then appears automatically, before everything hangs:

/hlowdata4/mpm116/software/miniconda3/envs/drgnai-env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Any suggestions welcome and of course if any more information required please let me know! Thanks!

mpm116 commented 5 months ago

Update:

I removed 'tk' and reinstalled it in the conda environment. For some reason this also meant I had to reinstall Python 3.9... It now seems to be stable; I will report back on this.

Currently running with 1 worker and 16 threads...

Further update: the same error occurred after:

(INFO) (reconstruct.py) (18-Jun-24 18:41:21) # [Train Epoch: 0/108] [19904/57795 particles]

Even when setting the number of workers and threads to 1, I still see two pt_main_thread processes running.

michal-g commented 4 months ago

We are seeing other users run into this issue as well (see #4). We have not encountered this particular error ourselves, but based on the error messages across the two threads it looks related to these problems, both of which were solved by switching matplotlib to the Agg backend:
https://stackoverflow.com/questions/14694408/runtimeerror-main-thread-is-not-in-main-loop
https://stackoverflow.com/questions/27147300/matplotlib-tcl-asyncdelete-async-handler-deleted-by-the-wrong-thread

This matches the behaviour we are seeing in both of your cases, as matplotlib is used during reconstruction when making epoch summaries, and that is indeed where drgnai train seems to be hanging!
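For reference, the fix amounts to selecting matplotlib's non-interactive Agg backend before pyplot is imported anywhere in the training process, so that no Tk objects are ever created outside the main thread. A minimal sketch of the idea (not the patch itself):

import matplotlib
matplotlib.use("Agg")          # must run before `import matplotlib.pyplot`

import matplotlib.pyplot as plt

# Plots render straight to file, with no display or Tk involved
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
fig.savefig("summary.png")
plt.close(fig)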

I've created a patch branch agg-backend to try this fix; you can pull it down from the remote repo and reinstall using pip install ., or create a new folder where this branch is checked out and install from there:

git clone git@github.com:ml-struct-bio/drgnai.git --branch agg-backend drgnai-pub_agg-folder/ --single-branch
cd drgnai-pub_agg-folder/
pip install .
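If you want to test the effect without switching branches, matplotlib also honours the MPLBACKEND environment variable; this is a generic matplotlib workaround rather than part of the patch:

import os
os.environ["MPLBACKEND"] = "Agg"   # must be set before matplotlib is imported

import matplotlib
print(matplotlib.get_backend())    # should report the Agg backend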

Please let us know if this helps, and thank you both for taking the time to write detailed bug reports!

papaig commented 4 months ago

Hi,

My problem was solved with the new version of the program. Thank you!

Best regards, Gabor

mpm116 commented 4 months ago

Same here, running now. Hopefully it can work some magic! Thanks for the help.

mpm116 commented 4 months ago

Just encountered a CUDA out of memory error, which I guess is likely a data/particle-number and system-setup related issue. I am assuming I can play with the number of workers/threads? For reference, I have ~60k particles with a box size of 230, and was using 4 x 24 GB GPU cards.

michal-g commented 4 months ago

I would recommend using the lazy parameter for less memory-intensive (but slightly slower) data loading, as well as playing around with the batch sizes in the configs.yaml:

lazy: true
batch_size_known_poses: 32
batch_size_hps: 8
batch_size_sgd: 256

Shown above are the default batch sizes currently used in DRGN-AI, but try making them smaller to see if that helps!
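For context, lazy changes how the particle stack is read rather than what is computed: instead of pulling the whole stack into RAM up front, images are fetched from disk as each batch needs them. A rough illustration of the pattern using a memory-mapped .mrcs stack (this is not DRGN-AI's actual loader, just a sketch):

import mrcfile
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ParticleStackSketch(Dataset):
    def __init__(self, mrcs_path, lazy=True):
        if lazy:
            # memory-map the stack; images stay on disk until accessed
            self._mrc = mrcfile.mmap(mrcs_path, mode="r")
            self.data = self._mrc.data
        else:
            # eager: the whole stack is read into RAM at once
            with mrcfile.open(mrcs_path) as mrc:
                self.data = np.array(mrc.data)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return torch.from_numpy(np.array(self.data[idx], dtype=np.float32))

# Smaller batches lower peak GPU memory at the cost of more iterations,
# which is the same trade-off as shrinking the batch_size_* values above.
# loader = DataLoader(ParticleStackSketch("particles.mrcs"), batch_size=8)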

papaig commented 4 months ago

To chime in: I have a large dataset, 1.1 million particles at a box size of 128, around 78 GB. I have a memory issue as well. I tried to run DRGN-AI on it using a server with 2 A100 GPUs with 80 GB GPU memory and 512 GB RAM. It gets killed, apparently by overloading the memory:

/software/miniconda3/envs/drgnai-env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Killed

Can the lazy method solve this? Is there a way to avoid loading too many images into memory?

michal-g commented 4 months ago

Yup, that is definitely a dataset for which lazy is necessary in most cases; the strategy it employs is precisely to load images as they are needed, rather than trying to load the entire dataset all at once at the beginning!

More details can be found in the cryodrgn.source.ImageSource class, in which the .data attribute stores the loaded dataset (or does not store it, when using lazy), and the _images() method retrieves images from .data or from file as needed!
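For anyone curious, the pattern described there looks roughly like this (a schematic sketch, not the actual cryodrgn code):

import numpy as np

class ImageSourceSketch:
    def __init__(self, path, lazy=True):
        self.path = path
        self.lazy = lazy
        # .data holds the full array only when lazy loading is disabled
        self.data = None if lazy else np.load(path)

    def _images(self, indices):
        if self.data is not None:
            return self.data[indices]
        # lazy path: read just the requested images back from file
        return np.load(self.path, mmap_mode="r")[indices]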