wandb / server

W&B Server is the self-hosted version of Weights & Biases.
MIT License

Connection stuck without any error message #84

Open Janus-Shiau opened 2 years ago

Janus-Shiau commented 2 years ago

I have used W&B local for years. Recently, the client sometimes gets stuck when a training run finishes.

There are no error or warning messages at all; the last terminal messages I get on the client side are:

Terminal Messages

wandb: Synced RUN_NAME: SERVER_ADDRESS
wandb: Synced 7 W&B file(s), 58800 media file(s), 0 artifact file(s) and 3 other file(s)
wandb: Find logs at: ../artifacts/wandb/run-20220730_190417-2ql6zqpk/logs

Environment & Version

My local instance is running on Ubuntu 16.04, and its version is 0.15.0. My client side is running on Ubuntu 16.04 or 18.04, and its version is 0.12.21.

I really enjoy the experience of using W&B local; thank you for developing this awesome MLOps tool. I hope this issue can be reproduced and solved soon.

vanpelt commented 2 years ago

Looks like you logged 58800 media files, such as images or video. That will take a long time to upload and might fill up or overwhelm your disk. You should reduce the number of media files logged, or purchase a license for a commercial version that can connect to cloud storage.

Janus-Shiau commented 2 years ago

The total size of these media files is small, about 75 MB.

The synchronization also gets stuck on runs without any media files.

vanpelt commented 2 years ago

You can find details about what our process is doing by looking at the wandb/debug-internal.log file, relative to your script. We would need to see that to understand what's making the process stall.

Janus-Shiau commented 2 years ago

This is the log from wandb/debug-internal.log. I copied the INFO and DEBUG lines logged right after the last scan save.

2022-08-16 12:36:41,934 INFO    SenderThread:4422 [sender.py:transition_state():459] send defer: 8
2022-08-16 12:36:41,935 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:41,941 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:41,941 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 8
2022-08-16 12:36:41,942 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:41,942 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 8
2022-08-16 12:36:41,942 INFO    SenderThread:4422 [file_pusher.py:finish():171] shutting down file pusher
2022-08-16 12:36:42,044 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,044 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,142 INFO    Thread-11 :4422 [sender.py:transition_state():459] send defer: 9
2022-08-16 12:36:42,143 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,143 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 9
2022-08-16 12:36:42,143 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,143 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 9
2022-08-16 12:36:42,154 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,213 INFO    SenderThread:4422 [sender.py:transition_state():459] send defer: 10
2022-08-16 12:36:42,214 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,229 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,229 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 10
2022-08-16 12:36:42,229 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,230 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 10
2022-08-16 12:36:42,230 INFO    SenderThread:4422 [sender.py:transition_state():459] send defer: 11
2022-08-16 12:36:42,231 DEBUG   SenderThread:4422 [sender.py:send():302] send: final
2022-08-16 12:36:42,231 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,231 DEBUG   SenderThread:4422 [sender.py:send():302] send: footer
2022-08-16 12:36:42,231 INFO    HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 11
2022-08-16 12:36:42,232 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,232 INFO    SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 11
2022-08-16 12:36:42,332 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,335 DEBUG   SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,355 INFO    SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher
2022-08-16 12:36:42,474 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: sampled_history
2022-08-16 12:36:42,489 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: get_summary
2022-08-16 12:36:42,491 DEBUG   HandlerThread:4422 [handler.py:handle_request():141] handle_request: shutdown
2022-08-16 12:36:42,491 INFO    HandlerThread:4422 [handler.py:finish():810] shutting down handler
2022-08-16 12:36:43,231 INFO    WriterThread:4422 [datastore.py:close():279] close: ../artifacts/wandb/run-20220815_150813-2jk5mgtn/run-2jk5mgtn.wandb
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [sender.py:finish():1312] shutting down sender
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [file_pusher.py:finish():171] shutting down file pusher
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher
2022-08-16 12:36:48,400 INFO    MainThread:4422 [internal.py:handle_exit():80] Internal process exited
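
One way to look for the stall in a log like the one above is to measure the silence between consecutive entries. The following is only a minimal sketch: the regular expression is inferred from the lines pasted in this thread rather than from any documented log format, and the helper names (parse_log, largest_gap) are made up for illustration and are not part of the wandb SDK.

```python
import re
from datetime import datetime

# Assumed line layout, inferred from the log excerpt in this thread:
# <timestamp> <LEVEL> <thread>:<pid> [<file>:<function>:<line>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<thread>.+?):(?P<pid>\d+)\s+"
    r"\[(?P<where>[^\]]+)\]\s+"
    r"(?P<msg>.*)$"
)

# Four consecutive lines copied from the excerpt above.
SAMPLE = """\
2022-08-16 12:36:43,231 INFO    WriterThread:4422 [datastore.py:close():279] close: ../artifacts/wandb/run-20220815_150813-2jk5mgtn/run-2jk5mgtn.wandb
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [sender.py:finish():1312] shutting down sender
2022-08-16 12:36:43,372 INFO    SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher
2022-08-16 12:36:48,400 INFO    MainThread:4422 [internal.py:handle_exit():80] Internal process exited
"""


def parse_log(lines):
    """Return (timestamp, level, thread, message) for every line that matches."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # skip traceback lines and anything else without a timestamp
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
        entries.append((ts, match["level"], match["thread"].strip(), match["msg"]))
    return entries


def largest_gap(entries):
    """Return (gap_seconds, entry_before, entry_after) for the longest silence, or None."""
    best = None
    for before, after in zip(entries, entries[1:]):
        gap = (after[0] - before[0]).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best


if __name__ == "__main__":
    result = largest_gap(parse_log(SAMPLE.splitlines()))
    if result is not None:
        gap, before, after = result
        print(f"{gap:.3f}s of silence between {before[3]!r} and {after[3]!r}")
```

To scan a whole file, pass an open file object instead of SAMPLE.splitlines(), for example parse_log(open("wandb/debug-internal.log")). A long gap only shows where the process went quiet; it does not by itself explain why.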

Thank you for your time. I hope this issue can be solved soon.

Janus-Shiau commented 2 years ago

I got a different message today, shown below, just for your reference.

2022-08-23 10:29:30,880 INFO    SenderThread:15179 [sender.py:transition_state():459] send defer: 8
2022-08-23 10:29:30,880 INFO    SenderThread:15179 [sender.py:finish():1312] shutting down sender
2022-08-23 10:29:30,880 INFO    SenderThread:15179 [file_pusher.py:finish():171] shutting down file pusher
2022-08-23 10:29:30,880 INFO    SenderThread:15179 [file_pusher.py:join():176] waiting for file pusher
2022-08-23 10:29:31,006 INFO    WriterThread:15179 [datastore.py:close():279] close: ../artifacts/wandb/run-20220822_185928-2l3ms0qk/run-2l3ms0qk.wandb
2022-08-23 10:29:31,452 ERROR   StreamThr :15179 [internal.py:wandb_internal():165] Thread HandlerThread:
Traceback (most recent call last):
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 51, in run
    self._run()
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 98, in _run
    record = self._input_record_q.get(timeout=1)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/queues.py", line 111, in get
    res = self._recv_bytes()
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
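
For context on the traceback above: in the Python standard library, a multiprocessing connection raises EOFError from recv_bytes() when nothing is left to receive and the other end has been closed. The sketch below reproduces only that standard-library behaviour with a plain Pipe. It is not wandb code, the function name recv_after_close is made up for illustration, and it says nothing about why the other end went away in this run.

```python
from multiprocessing import Pipe


def recv_after_close():
    """Show what a reader sees once the writing end of a pipe is closed."""
    reader, writer = Pipe(duplex=False)  # one-way pipe: (receive end, send end)
    writer.send_bytes(b"last record")
    writer.close()  # stands in for the peer process going away

    # Data sent before the close is still delivered.
    first = reader.recv_bytes()
    try:
        # Nothing left to read and the writer is gone: EOFError is raised.
        reader.recv_bytes()
    except EOFError:
        return first, True
    finally:
        reader.close()
    return first, False


if __name__ == "__main__":
    payload, saw_eof = recv_after_close()
    print(payload, saw_eof)
```

Read this way, the EOFError in the HandlerThread suggests that the process feeding its input queue had already closed its end, which may be a symptom of the shutdown sequence rather than the root cause.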