wandb / wandb

The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
https://wandb.ai
MIT License

[App]: wandb hover: Waiting for W&B process to finish... (success). #4441

Closed MJ-Zeng closed 1 year ago

MJ-Zeng commented 1 year ago

Current Behavior

wandb is stuck; it always hangs at the line "wandb: Waiting for W&B process to finish... (success)." (screenshots attached showing the hang)

Expected Behavior

No response

Steps To Reproduce

No response

Screenshots

No response

Environment

OS:

Browsers:

Version:

Additional Context

No response

ramit-wandb commented 1 year ago

Hi @PussInCode ,

Could you share the debug.log and debug-internal.log files associated with an affected run? They should be present on your machine in the wandb folder relative to your working directory, and they should contain more information about these runs so that we can dig in deeper.
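For reference, those logs live under a per-run directory such as wandb/run-<timestamp>-<run_id>/logs/, as the paths quoted later in this thread show. A minimal sketch, assuming the default ./wandb layout, that prints the log paths for the most recent run:

from pathlib import Path

# Assumes the default layout: ./wandb/run-<timestamp>-<id>/logs/debug*.log
runs = sorted(p for p in Path("wandb").glob("run-*") if p.is_dir())
if runs:
    latest = runs[-1]
    for name in ("debug.log", "debug-internal.log"):
        log_path = latest / "logs" / name
        print(log_path, "(exists)" if log_path.exists() else "(missing)")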

ramit-wandb commented 1 year ago

Hi @PussInCode, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Hal9999 commented 1 year ago

Hi, I'm experiencing the same issue.

I'm running a hyperparameter sweep, and after each run (made of 10 epochs) the process gets stuck on "wandb: Waiting for W&B process to finish... (success)."

I noticed that it is just very slow at uploading data to the cloud dashboard, around 15-20 minutes per run, much longer than the training itself takes.

This happens on my local computer both with plain Python and Jupyter notebooks, and also when using the wandb command line with a config.yaml.

The same thing doesn't happen when I use Google Colab.
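One mitigation for slow uploads of this kind, assuming the data can be synced after training, is to log offline and upload later with the wandb sync command; a minimal sketch (the project name is a placeholder):

import wandb

# Nothing is uploaded while the script runs; data is written to ./wandb only.
run = wandb.init(project="my-project", mode="offline")
run.log({"loss": 0.1})
run.finish()

# Afterwards, upload the offline run directory from a shell, e.g.:
#   wandb sync wandb/offline-run-*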

Luca1920342 commented 1 year ago

I also have the same issue. I initialize the run with run = wandb.init(project=PROJECT, sync_tensorboard=True, job_type='training')

and end with run.finish()

but it does not end; it prints Waiting for W&B process to finish... (success). and keeps running.

I'll upload my debug.log and debug-internal.log (the latter is updating while finish() is running)

Attachments: debug.log, debug-internal.log
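For completeness, a self-contained sketch of that pattern (PROJECT is a placeholder, and a PyTorch SummaryWriter is assumed purely so that sync_tensorboard has something to sync):

import wandb
from torch.utils.tensorboard import SummaryWriter

PROJECT = "my-project"  # placeholder

run = wandb.init(project=PROJECT, sync_tensorboard=True, job_type="training")
writer = SummaryWriter()
for step in range(10):
    writer.add_scalar("loss", 1.0 / (step + 1), step)
writer.close()
run.finish()  # the call that hangs at "Waiting for W&B process to finish..."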

2catycm commented 1 year ago

same issue occurs

ChristopherMarais commented 1 year ago

Same issue

noamsgl commented 1 year ago

Same issue

justchenhao commented 1 year ago

Same issue

alvinshao0313 commented 1 year ago

same issue

etasnadi commented 1 year ago

The same issue here.

It seems that the tfevents file cannot be sent to the server; the client retries two more times, but waits 15 minutes between attempts.

hxb727628998 commented 1 year ago

same issue

Akramz commented 1 year ago

Same issue.

tangbao commented 1 year ago

same issue

RoHei commented 1 year ago

same issue

adaruna3 commented 1 year ago

same issue @ramit-wandb

mkotyushev commented 1 year ago

I have encountered the same issue on one of my agent machines I use to run sweeps.

It does not happen after every sweep, so a workaround is to kill the hung processes manually. I ran ps aux | grep python and found the two hung processes.

I killed them using pkill -9 python (although it is better to use kill -9 <PID> for each of them), and the next sweep started without any issues.

Edit: I use multiprocessing to run 5-fold CV inside each sweep as in this example, but all the child processes seem to end (at least the GPU memory is freed). Also, on another machine (with roughly the same Docker environment) this issue does not appear at all.
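A rough sketch of that manual cleanup in Python, assuming psutil is installed and that the hung helpers can be recognised by "wandb" in their command line; killing specific PIDs this way is safer than a blanket pkill -9 python:

import os
import psutil

for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    # Skip this script itself and anything that is not a wandb helper process.
    if proc.pid == os.getpid() or "wandb" not in cmdline:
        continue
    print(f"killing {proc.pid}: {cmdline}")
    proc.kill()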

nicofirst1 commented 1 year ago

Same issue here too. It would be nice to have a verbose output to see the progress of wandb syncing.

liqing-ustc commented 1 year ago

Same issue here too.

TYTTYTTYT commented 1 year ago

Same issue here too.

yahah100 commented 1 year ago

Same issue here too.

crj1998 commented 1 year ago

Same issue here too.

betternichole commented 1 year ago

Hi, I am encountering the same problem; could you let me know how to solve it? Here is the log file.

2023-07-17 11:33:36,226 INFO MainThread:988 [wandb_setup.py:_flush():76] Current SDK version is 0.15.5
2023-07-17 11:33:36,227 INFO MainThread:988 [wandb_setup.py:_flush():76] Configure stats pid to 988
2023-07-17 11:33:36,227 INFO MainThread:988 [wandb_setup.py:_flush():76] Loading settings from /root/.config/wandb/settings
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Loading settings from /content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/wandb/settings
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train/train_models.py', 'program': '/content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/train/train_models.py'}
2023-07-17 11:33:36,229 INFO MainThread:988 [wandb_setup.py:_flush():76] Applying login settings: {'api_key': 'REDACTED'}
2023-07-17 11:33:36,230 INFO MainThread:988 [wandb_init.py:_log_setup():507] Logging user logs to /content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/wandb/run-20230717_113336-yo3q2tpy/logs/debug.log
2023-07-17 11:33:36,231 INFO MainThread:988 [wandb_init.py:_log_setup():508] Logging internal logs to /content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/wandb/run-20230717_113336-yo3q2tpy/logs/debug-internal.log
2023-07-17 11:33:36,231 INFO MainThread:988 [wandb_init.py:init():547] calling init triggers
2023-07-17 11:33:36,232 INFO MainThread:988 [wandb_init.py:init():554] wandb.init called with sweep_config: {} config: {'DATASET': 'walmart', 'NUM_SERIES': 100, 'HISTORY_SIZE': 30, 'TARGET_SIZE': 1, 'STRIDE': 1, 'CONT_FEATURES': [0, 9, 10, 11, 12, 13], 'CAT_FEATURES': [1, 2, 3, 4, 5, 6, 7, 8], 'BATCH_SIZE': 512, 'EPOCHS': 200, 'PATIENCE': 10, 'MODEL': 'lstm', 'NUM_LAYERS': 1, 'NUM_UNITS': 16, 'DROPOUT': 0}
2023-07-17 11:33:36,232 INFO MainThread:988 [wandb_init.py:init():571] re-initializing run, found existing run on stack: i6g4aj1j
2023-07-17 11:33:36,233 INFO MainThread:988 [wandb_run.py:_finish():1887] finishing run yigenannan/Interpreting_TS/i6g4aj1j
2023-07-17 11:33:36,234 INFO MainThread:988 [wandb_run.py:_atexit_cleanup():2121] got exitcode: 0
2023-07-17 11:33:36,235 INFO MainThread:988 [wandb_run.py:_restore():2104] restore
2023-07-17 11:33:36,235 INFO MainThread:988 [wandb_run.py:_restore():2110] restore done
2023-07-17 11:33:40,830 INFO MainThread:988 [wandb_run.py:_footer_history_summary_info():3464] rendering history
2023-07-17 11:33:40,830 INFO MainThread:988 [wandb_run.py:_footer_history_summary_info():3496] rendering summary
2023-07-17 11:33:40,840 INFO MainThread:988 [wandb_run.py:_footer_sync_info():3423] logging synced files
2023-07-17 11:33:40,894 INFO MainThread:988 [wandb_init.py:init():596] starting backend
2023-07-17 11:33:40,894 INFO MainThread:988 [wandb_init.py:init():600] setting up manager
2023-07-17 11:33:40,900 INFO MainThread:988 [backend.py:_multiprocessing_setup():106] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2023-07-17 11:33:40,904 INFO MainThread:988 [wandb_init.py:init():606] backend started and connected
2023-07-17 11:33:40,907 INFO MainThread:988 [wandb_init.py:init():705] updated telemetry
2023-07-17 11:33:40,910 INFO MainThread:988 [wandb_init.py:init():738] communicating run to backend with 60.0 second timeout
2023-07-17 11:33:42,377 INFO MainThread:988 [wandb_run.py:_on_init():2173] communicating current version
2023-07-17 11:33:42,461 INFO MainThread:988 [wandb_run.py:_on_init():2182] got version response
2023-07-17 11:33:42,461 INFO MainThread:988 [wandb_init.py:init():789] starting run threads in backend
2023-07-17 11:33:42,519 INFO MainThread:988 [wandb_run.py:_console_start():2152] atexit reg
2023-07-17 11:33:42,523 INFO MainThread:988 [wandb_run.py:_redirect():2007] redirect: SettingsConsole.WRAP_RAW
2023-07-17 11:33:42,524 INFO MainThread:988 [wandb_run.py:_redirect():2072] Wrapping output streams.
2023-07-17 11:33:42,524 INFO MainThread:988 [wandb_run.py:_redirect():2097] Redirects installed.
2023-07-17 11:33:42,525 INFO MainThread:988 [wandb_init.py:init():830] run started, returning control to user process
2023-07-17 11:33:44,083 INFO MainThread:988 [wandb_run.py:_config_callback():1281] config_cb None None {'FEATURES': ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'IsHoliday', 'Store', 'Dept', 'Year', 'month', 'weekofmonth', 'day', 'Type', 'Size'], 'CONT_FEATURES': [0, 9, 10, 11, 12, 13], 'CAT_FEATURES': [1, 2, 3, 4, 5, 6, 7, 8]}

bhatiaabhinav commented 1 year ago

Same issue. Why is this closed?

MichaelGaliciaZ commented 1 year ago

I have the same issue. In fact, wandb does upload the tracked information, but it does not let my script finish.

liuqi8827 commented 10 months ago

Same issue.

WangX0111 commented 10 months ago

same issue

HelloWorldLTY commented 9 months ago

Same issue. Even offline mode does not work.

chengengliu commented 9 months ago

I run a sequence of trainings, and this hang blocks the whole program, even though the correct stats were uploaded to wandb. A workaround: use ps aux | grep python to find the blocked process; in my case I wanted to resume the pipeline and knew the stats had been saved successfully, so I killed the corresponding Python process. After checking debug-internal.log, I found this:

2023-12-08 17:18:17,913 INFO HandlerThread:2521165 [system_monitor.py:probe():211] Finished collecting system info
2023-12-08 17:18:17,913 INFO HandlerThread:2521165 [system_monitor.py:probe():214] Publishing system info
2023-12-08 17:18:17,913 DEBUG HandlerThread:2521165 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment
2023-12-08 17:18:17,916 DEBUG HandlerThread:2521165 [system_info.py:_save_pip():67] Saving pip packages done
2023-12-08 17:18:17,916 DEBUG HandlerThread:2521165 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment
2023-12-08 17:18:17,924 ERROR HandlerThread:2521165 [system_info.py:_save_conda():85] Error saving conda packages: [Errno 2] No such file or directory: 'conda'

I also checked earlier logs, and they do not contain this error; maybe this is what blocks wandb.finish()?
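A quick way to check for the condition shown in that log (the ERROR line indicates wandb shelled out to a conda binary that is not on the PATH):

import shutil

# If this prints None, the conda-package collection step would fail the same
# way as in the log above ([Errno 2] No such file or directory: 'conda').
print("conda found at:", shutil.which("conda"))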

xingzhongyu commented 9 months ago

same issue

HelloWorldLTY commented 8 months ago

I somehow addressed this problem by changing the order of the run-close and wandb-close calls:

run.finish()
wandb.finish()
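A hedged sketch of that workaround (the project name is a placeholder; for a single active run, run.finish() and wandb.finish() finish the same run, so the second call is normally a no-op):

import wandb

run = wandb.init(project="my-project")  # placeholder project name
# ... training and logging ...
run.finish()    # finish the run handle first, as suggested above
wandb.finish()  # module-level finish as a fallback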