wandb / wandb

The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
https://wandb.ai
MIT License

[App]: wandb hover: Waiting for W&B process to finish... (success). #4441

Closed MJ-Zeng closed 1 year ago

MJ-Zeng commented 1 year ago

Current Behavior

wandb is stuck; it always hangs at the line "wandb: Waiting for W&B process to finish... (success)." (screenshots attached showing the hang)

Expected Behavior

No response

Steps To Reproduce

No response

Screenshots

No response

Environment

OS:

Browsers:

Version:

Additional Context

No response

ramit-wandb commented 1 year ago

Hi @PussInCode ,

Could you share the debug.log and debug-internal.log files associated with an affected run? They should be present on your machine in the wandb folder relative to your working directory, and they should contain more information about these runs so that we can dig in deeper.
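For reference, those logs live under a per-run directory such as wandb/run-<timestamp>-<run_id>/logs/, as the paths quoted later in this thread show. A minimal sketch, assuming the default ./wandb layout, that prints the log paths for the most recent run:

from pathlib import Path

# Assumes the default layout: ./wandb/run-<timestamp>-<id>/logs/debug*.log
runs = sorted(p for p in Path("wandb").glob("run-*") if p.is_dir())
if runs:
    latest = runs[-1]
    for name in ("debug.log", "debug-internal.log"):
        log_path = latest / "logs" / name
        print(log_path, "(exists)" if log_path.exists() else "(missing)")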

ramit-wandb commented 1 year ago

Hi @PussInCode, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Hal9999 commented 1 year ago

Hi, I'm experiencing the same issue.

I'm running a hyperparameter sweep, and after each run (made of 10 epochs) the process gets stuck on "wandb: Waiting for W&B process to finish... (success)."

I noticed that it is just very slow at uploading data to the cloud dashboard, around 15-20 minutes per run, much longer than the training itself takes.

This happens on my local computer both with plain Python and Jupyter notebooks, and also when using the wandb command line with a config.yaml.

The same thing doesn't happen when I use Google Colab.
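One mitigation for slow uploads of this kind, assuming the data can be synced after training, is to log offline and upload later with the wandb sync command; a minimal sketch (the project name is a placeholder):

import wandb

# Nothing is uploaded while the script runs; data is written to ./wandb only.
run = wandb.init(project="my-project", mode="offline")
run.log({"loss": 0.1})
run.finish()

# Afterwards, upload the offline run directory from a shell, e.g.:
#   wandb sync wandb/offline-run-*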

Luca1920342 commented 1 year ago

I also have the same issue. I initialize the run with run = wandb.init(project=PROJECT, sync_tensorboard=True, job_type='training')

and end with run.finish()

but it does not end; it prints Waiting for W&B process to finish... (success). and keeps running.

I'll upload my debug.log and debug-internal.log (the latter is updating while finish() is running)

Attachments: debug.log, debug-internal.log
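For completeness, a self-contained sketch of that pattern (PROJECT is a placeholder, and a PyTorch SummaryWriter is assumed purely so that sync_tensorboard has something to sync):

import wandb
from torch.utils.tensorboard import SummaryWriter

PROJECT = "my-project"  # placeholder

run = wandb.init(project=PROJECT, sync_tensorboard=True, job_type="training")
writer = SummaryWriter()
for step in range(10):
    writer.add_scalar("loss", 1.0 / (step + 1), step)
writer.close()
run.finish()  # the call that hangs at "Waiting for W&B process to finish..."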

2catycm commented 1 year ago

same issue occurs

ChristopherMarais commented 1 year ago

Same issue

noamsgl commented 1 year ago

Same issue

justchenhao commented 1 year ago

Same issue

alvinshao0313 commented 1 year ago

same issue

etasnadi commented 1 year ago

The same issue here.

It seems that the tfevents file cannot be sent to the server; the client retries two more times, but waits 15 minutes between attempts.

hxb727628998 commented 1 year ago

same issue

Akramz commented 1 year ago

Same issue.

tangbao commented 1 year ago

same issue

RoHei commented 1 year ago

same issue

adaruna3 commented 1 year ago

same issue @ramit-wandb

mkotyushev commented 1 year ago

I have encountered the same issue on one of my agent machines I use to run sweeps.

It does not happen after every sweep, so a workaround is to kill the hung processes manually. I ran ps aux | grep python and found the two hung processes.

I killed them using pkill -9 python (although it is better to use kill -9 <PID> for each of them), and the next sweep started without any issues.

Edit: I use multiprocessing to run 5-fold CV inside each sweep as in this example, but all the child processes seem to end (at least the GPU memory is freed). Also, on another machine (with roughly the same Docker environment) this issue does not appear at all.
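A rough sketch of that manual cleanup in Python, assuming psutil is installed and that the hung helpers can be recognised by "wandb" in their command line; killing specific PIDs this way is safer than a blanket pkill -9 python:

import os
import psutil

for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    # Skip this script itself and anything that is not a wandb helper process.
    if proc.pid == os.getpid() or "wandb" not in cmdline:
        continue
    print(f"killing {proc.pid}: {cmdline}")
    proc.kill()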

nicofirst1 commented 1 year ago

Same issue here too. It would be nice to have a verbose output to see the progress of wandb syncing.

liqing-ustc commented 1 year ago

Same issue here too.

TYTTYTTYT commented 1 year ago

Same issue here too.

yahah100 commented 1 year ago

Same issue here too.

crj1998 commented 1 year ago

Same issue here too.

betternichole commented 1 year ago

Hi, I am encountering the same problem; could you let me know how to solve it? Here is the log file.

2023-07-17 11:33:36,226 INFO MainThread:988 [wandb_setup.py:_flush():76] Current SDK version is 0.15.5
2023-07-17 11:33:36,227 INFO MainThread:988 [wandb_setup.py:_flush():76] Configure stats pid to 988
2023-07-17 11:33:36,227 INFO MainThread:988 [wandb_setup.py:_flush():76] Loading settings from /root/.config/wandb/settings
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Loading settings from /content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/wandb/settings
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2023-07-17 11:33:36,228 INFO MainThread:988 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train/train_models.py', 'program': '/content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/train/train_models.py'}
2023-07-17 11:33:36,229 INFO MainThread:988 [wandb_setup.py:_flush():76] Applying login settings: {'api_key': 'REDACTED'}
2023-07-17 11:33:36,230 INFO MainThread:988 [wandb_init.py:_log_setup():507] Logging user logs to /content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/wandb/run-20230717_113336-yo3q2tpy/logs/debug.log
2023-07-17 11:33:36,231 INFO MainThread:988 [wandb_init.py:_log_setup():508] Logging internal logs to /content/drive/MyDrive/evaluation-local-explanation-time-series-master/src/wandb/run-20230717_113336-yo3q2tpy/logs/debug-internal.log
2023-07-17 11:33:36,231 INFO MainThread:988 [wandb_init.py:init():547] calling init triggers
2023-07-17 11:33:36,232 INFO MainThread:988 [wandb_init.py:init():554] wandb.init called with sweep_config: {} config: {'DATASET': 'walmart', 'NUM_SERIES': 100, 'HISTORY_SIZE': 30, 'TARGET_SIZE': 1, 'STRIDE': 1, 'CONT_FEATURES': [0, 9, 10, 11, 12, 13], 'CAT_FEATURES': [1, 2, 3, 4, 5, 6, 7, 8], 'BATCH_SIZE': 512, 'EPOCHS': 200, 'PATIENCE': 10, 'MODEL': 'lstm', 'NUM_LAYERS': 1, 'NUM_UNITS': 16, 'DROPOUT': 0}
2023-07-17 11:33:36,232 INFO MainThread:988 [wandb_init.py:init():571] re-initializing run, found existing run on stack: i6g4aj1j
2023-07-17 11:33:36,233 INFO MainThread:988 [wandb_run.py:_finish():1887] finishing run yigenannan/Interpreting_TS/i6g4aj1j
2023-07-17 11:33:36,234 INFO MainThread:988 [wandb_run.py:_atexit_cleanup():2121] got exitcode: 0
2023-07-17 11:33:36,235 INFO MainThread:988 [wandb_run.py:_restore():2104] restore
2023-07-17 11:33:36,235 INFO MainThread:988 [wandb_run.py:_restore():2110] restore done
2023-07-17 11:33:40,830 INFO MainThread:988 [wandb_run.py:_footer_history_summary_info():3464] rendering history
2023-07-17 11:33:40,830 INFO MainThread:988 [wandb_run.py:_footer_history_summary_info():3496] rendering summary
2023-07-17 11:33:40,840 INFO MainThread:988 [wandb_run.py:_footer_sync_info():3423] logging synced files
2023-07-17 11:33:40,894 INFO MainThread:988 [wandb_init.py:init():596] starting backend
2023-07-17 11:33:40,894 INFO MainThread:988 [wandb_init.py:init():600] setting up manager
2023-07-17 11:33:40,900 INFO MainThread:988 [backend.py:_multiprocessing_setup():106] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2023-07-17 11:33:40,904 INFO MainThread:988 [wandb_init.py:init():606] backend started and connected
2023-07-17 11:33:40,907 INFO MainThread:988 [wandb_init.py:init():705] updated telemetry
2023-07-17 11:33:40,910 INFO MainThread:988 [wandb_init.py:init():738] communicating run to backend with 60.0 second timeout
2023-07-17 11:33:42,377 INFO MainThread:988 [wandb_run.py:_on_init():2173] communicating current version
2023-07-17 11:33:42,461 INFO MainThread:988 [wandb_run.py:_on_init():2182] got version response
2023-07-17 11:33:42,461 INFO MainThread:988 [wandb_init.py:init():789] starting run threads in backend
2023-07-17 11:33:42,519 INFO MainThread:988 [wandb_run.py:_console_start():2152] atexit reg
2023-07-17 11:33:42,523 INFO MainThread:988 [wandb_run.py:_redirect():2007] redirect: SettingsConsole.WRAP_RAW
2023-07-17 11:33:42,524 INFO MainThread:988 [wandb_run.py:_redirect():2072] Wrapping output streams.
2023-07-17 11:33:42,524 INFO MainThread:988 [wandb_run.py:_redirect():2097] Redirects installed.
2023-07-17 11:33:42,525 INFO MainThread:988 [wandb_init.py:init():830] run started, returning control to user process
2023-07-17 11:33:44,083 INFO MainThread:988 [wandb_run.py:_config_callback():1281] config_cb None None {'FEATURES': ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'IsHoliday', 'Store', 'Dept', 'Year', 'month', 'weekofmonth', 'day', 'Type', 'Size'], 'CONT_FEATURES': [0, 9, 10, 11, 12, 13], 'CAT_FEATURES': [1, 2, 3, 4, 5, 6, 7, 8]}

bhatiaabhinav commented 1 year ago

Same issue. Why is this closed?

MichaelGaliciaZ commented 1 year ago

I have the same issue. In fact, wandb does upload the tracked information, but it does not let my script finish.

liuqi8827 commented 10 months ago

Same issue.

WangX0111 commented 10 months ago

same issue

HelloWorldLTY commented 9 months ago

Same issue. Even offline mode does not work.

chengengliu commented 9 months ago

I run a sequence of trainings, and this hang blocks the whole program, even though the correct stats were uploaded to wandb. A workaround: use ps aux | grep python to find the blocked process; in my case I wanted to resume the pipeline and knew the stats had been saved successfully, so I killed the corresponding Python process. After checking debug-internal.log, I found this:

2023-12-08 17:18:17,913 INFO HandlerThread:2521165 [system_monitor.py:probe():211] Finished collecting system info
2023-12-08 17:18:17,913 INFO HandlerThread:2521165 [system_monitor.py:probe():214] Publishing system info
2023-12-08 17:18:17,913 DEBUG HandlerThread:2521165 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment
2023-12-08 17:18:17,916 DEBUG HandlerThread:2521165 [system_info.py:_save_pip():67] Saving pip packages done
2023-12-08 17:18:17,916 DEBUG HandlerThread:2521165 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment
2023-12-08 17:18:17,924 ERROR HandlerThread:2521165 [system_info.py:_save_conda():85] Error saving conda packages: [Errno 2] No such file or directory: 'conda'

I also checked earlier logs, and they do not contain this error; maybe this is what blocks wandb.finish()?
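A quick way to check for the condition shown in that log (the ERROR line indicates wandb shelled out to a conda binary that is not on the PATH):

import shutil

# If this prints None, the conda-package collection step would fail the same
# way as in the log above ([Errno 2] No such file or directory: 'conda').
print("conda found at:", shutil.which("conda"))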

xingzhongyu commented 9 months ago

same issue

HelloWorldLTY commented 8 months ago

I somehow addressed this problem by changing the order of the run-close and wandb-close calls:

run.finish()
wandb.finish()
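A hedged sketch of that workaround (the project name is a placeholder; for a single active run, run.finish() and wandb.finish() finish the same run, so the second call is normally a no-op):

import wandb

run = wandb.init(project="my-project")  # placeholder project name
# ... training and logging ...
run.finish()    # finish the run handle first, as suggested above
wandb.finish()  # module-level finish as a fallback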