Wait at 0% while training. Epoch 0: 0% 0/13000

cats0212 commented 2 years ago

Hi, I ran 'python -m surfemb.scripts.train ycbv'. ./data/bop/ycbv/models has 21 .ply files and ./data/bop/ycbv/train_real has 80 folders. I did not use synthetic images, but the progress bar always stays at 0% (Epoch 0: 0%). Is the program preprocessing the images, for example cropping objects from them? I waited a dozen hours and it was still at 0%, with the Python process still running. Do I have to wait a long time before the model starts training? Have you seen a similar situation?

But if I keep just 3 .ply files in ./data/bop/ycbv/models and 1 folder in ./data/bop/ycbv/train_real, the CNN starts training quickly (after 3 to 4 minutes) and I end up with a trained model. For example: obj_000008.ply, obj_000014.ply and obj_000021.ply in ./data/bop/ycbv/models, and 000000 in ./data/bop/ycbv/train_real (the images in 000000 contain only 3 object types: obj_8, obj_14, obj_21).

If I want to train on all the objects in ycbv at once, do I have to wait longer? I'm using a server, and its CPU is not weak.
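To confirm the dataset layout described above before a long run, a small check like the following may help. This is only a sketch: `summarize_bop_dataset` is a hypothetical helper, not part of surfemb, and it assumes the model files follow the `obj_*.ply` naming and the scenes live in `train_real`, as in the paths quoted in this comment.

```python
from pathlib import Path


def summarize_bop_dataset(root):
    """Count model files and real training scene folders under a BOP dataset dir.

    Hypothetical helper for illustration only; surfemb does not ship it.
    """
    root = Path(root)
    models_dir = root / "models"
    scenes_dir = root / "train_real"

    # Only obj_*.ply files are counted; other files in models/ are ignored.
    n_models = len(list(models_dir.glob("obj_*.ply"))) if models_dir.is_dir() else 0
    # Each sub-folder of train_real is treated as one scene.
    n_scenes = (
        sum(1 for p in scenes_dir.iterdir() if p.is_dir())
        if scenes_dir.is_dir()
        else 0
    )
    return {"models": n_models, "train_real_scenes": n_scenes}


if __name__ == "__main__":
    print(summarize_bop_dataset("./data/bop/ycbv"))
```

With the full setup described above, the expected counts would be 21 models and 80 scenes; with the reduced setup, 3 models and 1 scene.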

{
  "os": "Linux-4.15.0-175-generic-x86_64-with-debian-buster-sid",
  "python": "3.7.11",
  "heartbeatAt": "2022-04-09T09:34:05.311715",
  "startedAt": "2022-04-09T09:34:02.496814",
  "docker": null,
  "gpu": "GeForce RTX 3090",
  "gpu_count": 8,
  "cpu_count": 40,
  "cuda": null,
  "args": [],
  "state": "running",
  "program": "-m surfemb.scripts.train",
  "git": {
    "remote": "https://github.com/rasmushaugaard/surfemb.git",
    "commit": "46f46ddc5670848d696968dc8ec65c8ce62b16a8"
  },
  "email": "qwzf@qq.com",
  "root": "/home/aa/prjs/surfemb",
  "host": "sddx-PR4908P",
  "username": "aa",
  "executable": "/home/aa/anaconda3/envs/d2_1.10/bin/python"
}

logs:
2022-04-09 11:02:46,213 INFO MainThread:16337 [wandb_setup.py:_flush():75] Loading settings from /home/aa/.config/wandb/settings
2022-04-09 11:02:46,214 INFO MainThread:16337 [wandb_setup.py:_flush():75] Loading settings from /home/aa/prjs/bcnet/pose/surfemb/wandb/settings
2022-04-09 11:02:46,214 INFO MainThread:16337 [wandb_setup.py:_flush():75] Loading settings from environment variables: {'api_key': 'REDACTED', 'mode': 'offline', '_require_service': 'True'}
2022-04-09 11:02:46,214 WARNING MainThread:16337 [wandb_setup.py:_flush():75] Could not find program at -m surfemb.scripts.train
2022-04-09 11:02:46,214 INFO MainThread:16337 [wandb_setup.py:_flush():75] Inferring run settings from compute environment: {'program_relpath': None, 'program': '-m surfemb.scripts.train'}
2022-04-09 11:02:46,214 INFO MainThread:16337 [wandb_init.py:_log_setup():405] Logging user logs to /home/aa/prjs/bcnet/pose/surfemb/wandb/offline-run-20220409_110246-3fewafz3/logs/debug.log
2022-04-09 11:02:46,214 INFO MainThread:16337 [wandb_init.py:_log_setup():406] Logging internal logs to /home/aa/prjs/bcnet/pose/surfemb/wandb/offline-run-20220409_110246-3fewafz3/logs/debug-internal.log
2022-04-09 11:02:46,215 INFO MainThread:16337 [wandb_init.py:init():439] calling init triggers
2022-04-09 11:02:46,215 INFO MainThread:16337 [wandb_init.py:init():443] wandb.init called with sweep_config: {}
config: {}
2022-04-09 11:02:46,215 INFO MainThread:16337 [wandb_init.py:init():492] starting backend
2022-04-09 11:02:46,228 INFO MainThread:16337 [backend.py:_multiprocessing_setup():101] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2022-04-09 11:02:46,232 INFO MainThread:16337 [wandb_init.py:init():501] backend started and connected
2022-04-09 11:02:46,238 INFO MainThread:16337 [wandb_init.py:init():565] updated telemetry
2022-04-09 11:02:46,578 INFO MainThread:16337 [wandb_init.py:init():625] starting run threads in backend
2022-04-09 11:02:49,104 INFO MainThread:16337 [wandb_run.py:_console_start():1733] atexit reg
2022-04-09 11:02:49,106 INFO MainThread:16337 [wandb_run.py:_redirect():1606] redirect: SettingsConsole.WRAP
2022-04-09 11:02:49,107 INFO MainThread:16337 [wandb_run.py:_redirect():1643] Wrapping output streams.
2022-04-09 11:02:49,108 INFO MainThread:16337 [wandb_run.py:_redirect():1667] Redirects installed.
2022-04-09 11:02:49,109 INFO MainThread:16337 [wandb_init.py:init():664] run started, returning control to user process
2022-04-09 11:02:49,130 INFO MainThread:16337 [wandb_run.py:_config_callback():992] config_cb None None {'n_objs': 21, 'emb_dim': 12, 'n_pos': 1024, 'n_neg': 1024, 'lr_cnn': 0.0003, 'lr_mlp': 3e-05, 'mlp_name': 'siren', 'mlp_hidden_features': 256, 'mlp_hidden_layers': 2, 'key_noise': 0.001, 'warmup_steps': 2000, 'separate_decoders': True, 'pa_sigma': 0.0, 'align_corners': False, 'dataset': 'ycbv', 'n_valid': 200, 'res_data': 256, 'res_crop': 224, 'batch_size': 16, 'num_workers': 'None', 'min_visib_fract': 0.1, 'max_steps': 500000, 'gpus': 2, 'debug': False, 'ckpt': 'None', 'synth': False, 'real': True}
2022-04-09 11:07:50,141 WARNING MsgRouterThr:16337 [router.py:message_loop():76] message_loop has been closed

rasmushaugaard commented 2 years ago

I've only trained models with all the objects. Loading objects and crop info should not take more than a minute. Have you tried with the conda environment?

cats0212 commented 2 years ago

I have tried with the conda environment. If I use only one GPU, everything works, but with 2 or 4 GPUs the training never starts. I don't know what the reason is. Did you use 2 or more GPUs for training? Thank you.

cats0212 commented 2 years ago

But if I only train on a few objects, I can use 2 or 4 GPUs. This bug is very confusing.
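For a hang like this, one way to see where each process is stuck is Python's standard `faulthandler` module. The sketch below is not part of surfemb: `enable_hang_diagnostics` and `disable_hang_diagnostics` are hypothetical helper names, and the 600-second timeout is an arbitrary assumption. It would need to be called early in the training entry point, and in each spawned worker if the workers are separate processes.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=600, out=sys.stderr):
    """Dump Python stack traces of all threads when the process appears stuck.

    Hypothetical helper for illustration; `out` must be a real file with a
    file descriptor and must stay open while diagnostics are enabled.
    """
    # Print tracebacks on fatal errors such as segfaults.
    faulthandler.enable(file=out, all_threads=True)
    # Print tracebacks every `timeout_s` seconds until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    # On Unix, also dump on demand with `kill -USR1 <pid>`.
    if hasattr(faulthandler, "register"):
        faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)


def disable_hang_diagnostics():
    """Undo everything enable_hang_diagnostics() set up."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(faulthandler, "register"):
        faulthandler.unregister(signal.SIGUSR1)
    faulthandler.disable()
```

Comparing the dumped stacks between the 1-GPU run and the 2-GPU run should show whether the processes are waiting in data loading or in inter-GPU communication. If the multi-GPU backend is NCCL (an assumption, the thread does not say), setting the environment variable `NCCL_DEBUG=INFO` before launching may add useful detail.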

rasmushaugaard / surfemb, issue #4