regel / loudml

Loud ML is the first open-source AI solution for ICT and IoT automation

[Training Failed] hyperopt.exceptions.AllTrialsFailed #391

Open toni-moreno opened 4 years ago

toni-moreno commented 4 years ago

Hello again, and sorry for the inconvenience. When I try to re-train one of my models, the training sometimes fails with the message below, and I cannot understand exactly what it means.

Any ideas? Thank you in advance!

INFO:root:job[c7aea5d6-7d13-4f99-af9f-3034517f592f] starting, nice=5
INFO:root:connecting to influxdb on influxdb:8086, using database 'loudml'
INFO:root:train(swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@5m) range=2020-08-05T12:50:00.000Z-2020-08-06T12:55:00.000Z train_size=0.670000 batch_size=64 epochs=100)
INFO:root:connecting to influxdb on influxdb:8086, using database 'swarm'
INFO:root:missing data: field 'usage_active', metric 'mean', bucket: 2020-08-05T19:15:00Z
INFO:root:found 289 time periods
100%|██████████| 10/10 [00:00<00:00, 66.66it/s, best loss: ?]
INFO:hyperopt.tpe:tpe_transform took 0.003570 seconds
INFO:hyperopt.tpe:TPE using 0 trials
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.003187 seconds
INFO:hyperopt.tpe:TPE using 1/1 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002791 seconds
INFO:hyperopt.tpe:TPE using 2/2 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.006578 seconds
INFO:hyperopt.tpe:TPE using 3/3 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.003134 seconds
INFO:hyperopt.tpe:TPE using 4/4 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002731 seconds
INFO:hyperopt.tpe:TPE using 5/5 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002899 seconds
INFO:hyperopt.tpe:TPE using 6/6 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.016221 seconds
INFO:hyperopt.tpe:TPE using 7/7 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002902 seconds
INFO:hyperopt.tpe:TPE using 8/8 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002827 seconds
INFO:hyperopt.tpe:TPE using 9/9 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
ERROR:root:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.7/site-packages/loudml/worker.py", line 53, in run
    res = getattr(self, func_name)(*args, **kwargs)
  File "/opt/venv/lib/python3.7/site-packages/loudml/worker.py", line 101, in train
    **kwargs
  File "/opt/venv/lib/python3.7/site-packages/loudml/donut.py", line 1091, in train
    abnormal=abnormal,
  File "/opt/venv/lib/python3.7/site-packages/loudml/donut.py", line 843, in _train_on_dataset
    rstate=fmin_state,
  File "/opt/venv/lib/python3.7/site-packages/hyperopt/fmin.py", line 403, in fmin
    show_progressbar=show_progressbar,
  File "/opt/venv/lib/python3.7/site-packages/hyperopt/base.py", line 651, in fmin
    show_progressbar=show_progressbar)
  File "/opt/venv/lib/python3.7/site-packages/hyperopt/fmin.py", line 426, in fmin
    return trials.argmin
  File "/opt/venv/lib/python3.7/site-packages/hyperopt/base.py", line 600, in argmin
    best_trial = self.best_trial
  File "/opt/venv/lib/python3.7/site-packages/hyperopt/base.py", line 591, in best_trial
    raise AllTrialsFailed
hyperopt.exceptions.AllTrialsFailed
10.0.0.7 - - [2020-08-06 14:53:12] "GET /models/swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@5m HTTP/1.1" 200 878 0.014592
10.0.0.7 - - [2020-08-06 14:53:13] "POST /models/swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@5m/_train?flag_abnormal_data=true&from=now-1d&output_bucket=test-loudml&save_output_data=true&to=now HTTP/1.1" 202 153 0.012024
ERROR:root:job[c7aea5d6-7d13-4f99-af9f-3034517f592f] failed: 

regel commented 4 years ago

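AllTrialsFailed means that all ten attempts (the default number) to optimize the hyperparameters and find a good fit have failed. This usually happens when the training time window contains gaps and missing points. Your log does report missing data for the bucket at 2020-08-05T19:15:00Z, and every iteration fails with "insufficient validation data".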

PS: Thanks for all the feedback, Toni! Very much appreciated!
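
Since gaps in the time window are the usual cause, it may help to check a window for empty buckets before re-training. The following is only a minimal sketch and not part of Loud ML: the function `find_missing_buckets` and the sample data are hypothetical, and fetching the real timestamps from InfluxDB is left out. It assumes you already have the timestamps of the points in the window and the model's bucket interval (5 minutes in the log above).

```python
from datetime import datetime, timedelta, timezone


def find_missing_buckets(timestamps, start, end, interval):
    """Return the start times of buckets in [start, end) that hold no data point.

    Hypothetical helper for illustration; it is not a Loud ML function.
    """
    step = interval.total_seconds()
    seen = set()
    for ts in timestamps:
        if start <= ts < end:
            # Index of the bucket this point falls into
            seen.add(int((ts - start).total_seconds() // step))
    total = int((end - start).total_seconds() // step)
    return [start + i * interval for i in range(total) if i not in seen]


# Made-up sample: one point every 5 minutes for an hour, with 12:15 dropped
start = datetime(2020, 8, 5, 12, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=1)
interval = timedelta(minutes=5)
points = [start + i * interval for i in range(12) if i != 3]

missing = find_missing_buckets(points, start, end, interval)
for bucket in missing:
    print("missing bucket:", bucket.isoformat())
```

If the list of missing buckets is long, widening the training range or picking a window with fewer gaps is worth trying before the next training run.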