toni-moreno opened this issue 4 years ago
It would be great if each log line included the job id that generated it, so we could see errors/stack traces and identify which process is crashing.
In the following lines there are two jobs, one training and the other predicting, and there is no way for someone who doesn't know the code to tell which one is responsible for the stack trace error (a possible approach is sketched after the log excerpt).
INFO:root:job[9d29c201-4ad1-4d4b-9139-790a011591fd] starting, nice=5
INFO:root:job[b8cf6a90-9726-45f2-b004-5114599b60aa] starting, nice=0
INFO:root:connecting to influxdb on influxdb:8086, using database 'loudml'
INFO:root:predict(swarm@cpu@95percentile@usage_active@host_worker2_cpu_cpu-total@time@5m) range=2020-08-06T12:40:00.000Z-2020-08-06T12:45:00.000Z
INFO:root:train(swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@5m) range=2020-08-05T12:40:00.000Z-2020-08-06T12:45:00.000Z train_size=0.670000 batch_size=64 epochs=100)
INFO:root:connecting to influxdb on influxdb:8086, using database 'swarm'
INFO:root:missing data: field 'usage_active', metric 'mean', bucket: 2020-08-05T12:40:00Z
INFO:root:missing data: field 'usage_active', metric 'mean', bucket: 2020-08-05T19:15:00Z
INFO:root:found 289 time periods
INFO:hyperopt.tpe:tpe_transform took 0.004817 seconds
INFO:hyperopt.tpe:TPE using 0 trials
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.003048 seconds
INFO:hyperopt.tpe:TPE using 1/1 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002605 seconds
INFO:hyperopt.tpe:TPE using 2/2 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002599 seconds
INFO:hyperopt.tpe:TPE using 3/3 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002649 seconds
INFO:hyperopt.tpe:TPE using 4/4 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002580 seconds
INFO:hyperopt.tpe:TPE using 5/5 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002577 seconds
INFO:hyperopt.tpe:TPE using 6/6 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002516 seconds
INFO:hyperopt.tpe:TPE using 7/7 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002593 seconds
INFO:hyperopt.tpe:TPE using 8/8 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
INFO:hyperopt.tpe:tpe_transform took 0.002615 seconds
INFO:hyperopt.tpe:TPE using 9/9 trials with best loss inf
WARNING:root:iteration failed: insufficient validation data
ERROR:root:
Traceback (most recent call last):
File "/opt/venv/lib/python3.7/site-packages/loudml/worker.py", line 53, in run
res = getattr(self, func_name)(*args, **kwargs)
File "/opt/venv/lib/python3.7/site-packages/loudml/worker.py", line 101, in train
**kwargs
File "/opt/venv/lib/python3.7/site-packages/loudml/donut.py", line 1091, in train
abnormal=abnormal,
File "/opt/venv/lib/python3.7/site-packages/loudml/donut.py", line 843, in _train_on_dataset
rstate=fmin_state,
File "/opt/venv/lib/python3.7/site-packages/hyperopt/fmin.py", line 403, in fmin
show_progressbar=show_progressbar,
File "/opt/venv/lib/python3.7/site-packages/hyperopt/base.py", line 651, in fmin
show_progressbar=show_progressbar)
File "/opt/venv/lib/python3.7/site-packages/hyperopt/fmin.py", line 426, in fmin
return trials.argmin
File "/opt/venv/lib/python3.7/site-packages/hyperopt/base.py", line 600, in argmin
best_trial = self.best_trial
File "/opt/venv/lib/python3.7/site-packages/hyperopt/base.py", line 591, in best_trial
raise AllTrialsFailed
hyperopt.exceptions.AllTrialsFailed
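For reference, one way to tag every line with its job id is a `logging.LoggerAdapter` wrapped around the logger once the job id is known. This is only a minimal sketch of the idea; `JobLogAdapter` and the usage below are hypothetical and not existing loudml code:

```python
import logging

logging.basicConfig(level=logging.INFO)

class JobLogAdapter(logging.LoggerAdapter):
    """Prefix every record with the job id passed in `extra` (hypothetical helper)."""
    def process(self, msg, kwargs):
        return "job[%s] %s" % (self.extra["job_id"], msg), kwargs

# Hypothetical usage: wrap the logger when the job starts, then log as usual.
log = JobLogAdapter(logging.getLogger(), {"job_id": "9d29c201-4ad1-4d4b-9139-790a011591fd"})
log.info("connecting to influxdb on influxdb:8086, using database 'loudml'")
# -> INFO:root:job[9d29c201-4ad1-4d4b-9139-790a011591fd] connecting to influxdb on influxdb:8086, using database 'loudml'
```

With something like this, the stack trace above would carry the job id of the training job that raised it.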
Hello @regel.
It would also be great if the output log lines had timestamps, so we can see when each event happened.
I have deployed loudml in a swarm and it has been restarted several times in the last few hours.
The output log doesn't tell us when those restarts happened, so it is difficult to correlate loudml events with external problems in the platform.
Thank you very much.
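In case it helps, timestamps could be added without touching the individual log calls, just by changing the root logging format. This is a sketch under the assumption that loudml uses the standard `logging` module (which the `INFO:root:` prefix suggests), not a description of how loudml is actually configured:

```python
import logging

# Hypothetical configuration: include a timestamp in every record.
logging.basicConfig(
    format="%(asctime)s %(levelname)s:%(name)s:%(message)s",
    level=logging.INFO,
)

logging.info("job[b8cf6a90-9726-45f2-b004-5114599b60aa] starting, nice=0")
# -> 2020-08-06 12:40:00,123 INFO:root:job[b8cf6a90-9726-45f2-b004-5114599b60aa] starting, nice=0
```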