rodrigo-albuquerque opened this issue 4 years ago
I'm running into pretty much the exact same issue with my Kubernetes-deployed version, but with an Elasticsearch bucket.
I initially thought it was a DB issue as well, but I no longer think so. I realized there was an error in my bucket configuration, but correcting it didn't change the error at all. I also don't see why failing to cancel a job should have anything to do with the DB.
Have you had any luck figuring it out?
@regel any suggestions for next troubleshooting steps?
To add to this (not sure if it's related): when I shell into the pod in Kubernetes, no user is assigned and the prompt shows "I have no name!@loudml-0:/". More than likely I just need to refactor the Helm chart for 1.6.0, but I figured I'd add it in case it helps.
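In case it's useful to anyone else, this is how I check from outside the pod which UID the container runs as and whether it has an /etc/passwd entry (pod name taken from the prompt above; adjust the namespace as needed):
# Show the UID/GID the container runs as; "I have no name!" usually means
# that UID has no entry in the image's /etc/passwd
kubectl exec -it loudml-0 -- id
kubectl exec -it loudml-0 -- cat /etc/passwd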
As I mentioned in my previous comment, changing the database info doesn't seem to have had any effect. Is there any way I can test this to rule it out? I validated that the credentials are correct, but I'm not sure how to confirm that loudml is actually able to reach the bucket.
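The closest I've gotten to ruling the bucket out is hitting its Elasticsearch endpoint from inside the loudml pod with the same credentials (host and user below are the ones from my bucket config). This only proves reachability and auth, not that loudml itself uses them correctly:
# Assumes curl is available in the image; -k mirrors verify_ssl=false.
# Replace PASSWORD with the real dbuser_password.
kubectl exec -it loudml-0 -- curl -sk -u 'svc_loudml:PASSWORD' \
  'https://k8s-cluster-es.namespace.cluster.local:9200/_cluster/health?pretty'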
Adding my replication steps to this ticket; I'm getting the same result as the OP:
Using the 1.6.0 Docker image, I am unable to start or cancel jobs without an error.
Below are the creation steps for each resource and their output; the stack trace from the logs is below as well. The job stays in "waiting" until I try to cancel it. When I try to cancel it, it errors out and then stays in "canceling" until I restart.
Model Creation:
curl -XPOST localhost:8077/models --data-binary "@/home/me/Documents/deny-model.json" -H "Content-Type: application/json"
{
"bucket_interval": "1m",
"default_datasource": "firewall-count-index",
"features": [
{
"default": 0,
"field": "Deny_count",
"metric": "avg",
"name": "avg_denies"
}
],
"grace_period": 0,
"interval": 60,
"max_evals": 10,
"max_threshold": 90,
"min_threshold": 50,
"name": "avg-denied-counts-redo",
"offset": 30,
"span": 5,
"timestamp_field": "@timestamp",
"type": "donut"
}
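Before training, I sanity-check that the model was registered, assuming the API also accepts GET on /models and /models/<name> (adjust if it doesn't):
# List all models, then read back the one just created
curl -s localhost:8077/models | jq .
curl -s localhost:8077/models/avg-denied-counts-redo | jq .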
Bucket Creation:
curl -XPOST localhost:8077/buckets --data-binary "@/home/me/Documents/firewallbucket.json" -H "Content-Type: application/json"
{
"addr": "https://k8s-cluster-es.namespace.cluster.local:9200",
"dbuser": "svc_loudml",
"doc_type": "_doc",
"index": "datacount.all-*",
"name": "firewall-count-index",
"type": "elasticsearch",
"use_ssl": true,
"verify_ssl": false
"dbuser_password": ""
}
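To rule out payload formatting problems, I run the JSON files through jq before posting them; jq exits non-zero on invalid JSON:
# Validate the payload files locally before POSTing them
jq . /home/me/Documents/firewallbucket.json > /dev/null && echo "bucket JSON ok"
jq . /home/me/Documents/deny-model.json > /dev/null && echo "model JSON ok"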
Train Model:
curl -X POST localhost:8077/models/avg-denied-counts-redo/_train?from=now-30d
[1] 119731
internal server error
[1]+  Done
Check job status:
curl localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb | jq
{
"duration": 52.941789,
"id": "e7176a7e-3eca-4c6c-b525-6a61f494f7cb",
"model": "avg-denied-counts-redo",
"result": null,
"start_date": "Tue Sep 1 19:04:09 2020",
"start_timestamp": 1598987049.511435,
"state": "waiting",
"type": "training"
}
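For anyone reproducing this: the state never leaves "waiting" on its own. I watch it with a small polling loop (assumes jq; substitute the job id from above):
# Poll the job state every 10s until it leaves "waiting"
JOB=e7176a7e-3eca-4c6c-b525-6a61f494f7cb
while [ "$(curl -s localhost:8077/jobs/$JOB | jq -r .state)" = "waiting" ]; do
  sleep 10
done
curl -s localhost:8077/jobs/$JOB | jq .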
Stack trace:
[2020-09-01 19:04:09,511] ERROR in app: Exception on /models/avg-denied-counts-redo/_train [POST]
Traceback (most recent call last):
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/vendor/lib/python3.5/site-packages/flask_restful/__init__.py", line 269, in error_router
return original_handler(e)
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/opt/vendor/lib/python3.5/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 932, in model_train
job.start(g_config)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 1430, in start
kwargs=self.kwargs,
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/process.py", line 84, in schedule
self._check_pool_state()
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
ERROR:loudml.server:Exception on /models/avg-denied-counts-redo/_train [POST]
Traceback (most recent call last):
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/vendor/lib/python3.5/site-packages/flask_restful/__init__.py", line 269, in error_router
return original_handler(e)
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/opt/vendor/lib/python3.5/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 932, in model_train
job.start(g_config)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 1430, in start
kwargs=self.kwargs,
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/process.py", line 84, in schedule
self._check_pool_state()
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
127.0.0.1 - - [2020-09-01 19:04:09] "POST /models/avg-denied-counts-redo/_train?from=now-30d HTTP/1.1" 500 156 0.007398
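That "Unexpected error within the Pool" makes me think the Pebble worker pool inside the server is already dead by the time the job gets scheduled. A rough check for whether any worker processes came up at all (assumes ps is available in the image; process names may differ):
# List python processes in the pod; a healthy server should show the main
# loudmld process plus its pool worker children
kubectl exec loudml-0 -- ps -ef | grep -i python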
Attempt to cancel the job:
curl -X POST localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb/_cancel
internal server error
curl localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb | jq
{
"duration": 1229.536151,
"id": "e7176a7e-3eca-4c6c-b525-6a61f494f7cb",
"model": "avg-denied-counts-redo",
"result": null,
"start_date": "Tue Sep 1 19:04:09 2020",
"start_timestamp": 1598987049.511435,
"state": "canceling",
"type": "training"
}
No stack trace is generated when the job fails to cancel.
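The only way I've found to clear the stuck job is to restart the pod (resource names below are assumptions based on my deployment; adjust for yours):
# Recreate the pod so the in-memory job list is reset
kubectl delete pod loudml-0
# or, if it's managed by a StatefulSet named "loudml":
kubectl rollout restart statefulset/loudml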
Job is initially waiting; then I try to cancel it. When I check, it's actually trying to cancel it, but it's stuck at cancelling forever. For reference, the screenshots of these states aren't reproduced here.
I suspect it may have something to do with the database. How do I check whether it's actually connected and writing to the DB?
Externally, I've already confirmed that I can access my InfluxDB, write data to it, etc.
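A couple of external checks along those lines against the InfluxDB 1.x HTTP API (host, port, and credentials below are placeholders):
# Liveness check: should return HTTP 204 No Content
curl -i http://influxdb-host:8086/ping
# Auth and query path: list databases with the same credentials loudml uses
curl -G http://influxdb-host:8086/query -u user:password --data-urlencode "q=SHOW DATABASES"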
I also see this weird Python exception in the logs when I try to run the training (traceback not reproduced here).
Any pointers would be appreciated. I also tried the nightly build and 1.5.0; they all show the same error.