HTTP 500 Internal server error when cancelling a job, new jobs stuck in waiting on Docker

rodrigo-albuquerque commented 4 years ago

> version
1.6.0
> list-models
bah
mymodel

Job is initially waiting:

> list-jobs
id                                    name               state      x/N  time_left  duration
2148faaf-b7b8-436a-82d0-d9b91d74c9ea  training(bah)      waiting                    496.464 <-----
4dd2fb2c-b629-4aad-93a5-ce4df0626121  training(mymodel)  canceling                  1687.226
6eacbedc-c289-4f06-9360-a05e199ac5eb  training(mymodel)  canceling                  1783.785
8f411abc-7a96-440b-b550-505c83f021a5  training(bah)      waiting                    684.152
da80bb65-f1e1-4d42-8c89-1feb49190bf0  training(mymodel)  canceling                  1817.056

Then I try to cancel it:

> cancel-job 2148faaf-b7b8-436a-82d0-d9b91d74c9ea
ERROR:root:TransportError(500, 'internal server error')

When I check, it's actually trying to cancel it but it's stuck at cancelling for ever:

> list-jobs
id                                    name               state      x/N  time_left  duration
2148faaf-b7b8-436a-82d0-d9b91d74c9ea  training(bah)      canceling                  1134.971 <----
4dd2fb2c-b629-4aad-93a5-ce4df0626121  training(mymodel)  canceling                  2325.733
6eacbedc-c289-4f06-9360-a05e199ac5eb  training(mymodel)  canceling                  2422.291
8f411abc-7a96-440b-b550-505c83f021a5  training(bah)      waiting                    1322.659
da80bb65-f1e1-4d42-8c89-1feb49190bf0  training(mymodel)  canceling                  2455.562

For reference:

> list-models
bah
mymodel
> show-model mymodel
- settings:
    bucket_interval: 10s
    default_bucket: influxdb
    features:
    - anomaly_type: low_high
      default: null
      field: average_response_ms
      io: io
      match_all:
      - tag: url
        value: host1.example.com
      measurement: ping
      metric: mean
      name: myfeature
    grace_period: 0
    interval: 60s
    max_evals: 20
    max_threshold: 0
    min_threshold: 0
    name: mymodel
    offset: 10s
    seasonality:
      daytime: false
      weekday: false
    span: 10
    type: donut
  training:
    job_id: eebdaf5e-bdad-4936-a8aa-f8595b078120
    state: failed
>

I suspect it may have something to do with the database? How do I check if it's actually connected and writing to DB?

> list-buckets
influxdb
loudml
> show-bucket loudml
- addr: myinfluxdb:8086
  database: loudml
  dbuser: admin
  name: loudml
  type: influxdb

>

Externally, I've already confirmed I can access myinfluxdb, I can write data, etc.

I also see this weird Python exception in the logs when I try to run the training:

10.103.130.201 - - [2020-08-19 19:02:11] "GET /models HTTP/1.1" 200 1205 0.003390
[2020-08-19 19:02:12,667] ERROR in app: Exception on /models/bah/_train [POST]
Traceback (most recent call last):
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/venv/lib/python3.7/site-packages/flask_restful/__init__.py", line 269, in error_router
    return original_handler(e)
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 932, in model_train
    job.start(g_config)
  File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 1430, in start
    kwargs=self.kwargs,
  File "/opt/venv/lib/python3.7/site-packages/pebble/pool/process.py", line 84, in schedule
    self._check_pool_state()
  File "/opt/venv/lib/python3.7/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
    raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
ERROR:loudml.server:Exception on /models/bah/_train [POST]
Traceback (most recent call last):
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/venv/lib/python3.7/site-packages/flask_restful/__init__.py", line 269, in error_router
    return original_handler(e)
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 932, in model_train
    job.start(g_config)
  File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 1430, in start
    kwargs=self.kwargs,
  File "/opt/venv/lib/python3.7/site-packages/pebble/pool/process.py", line 84, in schedule
    self._check_pool_state()
  File "/opt/venv/lib/python3.7/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
    raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
10.103.130.201 - - [2020-08-19 19:02:12] "POST /models/bah/_train?from=2020-08-19T18:59:00.000Z&to=2020-08-19T18:59:00.000Z HTTP/1.1" 500 156 0.005088
10.103.130.201 - - [2020-08-19 19:02:12] "GET /jobs/4c17ee83-3ebf-419f-8e56-3aa6808b6f42 HTTP/1.1" 404 134 0.000463

Any pointers would be appreciated. I also tried nightly, 1.5.0, they all show the same error.

joshbasho commented 4 years ago

I'm running into pretty much the exact same issue with my kubernetes deployed version, but with an Elasticsearch bucket.

I initially thought it was a DB thing as well, but no longer think so. I realized there was an error in my bucket configuration, but updating it to be correct didn't change the error at all. I also don't think failing when cancelling a job should have anything to do with the db.

Have you had any luck figuring it out?

joshbasho commented 4 years ago

@regel any suggestions for next troubleshooting steps?

To add to this, not sure if it's related, but when I shell into the pod in kubernetes, I don't have a user assigned I have no name!@loudml-0:/. More than likely I just need to refactor the helm chart for 1.6.0, but figured I'd add it in case it helps.

As I mentioned in my previous comment, changing the database info doesn't seem to have had any effect. Is there anyway I can test this to rule it out? I validated credentials are correct, but not sure how to get confirmation loudml is able to access.

Adding my replication to this ticket getting same result as OP:

using the 1.6.0 docker image, I am unable to start or cancel processes without error

Below are all the creation steps for each resource and the output. Stack trace from the logs is below as well. The job will stay in waiting until I try to cancel it. When I try to cancel it, it errors and then stays in canceling until I restart.

Model Creation:

curl -XPOST localhost:8077/models --data-binary "@/home/me/Documents/deny-model.json" -H "Content-Type: application/json"

{
    "bucket_interval": "1m",
    "default_datasource": "firewall-count-index",
    "features": [
         {
           "default": 0,
          "field": "Deny_count",
           "metric": "avg",
           "name": "avg_denies"
         }
    ],
    "grace_period": 0,
    "interval": 60,
    "max_evals": 10,
    "max_threshold": 90,
    "min_threshold": 50,
    "name": "avg-denied-counts-redo",
    "offset": 30,
    "span": 5,
    "timestamp_field": "@timestamp",
    "type": "donut"
}

Bucket Creation:

curl -XPOST localhost:8077buckets --data-binary "@/home/me/Documents/firewallbucket.json" -H "Content-Type: application/json"

{
    "addr": "https://k8s-cluster-es.namespace.cluster.local:9200",
    "dbuser": "svc_loudml",
    "doc_type": "_doc",
    "index": "datacount.all-*",
    "name": "firewall-count-index",
    "type": "elasticsearch",
    "use_ssl": true,
    "verify_ssl": false
    "dbuser_password": ""
}

Train Model

curl -X POST localhost:8077/models/avg-denied-counts-redo/_train?from=now-30d
[1] 119731
internal server error[1]+  Done

Check job to see status

curl localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb | jq
{
    "duration": 52.941789,
    "id": "e7176a7e-3eca-4c6c-b525-6a61f494f7cb",
    "model": "avg-denied-counts-redo",
    "result": null,
    "start_date": "Tue Sep  1 19:04:09 2020",
    "start_timestamp": 1598987049.511435,
    "state": "waiting",
    "type": "training"
}

stack trace

[2020-09-01 19:04:09,511] ERROR in app: Exception on /models/avg-denied-counts-redo/_train [POST]                                                                                                                                      
 Traceback (most recent call last):                                                                                                                                                                                                     
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app                                                                                                                                                  
     response = self.full_dispatch_request()                                                                                                                                                                                            
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1951, in full_dispatch_request                                                                                                                                     
     rv = self.handle_user_exception(e)                                                                                                                                                                                                 
   File "/opt/vendor/lib/python3.5/site-packages/flask_restful/__init__.py", line 269, in error_router                                                                                                                                  
     return original_handler(e)                                                                                                                                                                                                         
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1820, in handle_user_exception                                                                                                                                     
     reraise(exc_type, exc_value, tb)                                                                                                                                                                                                   
   File "/opt/vendor/lib/python3.5/site-packages/flask/_compat.py", line 39, in reraise                                                                                                                                                 
     raise value                                                                                                                                                                                                                        
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request                                                                                                                                     
     rv = self.dispatch_request()                                                                                                                                                                                                       
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request                                                                                                                                          
     return self.view_functions[rule.endpoint](**req.view_args)                                                                                                                                                                         
   File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 932, in model_train                                                                                                                                            
     job.start(g_config)                                                                                                                                                                                                                
   File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 1430, in start                                                                                                                                                 
     kwargs=self.kwargs,                                                                                                                                                                                                                
   File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/process.py", line 84, in schedule                                                                                                                                          
     self._check_pool_state()                                                                                                                                                                                                           
   File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state                                                                                                                               
     raise RuntimeError('Unexpected error within the Pool')                                                                                                                                                                             
 RuntimeError: Unexpected error within the Pool                                                                                                                                                                                         
 ERROR:loudml.server:Exception on /models/avg-denied-counts-redo/_train [POST]                                                                                                                                                          
 Traceback (most recent call last):                                                                                                                                                                                                     
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app                                                                                                                                                  
     response = self.full_dispatch_request()                                                                                                                                                                                            
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1951, in full_dispatch_request                                                                                                                                     
     rv = self.handle_user_exception(e)                                                                                                                                                                                                 
   File "/opt/vendor/lib/python3.5/site-packages/flask_restful/__init__.py", line 269, in error_router                                                                                                                                  
     return original_handler(e)                                                                                                                                                                                                         
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1820, in handle_user_exception                                                                                                                                     
     reraise(exc_type, exc_value, tb)                                                                                                                                                                                                   
   File "/opt/vendor/lib/python3.5/site-packages/flask/_compat.py", line 39, in reraise                                                                                                                                                 
     raise value                                                                                                                                                                                                                        
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request                                                                                                                                     
     rv = self.dispatch_request()                                                                                                                                                                                                       
   File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request                                                                                                                                          
     return self.view_functions[rule.endpoint](**req.view_args)                                                                                                                                                                         
   File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 932, in model_train                                                                                                                                            
     job.start(g_config)                                                                                                                                                                                                                
   File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 1430, in start                                                                                                                                                 
     kwargs=self.kwargs,                                                                                                                                                                                                                
   File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/process.py", line 84, in schedule
 self._check_pool_state()                                                                                                                                                                                                           
   File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state                                                                                                                               
     raise RuntimeError('Unexpected error within the Pool')                                                                                                                                                                             
 RuntimeError: Unexpected error within the Pool                                                                                                                                                                                         
 127.0.0.1 - - [2020-09-01 19:04:09] "POST /models/avg-denied-counts-redo/_train?from=now-30d HTTP/1.1" 500 156 0.007398

Attempt to cancel job

curl -X POST localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb/_cancel
internal server error

curl localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb | jq
{
    "duration": 1229.536151,
    "id": "e7176a7e-3eca-4c6c-b525-6a61f494f7cb",
    "model": "avg-denied-counts-redo",
    "result": null,
    "start_date": "Tue Sep  1 19:04:09 2020",
    "start_timestamp": 1598987049.511435,
    "state": "canceling",
    "type": "training"
  }

No stack trace is generated when the job fails to cancel.

regel / loudml

HTTP 500 Internal server error when cancelling a job, new jobs stuck in waiting on Docker #406