scalyr / scalyr-agent-2

The source code for Scalyr Agent 2, the daemon process Scalyr customers run on their servers to collect metrics and logs.
Apache License 2.0

[2.1.40] AttributeError: 'AddEventsTask' object has no attribute '_CopyingManagerWorkerSession__receive_response_status' #1258

Closed ju2wheels closed 4 months ago

ju2wheels commented 6 months ago

The scalyr agent hits the exception below and stops logging to scalyr:

2024-04-15 15:50:17.042Z ERROR [core] [scalyr_agent.scalyr_logging:707] Failed while attempting to scan and transmit logs :stack_trace:
  Traceback (most recent call last):
    File "/usr/share/scalyr-agent-2/py/scalyr_agent/copying_manager/worker.py", line 494, in run
      in self.__pending_add_events_task.__receive_response_status
  AttributeError: 'AddEventsTask' object has no attribute '_CopyingManagerWorkerSession__receive_response_status'
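
The odd attribute name in the error is consistent with Python's private name mangling: an identifier beginning with two underscores, written inside a class body, is rewritten to include that class's name, even when it is accessed on an object of a different class. A minimal sketch of the effect follows; both classes are simplified stand-ins written for illustration, not the agent's actual code, and the single-underscore attribute is an assumption made only to show the contrast.

```python
class AddEventsTask:
    """Hypothetical stand-in for the agent's task object."""

    def __init__(self):
        # Single leading underscore: not mangled, so other classes can read it.
        self._receive_response_status = "success"


class CopyingManagerWorkerSession:
    """Hypothetical stand-in for the class named in the traceback."""

    def __init__(self):
        self.__pending_add_events_task = AddEventsTask()

    def read_mangled(self):
        # Inside this class, `.__receive_response_status` is compiled as
        # `._CopyingManagerWorkerSession__receive_response_status`,
        # which the AddEventsTask object does not define.
        return self.__pending_add_events_task.__receive_response_status

    def read_unmangled(self):
        # A single-underscore name is looked up exactly as written.
        return self.__pending_add_events_task._receive_response_status


session = CopyingManagerWorkerSession()
try:
    session.read_mangled()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

print(session.read_unmangled())
```

The first access raises an AttributeError naming the mangled attribute, as in the log above, while the second access succeeds.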

We overlay the following configuration files into the scalyr/scalyr-agent-docker-syslog:2.1.40-alpine Docker image:

$ cat etc/scalyr-agent-2/agent.json 
// Configuration for the Scalyr Agent while running on Docker. For help:
//
// https://www.scalyr.com/help/scalyr-agent-2

{
    // Note:  It is assumed that another file such as `agent.d/api-key.json`
    // will contain the api key for the user's Scalyr account.

    // No need for system and agent monitors.  The docker plugin will gather
    // metrics on the container running the agent.
    "implicit_metric_monitor": false,
    "implicit_agent_process_metrics_monitor": false,
    "compressionType": "bz2"
}

$ cat etc/scalyr-agent-2/agent.d/api_key.json 
{
  "import_vars": ["SCALYR_API_KEY"],
  "api_key": "$SCALYR_API_KEY"
}

$ cat etc/scalyr-agent-2/agent.d/rsyslog.json 
{
  "monitors": [
    {
      "module":                    "scalyr_agent.builtin_monitors.syslog_monitor",
      "protocols":                 "tcp:602",
      "accept_remote_connections": true,
      "message_size_can_exceed_tcp_buffer": true,
      "stop_agent_on_failure": true
    }
  ]
}

$ cat etc/scalyr-agent-2/agent.d/rsyslog.json
{
  "import_vars": ["SCALYR_DEVICE", "SCALYR_DMA", "SCALYR_DEVICE_DOMINANCE", "SCALYR_DEVICE_METADATA_GROUPS", "SCALYR_ENVIRONMENT", "SCALYR_GLOBAL_PREFIX", "SCALYR_HHID", "SCALYR_HOST_NAME", "SCALYR_PANEL"],
  "server_attributes": {
    "device": "$SCALYR_DEVICE",
    "dma": "$SCALYR_DMA",
    "device_metadata_groups": "$SCALYR_DEVICE_METADATA_GROUPS",
    "dominance":"$SCALYR_DEVICE_DOMINANCE",
    "environment": "$SCALYR_ENVIRONMENT",
    "globalPrefix" :"$SCALYR_GLOBAL_PREFIX",
    "hhId": "$SCALYR_HHID",
    "panel": "$SCALYR_PANEL",
    "serverHost": "$SCALYR_HOST_NAME"
  }
}

$ cat etc/scalyr-agent-2/agent.d/logs.json 
{
  "logs": [
    {
      "path": "/var/log/scalyr-agent-2/agent.log",
      "attributes": {
        "parser": "scalyrAgentLog"
      }
    },
    {
      "path": "/var/log/scalyr-agent-2/agent_syslog.log",
      "attributes": {
        "parser": "agentSyslog"
      }
    },
    {
      "path": "/var/log/scalyr-agent-2/containers/*.log",
      "attributes": {
        "parser": "agentSyslogDocker"
      }
    }
  ],
  "max_line_size": 64000
}

$ cat etc/scalyr-agent-2/agent.d/docker.json
{
  "monitors": [
    {
      "docker_regex": "^.*([a-z0-9]{12})\\[\\d+\\]: ",
      "mode":         "docker",
      "module":       "scalyr_agent.builtin_monitors.syslog_monitor",
      "protocols":    "tcp:601",
      "message_size_can_exceed_tcp_buffer": true,
      "stop_agent_on_failure": true
    },
    {
      "module":   "scalyr_agent.builtin_monitors.docker_monitor",
      "log_mode": "syslog",
      "report_container_metrics": false,
      "stop_agent_on_failure": true
    }
  ]
}

We had hoped that setting "stop_agent_on_failure": true would cause the scalyr agent to exit entirely, so that the Docker container would exit and be restarted according to the Docker settings it was started with. Instead, the Docker container continues to run and all log forwarding stops. What can we do to force the scalyr agent service to exit on these types of errors? We have had this issue sporadically for over a year, including on older versions from before stop_agent_on_failure became available.
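
One possible stopgap, sketched here purely as an assumption and not as a feature of the agent, is an external check that scans the tail of the agent log for the error string and reports a nonzero status that a supervisor could act on. The function names, the 200-line window, and the threshold are all hypothetical; the log path is the one from logs.json above.

```python
from collections import deque

# Marker copied from the traceback in this issue.
ERROR_MARKER = (
    "object has no attribute "
    "'_CopyingManagerWorkerSession__receive_response_status'"
)


def count_recent_errors(lines, marker=ERROR_MARKER, window=200):
    """Count how many of the last `window` lines contain `marker`."""
    recent = deque(lines, maxlen=window)  # keeps only the final `window` lines
    return sum(1 for line in recent if marker in line)


def check_agent_log(path, threshold=1):
    """Return 1 (unhealthy) if the marker appears at least `threshold` times
    in the tail of the log at `path`, else 0."""
    try:
        with open(path, "r", errors="replace") as handle:
            hits = count_recent_errors(handle)
    except FileNotFoundError:
        return 0  # no log yet; treated as healthy in this sketch
    return 1 if hits >= threshold else 0


# Made-up sample lines, only to show the counting behaviour.
sample = [
    "unrelated line",
    "  AttributeError: 'AddEventsTask' object has no attribute "
    "'_CopyingManagerWorkerSession__receive_response_status'",
    "another unrelated line",
]
sample_hits = count_recent_errors(sample)
print(sample_hits)
```

A wrapper could call check_agent_log("/var/log/scalyr-agent-2/agent.log") and stop the agent process when it returns 1. If the log file survives a restart, old entries would keep matching, so a real check would also need to compare timestamps or rotate the log.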

weilliu commented 6 months ago

@ju2wheels The support team will review the issue and open a ticket in the support portal for ongoing troubleshooting conversation.

ericla413 commented 6 months ago

One added note about this: when a system starts to get this error, it then happens repeatedly, tens of thousands of times over days, until the server is restarted.

These systems are in places with varying and at times unreliable home networks. In any given month perhaps 1% of the systems may have this issue but it's often different ones.

jmakar-s1 commented 4 months ago

We appreciate your patience with this. It is now fixed and will be included in the next release.

ju2wheels commented 2 months ago

@jmakar-s1 Any update on when this will be included in a release for the Docker image of scalyr-agent2 so that we can test this fix?

jmakar-s1 commented 2 months ago

It will be part of the 2.2.17 release; the estimated date is Sep 2. cc @alesnovak-s1

ju2wheels commented 2 months ago

@jmakar-s1 ok, thank you