Log collection and aggregation

felipemontoya commented 1 year ago

During the latest meeting we reviewed @gabor-boros's answer at #26. Most missing features had a ticket covering them, but log collection did not.

The situation is:

There are some tools to collect and aggregate logs inside a namespace where Tutor is already installed. Logstash and Vector are common alternatives.
For the umbrella portions of the cluster (the charts and pods that run in the global namespace), we don't yet have anything for log collection.

The question remains open whether we want or need a specific tool for that, and whether the participants of this repo are interested in building one.

On the plus side, we could have a tool that makes handling many instances simpler. The con is that we would be splitting the effort that could otherwise go into improving the tools for log collection within an individual namespace.

I personally have not taken a side on any of the options, but we need a place where they can be discussed.

felipemontoya commented 5 months ago

@Ian2012 I know we are storing logs for some installations that want to start Aspects with some data from before Redwood. Could you please share in this context how we are doing that?

Ian2012 commented 5 months ago

In production, we are using Vector deployed with a Helm chart, with a sink configuration that saves all the logs to an S3 bucket, splitting them per namespace/kind/application. Would that be a suitable solution for this problem?

Eventually, once Aspects is configured, we can trigger a job that reads the tracking logs from <namespace>/tracking/lms|lms-worker and does the proper backfill.
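For illustration only, the namespace/kind/application layout described above could be sketched as follows. The actual Vector sink configuration is not shown in this thread, so the function name `s3_key_prefix` and its validation rules are assumptions, not the deployed setup.

```python
def s3_key_prefix(namespace: str, kind: str, application: str) -> str:
    """Build an object key prefix of the form <namespace>/<kind>/<application>/.

    Hypothetical helper: it only mirrors the layout described in the
    comment above, not the real sink configuration.
    """
    parts = [namespace, kind, application]
    for part in parts:
        # Reject empty segments and embedded slashes so that each log
        # stream maps to exactly one three-level prefix.
        if not part or "/" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return "/".join(parts) + "/"


if __name__ == "__main__":
    # A backfill job for one instance would list objects under these prefixes.
    for app in ("lms", "lms-worker"):
        print(s3_key_prefix("my-instance", "tracking", app))
```

With a layout like this, a backfill job only needs the namespace to find every tracking log for an instance.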

Ian2012 commented 5 months ago

Another solution that I see as feasible is to store the tracking log data in ClickHouse using Vector, to have quicker backfills on Aspects and an out-of-the-box backup solution for tracking logs. This is nothing new, as Cairn performs a similar operation by storing all tracking logs in ClickHouse via Vector.

gabor-boros commented 4 months ago

@bradenmacdonald and @Agrendalath inviting you to this conversation. I think both solutions could be feasible, though you may have better insights here. Especially @Agrendalath as I know one of your clients is using tracking logs.

bradenmacdonald commented 4 months ago

@pomegranited might be a better person to ask :) I don't have much insight on this topic.

pomegranited commented 4 months ago

Hi @felipemontoya, thank you for starting the discussion! I think we need to define some scope and goals before making technology decisions.

Is this about general Open edX log collection/aggregation, like for monitoring instance health and investigating incidents? Or is it just about storing tracking logs?

How much of a solution should we provide? If we're providing log collection, do we need parsing, monitoring, dashboards, and alerting too?

What solutions are people currently using? What are their pain points?

There's a lot to consider. But we can totally take cues from @bmtcril's Aspects architecture and integrate with suitable open source third-party tools, rather than writing our own.

bmtcril commented 4 months ago

FWIW Aspects can store tracking logs in ClickHouse via Vector now, though I'm not sure when the last time was that we tested it.

I definitely agree that having long term, flexible, rotated log storage for both operational and tracking logs (and potentially xAPI logs) is hugely important. I personally wouldn't mind seeing Vector used for that, but I'm sure site operators have much more valuable insight on any pain points with it.

MoisesGSalas commented 4 months ago

I've seen two common patterns when collecting logs in k8s: a sidecar container that runs alongside the application, and a DaemonSet that runs on every node and mounts /var/log/ from the host.

IIRC Adam Blackwell mentioned that they were using the sidecar approach at 2U.

With @Ian2012, we have tested the DaemonSet approach in a couple of clusters. We installed a global Helm chart for Vector and configured the sinks, sources and transforms.

We retrieve all the logs from certain pods (i.e., those with the annotation app.kubernetes.io/managed-by=tutor), and Cristhian wrote the transform to extract the tracking logs. We push all the logs to S3.
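To make the extraction step concrete, here is a minimal Python sketch of what such a transform has to do. It is not the transform mentioned above (which is not shown in this thread), and it assumes that tracking events appear in the container output as a line containing a `[tracking]` marker followed by a JSON object; both the line format and the name `extract_tracking_event` are assumptions.

```python
import json
import re
from typing import Optional

# Assumed line shape: "<log prefix> [tracking] <more prefix> {json event}".
# The lazy ".*?" skips to the first "{", and the greedy group then runs
# to the last "}" on the line, so nested objects are kept whole.
TRACKING_RE = re.compile(r"\[tracking\].*?(\{.*\})\s*$")


def extract_tracking_event(line: str) -> Optional[dict]:
    """Return the tracking event embedded in a log line, or None."""
    match = TRACKING_RE.search(line)
    if match is None:
        return None  # ordinary application log line
    try:
        event = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # marker present, but the payload is not valid JSON
    return event if isinstance(event, dict) else None


if __name__ == "__main__":
    sample = '2024-01-01 12:00:00 INFO [tracking] logger.py:41 - {"name": "page_view"}'
    print(extract_tracking_event(sample))
```

Lines that return None would stay in the general application log stream, while parsed events would go to the tracking destination.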

We also found that this Vector instance can serve multiple purposes: we can extract and push the tracking logs to S3, but we can also push the standard application logs of the Open edX services to CloudWatch, or even push the logs of other services (ingress-nginx, etc.).

I think with a similar approach we can eventually cover most of this:

Is this about general Open edX log collection/aggregation, like for monitoring instance health and investigating incidents? Or is it just about storing tracking logs?

How much of a solution should we provide? If we're providing log collection, do we need parsing, monitoring, dashboards, and alerting too?

openedx / openedx-k8s-harmony

Log collection and aggregation #32