pangeo-forge / pangeo-forge-gcs-bakery

A repo for building out a pangeo forge bakery in Google Cloud Platform

Route logs to Stackdriver #25

cisaacstern opened this issue 2 years ago

cisaacstern commented 2 years ago

is there a particular service or implementation from another project, which we could use as a model?

Based on this post, one free path would be

Originally posted by @rabernat in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/192#issuecomment-923256978

This will require most/all of https://github.com/pangeo-forge/pangeo-forge-gcs-bakery/issues/19 being completed first.

rabernat commented 2 years ago

This will require most/all of #19 being completed first.

Not sure about that. I think it mostly requires changes here:

https://github.com/pangeo-forge/pangeo-forge-bakery-images/blob/main/images/pangeonotebook-2021.07.17_prefect-0.14.22_pangeoforgerecipes-0.5.0/Dockerfile

If the same docker image is used in all the bakeries, this change only needs to be made in one place.

We would also need to customize the python logging here:

https://github.com/pangeo-forge/pangeo-forge-prefect/blob/a63777913757565209eee446cb1a4093de291b4a/pangeo_forge_prefect/flow_manager.py#L69-L77
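
For concreteness, here is a minimal sketch of what that customization could look like using the google-cloud-logging client library. The helper function and logger name are hypothetical illustrations, not the actual flow_manager code:

```python
# Hypothetical sketch: attach a Google Cloud Logging (Stackdriver) handler
# to the logger used inside a flow. Assumes google-cloud-logging is installed
# and ambient GCP credentials (e.g. the node's service account) are available.
import logging

import google.cloud.logging
from google.cloud.logging.handlers import CloudLoggingHandler


def get_cloud_logger(name: str = "pangeo-forge-recipes") -> logging.Logger:
    """Return a logger whose records are shipped to Cloud Logging."""
    client = google.cloud.logging.Client()  # uses ambient GCP credentials
    handler = CloudLoggingHandler(client, name=name)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


logger = get_cloud_logger()
logger.info("Flow run started")  # appears in the Cloud Logging console
```

Inside a GCP cluster the client can typically pick up ambient credentials, so no key file would need to be baked into the image there.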

rabernat commented 2 years ago

From an exchange on 2i2c slack with @yuvipanda

Ryan Abernathey 5:23 PM Tech question for the DevOps gurus. Pangeo Forge needs a central place to put logs from various bakeries spread around the internet. We are thinking about using a logging SAAS to help with this. I read this article on some of the options: quora.com/What-is-the-best-logging-system-for-a-SaaS-startup Has anyone here ever used a logging SaaS platform? Do you have a recommendation?

yuvipanda 5:27 PM Ryan Abernathey we do this on mybinder.org to produce https://archive.analytics.mybinder.org/. All the various deployments (including ovh and turing on Azure) send logs to GCP's stackdriver, and then we have a pipeline there that sends them off to a public GCP bucket to be exposed to the world. I don't know what these logs are to be used for, so I'm not sure how much you need to care about a querying interface. But for this kinda purpose (just to ingest, store and display), stackdriver has worked very painlessly for us, especially as it is already in a cloud provider we do a lot of our work in.

This suggests a simpler path of just routing the logs to Stackdriver, which is already more or less the default on GCP.

yuvipanda commented 2 years ago

https://github.com/jupyterhub/mybinder.org-deploy/blob/0afa836a69ebd2064e9ff6b896ccbc31c4efab4c/mybinder/values.yaml#L145 is the code we use to tell Python's logging to send everything to Google Cloud Stackdriver. It works from your local machine or any other system. You can authenticate with a Google service account key, which is a simple, easily shippable JSON file.

If you want to capture all of stdout, the path is perhaps slightly different. I'm sure there are lots of suggestions on the internet for how to do it.
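
To make that concrete, a rough sketch (not the mybinder.org config itself) of authenticating with a shippable service account key file and additionally forwarding bare stdout into the logging system; the key file name and logger names are placeholders:

```python
# Hypothetical sketch: authenticate Cloud Logging with a service account key
# file, then forward print()/stdout output through Python logging as well,
# since plain stdout is not captured by the logging handler on its own.
import logging
import sys

import google.cloud.logging

# Key file name is a placeholder; any readable service account JSON works.
client = google.cloud.logging.Client.from_service_account_json("bakery-logging-key.json")
client.setup_logging(log_level=logging.INFO)  # attaches a Cloud Logging handler to the root logger


class StdoutToLogger:
    """Minimal file-like shim that forwards writes to a logger."""

    def __init__(self, logger: logging.Logger, level: int = logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, message: str) -> None:
        message = message.strip()
        if message:
            self.logger.log(self.level, message)

    def flush(self) -> None:  # required by the file-like interface
        pass


sys.stdout = StdoutToLogger(logging.getLogger("stdout"))
print("this line ends up in Cloud Logging too")
```

The stdout shim is the "slightly different path" part: anything written with print() is routed through the same handlers that setup_logging attached.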

cisaacstern commented 2 years ago

Thanks @yuvipanda! We'll let you know how it turns out 😄

sharkinsspatial commented 2 years ago

@cisaacstern @rabernat On the log aggregation side, the bakeries internally run pods with Grafana / Loki to manage log aggregation and search (see https://github.com/pangeo-forge/pangeo-forge-gcs-bakery#to-view-dask-cluster-logs-via-grafana). We currently use tunneling to access these private pod ports, but we can confer with @tracetechnical about what would be involved in making them publicly accessible.

One issue with the Dask/Prefect integration is that each new Flow run dynamically launches a Dask cluster on a per-Flow basis, so our log aggregation solution needs to be able to use our labeling tags to filter and aggregate logs by Prefect Flow run id and worker pod id. That way, logs from a specific worker pod for a specific Flow run attempt can be viewed in order rather than interleaved with logs from other simultaneous flows or other worker pods in the cluster. I'm guessing that Stackdriver may be able to use our k8s labels for filtering purposes, but that question would be better referred to someone besides me 😄
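
If Stackdriver were adopted, that kind of filtering would presumably go through Cloud Logging's filter syntax. A hedged sketch of what a per-flow-run query could look like; the Prefect pod label key is an assumption about how the bakery tags its worker pods, not something Stackdriver or Prefect guarantees:

```python
# Hypothetical sketch: query Cloud Logging (Stackdriver) for the entries
# belonging to one Prefect Flow run on one worker pod, filtering on
# Kubernetes resource fields and pod labels. The flow-run label key below
# is an assumption about how the bakery labels its Dask worker pods.
import google.cloud.logging

client = google.cloud.logging.Client()

flow_run_id = "0c2b1e62-example"   # placeholder Prefect Flow run id
pod_name = "dask-worker-abc123"    # placeholder worker pod name

log_filter = " AND ".join(
    [
        'resource.type="k8s_container"',
        'resource.labels.namespace_name="pangeo-forge"',
        f'labels."k8s-pod/prefect_io/flow-run-id"="{flow_run_id}"',
        f'resource.labels.pod_name="{pod_name}"',
    ]
)

# Entries come back in timestamp order, so one flow run's logs read
# sequentially rather than interleaved with other pods or flows.
for entry in client.list_entries(filter_=log_filter, order_by=google.cloud.logging.ASCENDING):
    print(entry.timestamp, entry.payload)
```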

tracetechnical commented 2 years ago

I would prefer to stick with Loki and native k8s logs unless there is a strong need to deviate from this, as it is by far the simplest solution and has been very effective in my experience.

cisaacstern commented 2 years ago

I would prefer to stick with Loki and native k8s logs

@tracetechnical, how would you recommend making these logs publicly accessible?

Requiring an admin to copy-and-paste logs onto a GitHub thread, as @sharkinsspatial has needed to do here, will not scale.

IIUC, accessing the Grafana / Loki interface currently requires an admin password (based on the docs, which point here)?

tracetechnical commented 2 years ago

This is correct. However, baking a third-party syslog tool into a Docker image feels wrong to me compared with using the native tools. As far as I know, the solution proposed above would need some kind of credential to feed into this service, which would then mean tying it into the flow run itself or fiddling to get it integrated into k8s. Baking the credential itself into the image is a big no-no.

Exposing the logs publicly shouldn't be too hard, but I would advise against doing this with no auth at all due to the possibility of sensitive data leaking out.

rabernat commented 2 years ago

Exposing the logs publicly shouldn't be too hard, but I would advise against doing this with no auth at all due to the possibility of sensitive data leaking out.

We should discuss the tradeoffs here (maybe at our next meeting). In general, we want Pangeo Forge to be as open and transparent as possible. Pangeo-forge-recipes is designed carefully not to log secrets. Recent experience shows that these logs are invaluable for debugging failed recipes. Putting any walls in front of them is going to slow the process for contributors.

yuvipanda commented 2 years ago

In general, we want Pangeo Forge to be as open and transparent as possible. Pangeo-forge-recipes is designed carefully not to log secrets.

We keep the mybinder.org grafana open at grafana.mybinder.org for similar reasons :)

tracetechnical commented 2 years ago

@rabernat The issue is other stuff in the cluster which comes out of the same Loki instance.

I think, at a minimum, we ought to see if we can tie Loki down to only report logs from the pangeo-forge namespace in the cluster, rather than including kube-system and friends. Debugging anything outside that namespace is a sysadmin job anyway.

tracetechnical commented 2 years ago

See https://www.giantswarm.io/blog/grafana-logging-using-loki, item 2. I think it should be fairly trivial to set up drop rules for the system namespaces. The hardest bit will be weaving that config into the existing setup via Helm.

tracetechnical commented 2 years ago

That being said, the point I hadn't considered earlier was the "bucketisation" of the logs for easy access. Although the Loki solution is vendor-agnostic and K8s-native, it means you have a log instance per bakery. I think we need to consider whether this fits long-term, and if not, perhaps we should proceed on the Stackdriver front. If @yuvipanda has a proven pipeline that they can share, that feels like it may be a good alternative route, as long as the baking-in isn't too painful and it can work easily across all the bakeries.

The main benefit we have seen so far with Loki is the ease of compound searching and suchlike. @sharkinsspatial can speak more on this point as he has actually used it.

rabernat commented 2 years ago

I think there is a difference in use case between Pangeo Forge and MyBinder. AFAICT, MyBinder makes the logs available on a daily basis to allow people to do analytics on them.

For Pangeo Forge, our use for the logs is real-time feedback for debugging failed recipes. When a user tests or runs a full recipe and something goes wrong, they need to see the error as quickly as possible (similar to CI). Imagine if you could only see the output of your GitHub workflows 24 hours later... it would be impossible to debug anything. So batched log aggregation will not work here.

From my point of view it doesn't matter if the logs are spread out over different bakeries. All that matters is that users can quickly find the logs for their recipe runs. This could be achieved via the appropriate callbacks from the bakery to GitHub (e.g. a bot posting a URL where the logs can be found). A fancier solution would involve an intermediary, such as the API proposed in https://github.com/pangeo-forge/roadmap/pull/31.