sharnoff opened 1 year ago
@lassizci Can we utilize vector, which we already have inside the VM, to push logs directly to loki, which we also already have?
It's not a good practice from a security perspective to have credentials in the virtual machines. Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute instance level.
It's not a good practice from a security perspective to have credentials in the virtual machines.
But these would be write-only credentials. In that case, the worst we can get is a DoS from too many logs, which we can combat on the receiving end.
Another option is to have a separate instance of vector outside the VM in the pod, configured to pull data from the in-VM instance [1]; a rough config sketch follows at the end of this comment.
Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute instance level.
What do you mean? Are you talking about updating credentials?
Or dependence on a particular observability agent in general? I don't think we can escape that dependence.
1: https://vector.dev/docs/reference/configuration/sources/vector/
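For illustration, a minimal sketch of what that hand-off could look like (both halves shown in one block for brevity; in vector's model the in-VM sink pushes to the pod-side vector source, and all paths, addresses, labels, and the loki endpoint here are placeholders, not anything we actually run):

```toml
# In-VM vector: collect postgres logs and hand them to the pod-side instance.
[sources.postgres_logs]
type = "file"
include = ["/var/log/postgresql/*.log"]   # placeholder path

[sinks.to_pod_vector]
type = "vector"
inputs = ["postgres_logs"]
address = "POD_IP:6000"                   # placeholder: the pod-side collector

# Pod-side vector (outside the VM): receive from the VM and push to loki,
# so the loki credentials never live inside the VM.
[sources.from_vm]
type = "vector"
address = "0.0.0.0:6000"

[sinks.loki]
type = "loki"
inputs = ["from_vm"]
endpoint = "http://loki.observability.svc:3100"   # placeholder
labels.source = "compute-vm"
encoding.codec = "json"
```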
It's not a good practice from a security perspective to have credentials in the virtual machines.
But these would be write-only credentials. In that case, the worst we can get is a DoS from too many logs, which we can combat on the receiving end.
If we skip the collector we control, we cannot deal with DoS at the receiving end. A PostgreSQL escape would potentially give control over labeling, etc.
We also do processing between collecting and sending the logs (relabeling, perhaps deriving metrics from logs, switching between plaintext and JSON, and so on). Also, queueing of log sending should not happen inside the computes, but in a trusted environment.
Let's say our log storage is offline and the compute suspends. That would mean either losing the logs or keeping the compute online for retries.
Another option is to have a separate instance of vector outside the VM in the pod, configured to pull data from the in-VM instance [1].
I think what makes the most sense is to write logs to a socket provided by the host. Then we can consider the rest of the pipeline as an implementation detail.
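As a rough illustration of the socket idea (not a concrete proposal: the socket path and the "pipe stdin into it" framing are hypothetical), the in-VM side would only need something like this, with everything past the socket left to the host-side pipeline:

```go
package main

import (
	"bufio"
	"net"
	"os"
)

// Hypothetical sketch: forward whatever arrives on stdin (e.g. postgres stderr
// piped into this program) line-by-line to a unix socket provided by the host.
func main() {
	conn, err := net.Dial("unix", "/run/host-log.sock") // placeholder path
	if err != nil {
		os.Exit(1)
	}
	defer conn.Close()

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		conn.Write(append(scanner.Bytes(), '\n'))
	}
}
```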
Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute instance level.
What do you mean? Are you talking about updating credentials?
Updating/rotating the credentials is one thing; others are building metrics from the logs, relabeling, adding labels, and changing the log collector to something else.
Or dependence on a particular observability agent in general? I don't think we can escape that dependence.
We can switch the observability agent rather easily when it runs outside of the virtual machines. That's currently possible, and I don't think it makes much sense to make it harder, nor to waste customers' CPU time and memory on running such things.
From discussing with @Omrigan earlier: one simplification we can make is to just get logs from the VM to stdout in neonvm-runner (the container running the VM). We already have log collection in k8s, so we can piggy-back on that, which makes this easier than trying to push the logs to some other place.
Notes from discussion:
We have an occurrence of non-postgres log spam (in this case, oom-killer), which won't be fixed by https://github.com/neondatabase/cloud/issues/8602
https://neondb.slack.com/archives/C03F5SM1N02/p1707489906661529
Occurrence of log interleaving that could potentially be fixed by this, depending on how we implement it: https://neondb.slack.com/archives/C03TN5G758R/p1714057349130309
xref https://github.com/neondatabase/cloud/issues/18244: we have a customer ask to export the postgres logs to an external service, so they can inspect their own logs themselves (e.g. via Datadog).
We haven't fully specced that out yet, but the assumption so far is that we would reuse the OpenTelemetry collector we already deploy for metrics collection and route the logs through it.
Regarding pushing logs to console / k8s logs: the volume will be too large in some cases, e.g. if the user cares about pg_audit logs, so this will become a bottleneck. It also does not solve the labeling problem, which we care about for the product: the customer only wants their postgres logs, not our own control logs. Better to export directly through the network (see the point below).
Regarding push/pull and credentials: one option is to have a service running inside the VM that accepts incoming connections, and delivers the logs from the VM through that. Would that solve the problem?
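A minimal sketch of what such an in-VM service might look like, assuming it simply serves a postgres log file over HTTP so the collector outside the VM can connect in and fetch it; the port, file path, and the lack of any auth or tailing are placeholder simplifications:

```go
package main

import (
	"io"
	"net/http"
	"os"
)

// Hypothetical sketch of an in-VM log endpoint: the collector outside the VM
// connects in and pulls the logs, so no push credentials live inside the VM.
func main() {
	http.HandleFunc("/logs/postgres", func(w http.ResponseWriter, r *http.Request) {
		f, err := os.Open("/var/log/postgresql/postgresql.log") // placeholder path
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer f.Close()
		io.Copy(w, f) // a real version would tail/follow rather than one-shot copy
	})
	http.ListenAndServe(":10301", nil) // placeholder port
}
```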
A potential way we can implement this: each program's stdout is redirected to a file under /var/log/neonvm/, e.g. /var/log/neonvm/postgres_exporter.stdout.log:
./neonvm/bin/postgres_exporter ... > /var/log/neonvm/postgres_exporter.stdout.log
Alternatively, we stop using busybox init to start our programs and instead have the neonvm-daemon start everything. It would then have access to their stdout/stderr, which could in turn be forwarded to virtio-serial.
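A rough sketch of that alternative, assuming neonvm-daemon itself launches the child process and points its stdout/stderr at a virtio-serial port; the port name and the binary/arguments are placeholders:

```go
package main

import (
	"os"
	"os/exec"
)

// Hypothetical sketch: instead of busybox init, the daemon starts the child
// process and forwards its stdout/stderr straight to a virtio-serial port,
// where the host side can pick the stream up.
func main() {
	port, err := os.OpenFile("/dev/virtio-ports/logs", os.O_WRONLY, 0) // placeholder port name
	if err != nil {
		panic(err)
	}
	defer port.Close()

	cmd := exec.Command("/neonvm/bin/postgres_exporter") // placeholder binary/args
	cmd.Stdout = port
	cmd.Stderr = port
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```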
This doesn't cover neondatabase/cloud#18244, but if we implement the above, we can then write logic that filters the needed postgres logs and sends them to the customer's endpoint directly (or via a functionless shim), so that the customer's logs are never processed outside of the VM.
I have no opinion about how we want to handle dmesg etc., but I have a strong opinion that we should have postgres write its output somewhere other than stderr (directly into syslog) and collect the postgres log separately.
If we want a copy collected via postgres_exporter, this can be done by configuring syslog to fork the data into a second stream alongside the network collector.
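To make that concrete, a sketch of the two pieces, assuming rsyslog inside the VM; the facility, file path, and collector address are placeholders:

```
# postgresql.conf: send postgres output to syslog instead of stderr
log_destination = 'syslog'
syslog_facility = 'LOCAL0'

# rsyslog: fork the postgres stream into a local file (for postgres_exporter
# to pick up) and a network collector. "@@" means forward over TCP.
local0.*  /var/log/postgresql/postgresql.log
local0.*  @@logs-collector.example.internal:514
```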
Motivation
These combine to significantly impair the UX of our observability for VMs.
DoD
kubectl logs
during local development
Implementation ideas
TODO (various ideas, need to discuss)
Tasks
Other related tasks, Epics, and links
577