risingwavelabs / risingwave

SQL stream processing, analytics, and management. We decouple storage and compute to offer instant failover, dynamic scaling, speedy bootstrapping, and efficient joins.
https://www.risingwave.com/slack
Apache License 2.0
6.59k stars 538 forks source link

observability: make logs structural #10426

Open BugenZhao opened 1 year ago

BugenZhao commented 1 year ago

Currently when printing logs with tracing, we tend to format the runtime information into the message like this:

https://github.com/risingwavelabs/risingwave/blob/4f3bce2d024f958d1eeebc49a5556a81372fe824/src/meta/src/rpc/election_client.rs#L107-L122

Although intuitive, this approach lacks observability friendliness. It is hard to retrieve the lease_id from these messages after they're formatted, which makes it also challenging for us to build some analytics or alerting tools by consolidating related logs, traces, and metrics.

Instead, we should prefer structural logs instead. This is enabled by "recording fields" with tracing. It looks like...

https://github.com/risingwavelabs/risingwave/blob/4f3bce2d024f958d1eeebc49a5556a81372fe824/src/utils/pgwire/src/pg_protocol.rs#L345-L352

By explicitly specifying fields, the tracing subscriber will be able to decide how to handle them.

cc @fuyufjh @arkbriar @Nebulazhang

liurenjie1024 commented 1 year ago

So this means we need to refactor all logs?

BugenZhao commented 1 year ago

So this means we need to refactor all logs?

I think so, at least for those useful for debugging.

fuyufjh commented 1 year ago

+1

Using different log config for multiple deploying env is quite common.

So this means we need to refactor all logs?

From my experience, it's almost impossible to enforce every log to be structured. However, we can enforce it for the most critical logs and newly-added ones. It still helps.

arkbriar commented 1 year ago

From my experience, it's almost impossible to enforce every log to be structured. However, we can enforce it for the most critical logs and newly-added ones. It still helps.

Agree. It's more feasible to start on critical ones and then refactor the others gradually.

Besides, structured logs make everything based on logs easier, such as analysis on SQL logs.

arkbriar commented 1 year ago

From my experience, it's almost impossible to enforce every log to be structured. However, we can enforce it for the most critical logs and newly-added ones. It still helps.

Agree. It's more feasible to start on critical ones and then refactor the others gradually.

Besides, structured logs make everything based on logs easier, such as analysis on SQL logs.

But I'm concern about if the log collector/store recognizes hybrid format since we won't unify all the logs at once. Do you have any idea? @Nebulazhang

BugenZhao commented 1 year ago

The format won't be "hybrid". They are always in JSON if the env var RISINGWAVE_CLOUD is set after #10633.

For example, if a log event is structural, we may get something like...

tracing::debug!(actor_id, ?state, "take snapshot")
{
    "timestamp": "2023-06-28T17:17:35.404904+08:00",
    "level": "DEBUG",
    "fields": {
        "message": "take snapshot",
        "actor_id": 42, <- 👀
        "state": "[Nexmark(NexmarkSplit { split_index: 2, split_num: 4, start_offset: Some(315394) })]" <- 👀
    },
    "target": "risingwave_stream::executor::source::source_executor"
}

But if it's simply formatted into the message, we'll only get the message field.

tracing::debug!("take snapshot for actor {} with state: {:?}", actor_id, state)
{
    "timestamp": "2023-06-28T17:17:35.404904+08:00",
    "level": "DEBUG",
    "fields": {
        "message": "take snapshot for actor 42 with state: [Nexmark(NexmarkSplit { split_index: 2, split_num: 4, start_offset: Some(315394) })]"
    },
    "target": "risingwave_stream::executor::source::source_executor"
}

Although it may be harder to analyze these unstructural logs, this format should still be considered consistent with the structural one.

xxchan commented 11 months ago

Are there any actionable sub-issues?

BugenZhao commented 11 months ago

Are there any actionable sub-issues?

I have no idea currently. Maybe we need some practices of log analyzing in production to determine what needs prioritization.

github-actions[bot] commented 3 days ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

xxchan commented 2 days ago

No actionable tasks now. BTW, we can add a section in the developer guide https://github.com/risingwavelabs/risingwave/issues/17436