google / schedviz

A tool for gathering and visualizing kernel scheduling traces on Linux machines

Upload trace fails because traces are misordered #14

Open david-laight opened 4 years ago

david-laight commented 4 years ago

I've been getting 'inference error (CPU)' trying to load some traces. Having written a program to read and dump (and adjust) the raw trace data, I think I know what is happening.

A process (pid 3857) in the trace (I'll try to attach) has run a few times on cpu2. At (raw) time 39866b80c873 it is migrated to cpu1 (stolen by) and runs. At 39866c19166e cpu1 migrates it back to cpu2 and then enables it. At 39866c1a093f it executes on cpu2. These are the only 'migrates' for pid 3857 in the trace, so they must be relevant.

The file upload fails because the process state is 'unknown' on cpu2 @179948522, having gone 'sleeping' on cpu1 @179792871.

Digging through the *.go files I think I understand the problem. The large absolute timestamps are converted to small values in setTimestampNormalize(). This code subtracts the timestamp of the first real trace entry from all the others. However, it is only (AFAICT) looking at the data from a single cpu, so it subtracts different amounts from the timestamps in the different files. This slight difference doesn't usually matter, but when a migrate is done by the cpu the process last ran on, immediately prior to being woken, the process schedule trace can end up with a timestamp earlier than the migrate.

Note that the trace files contain a lot of 'pad' entries where they've been 'adjusted' so that they all start at the same time and only contain valid trace. (Oh, and can you also fix ringBufferEvent() so it doesn't reject 'Padding' entries with a time interval of zero - these happen when a pad is added after a 'time extend' entry.)

trace.tar.gz (https://github.com/google/schedviz/files/4035275/trace.tar.gz)

david-laight commented 4 years ago

That analysis isn't quite right. I can't work out what time offset it is subtracting from each trace file - the epoch doesn't seem to line up at all. Looking further, it seems to normalize the timestamps to the first 'unclipped' event. That is just bollocks. If one of the raw event headers has 'clipped' set, then all the earlier events should be deleted. (The kernel seems to return a 'junk' page at the start of the raw data.) If I delete the 'clipped' flag I get closer to loading a valid file (all the unwanted trace is pad).

david-laight commented 4 years ago

Something very strange is going on - see screenshot. The data for cpu:2 at time 0 should be (almost) 300ms later, but the events have been given timestamps relative to the start of the raw data page, ignoring the intervals of all the 'pad' entries between the start of the raw page and the first actual trace. cpu 1 and 3 have similarly structured files. cpu 0 data starts at the beginning of the 2nd page. (The first page of all 4 traces is all 'pad' entries because the 'clipped' flag is set at the beginning of the 2nd page.)

Screenshot_2020-01-08 SchedViz

david-laight commented 4 years ago

Trace file for above:

trace.Screenshot_2020-01-08.tar.gz

ilhamster commented 4 years ago

Hi, David,

I'll try to address your questions and discussion below, but tl;dr: at https://github.com/google/schedviz/blob/master/server/fs_storage.go#L110, change the NewCollection call to:

    collection, err := sched.NewCollection(
        collectionProto.EventSet,
        sched.UsingEventLoaders(sched.FaultTolerantEventLoaders()),
        sched.NormalizeTimestamps(true))

(that is, adding the sched.UsingEventLoaders(sched.FaultTolerantEventLoaders()) collection option) and your collection will load.

Now, to your questions and comments, and some background:

  1. You discuss timestamp normalization. That's a step much later in the pipeline, and is unrelated to inference errors like this. The idea there is that in many cases event timestamps are in a fairly useless epoch (e.g., nanoseconds since boot). Timestamp normalization simply shifts all events' timestamps so that the trace starts at 'time 0'.
  2. Clipping is something we do internally, but don't yet have externally; this lack may be the source of your problem. As you probably know, in the kernel, each CPU streams its events to its own individual buffer, whose size is specified at trace configuration. When these buffers overflow, we can do one of two things: overwrite the oldest records (which occurs in flight-data-recorder mode), or stop emitting new events (which occurs in one-shot mode). In either case, we lose information for the overflowed CPU, either at the start of the trace (in the former case) or at the end of the trace (in the latter). Since scheduling is a whole-system phenomenon, we cannot tolerate such lost data, and therefore we must correct for these overflows. This is done by 1) identifying which CPU buffers overflowed; 2) finding, among these buffers, the latest first event (for FDR mode; for one-shot mode, the earliest last event); and 3) marking all events, in any CPU buffer, earlier (later) than that event's timestamp as 'clipped' - a sketch of this pass appears after this list. Clipped events will then not be considered by SchedViz. Again, I don't think we're doing this in the external SchedViz; it's certainly something we'd like to do and would welcome contributions on.
  3. It looks like the kernel's own trace reader ignores pad events https://github.com/torvalds/linux/blob/master/kernel/trace/ring_buffer.c#L3649 .
  4. These scheduling events are not entirely reliable. For example, we frequently get multiple wakeup events for the same sleeping thread, and sometimes wakeups on running threads. There are also some race conditions in the kernel that can apparently reorder events so that their ordering is incoherent (e.g., a thread wakes up on CPU B, then appears to migrate from CPU A to B). This phenomenon is exacerbated as more event types are collected, since the problematic mechanism is hit when an event is emitted while another event is being emitted. To handle this, we've introduced a variety of event loaders https://github.com/google/schedviz/blob/master/analysis/sched_event_loaders.go. The default event loader tolerates duplicate wakeups and wakeup-while-running, since this is a really common issue, but anything else is rejected with an inference error. But there are two others provided: the fault tolerant loader, which allows problematic migrates to be dropped, and a switch-only loader, which ignores all events but switches. These represent increasingly permissive event-stream interpretation, with increasing uncertainty (though typically the fault-tolerant loader only needs to drop O(10) events, out of millions, to make the trace coherent.)
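For concreteness, here is a minimal sketch of the clipping pass from point 2, for flight-data-recorder mode. The Event and Buffer types are hypothetical stand-ins rather than SchedViz API; the sketch assumes each per-CPU buffer knows whether it overflowed and holds its events in timestamp order.

    package clipping

    // Event and Buffer are hypothetical stand-ins for per-CPU trace
    // data, not SchedViz types.
    type Event struct {
        Timestamp uint64 // raw event time
        Clipped   bool   // true once the event is excluded from analysis
    }

    type Buffer struct {
        Overflowed bool    // this CPU's ring buffer wrapped during tracing
        Events     []Event // in timestamp order
    }

    // clipOverflowed implements the FDR-mode rule described above: find
    // the latest first-event timestamp among overflowed buffers, then
    // mark every event on any CPU earlier than that cutoff as clipped.
    func clipOverflowed(buffers []*Buffer) {
        var cutoff uint64
        for _, b := range buffers {
            if b.Overflowed && len(b.Events) > 0 && b.Events[0].Timestamp > cutoff {
                cutoff = b.Events[0].Timestamp
            }
        }
        for _, b := range buffers {
            for i := range b.Events {
                if b.Events[i].Timestamp < cutoff {
                    b.Events[i].Clipped = true
                }
            }
        }
    }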

It looks like the per-CPU buffers in your trace are really small (20KiB), which means that buffer overflow is very likely, and the lack of clipping problematic. However, the good news is that the fault-tolerant loader is generally able to work around the lack of clipping, so you can use it to load your traces. You might also try increasing the per-CPU buffer size (or decreasing trace duration) until overflow is uncommon. Internally, we mostly use 8MiB per-CPU buffers, and very seldom overflow even on 2s collections; the time-to-overflow of course depends on the system load and the amount of scheduling thrash, but traces under 10s very rarely overflow at all.

Lee


david-laight commented 4 years ago

I need to load traces that overflow. I intend to run tests that take hours and trigger the trace to stop when there are excessive scheduling delays. Since the test is processing real-time audio, 100ms of trace is a long time! I will use a slightly longer trace, but I'm testing the loading of wrapped traces.

When I've dumped the raw trace files with my own decoder, I don't see any missing events or wakeups mis-sequenced against migrates.

If I (well, my program!) look through all the trace files, I can find the timestamp of the first trace after the wrap (typically the first trace in the second 4k page) and take the latest of those timestamps across all the cpus. I then want to discard all the entries before that timestamp, since the trace after that point is complete. If I just change all the unwanted entries to 'pad' ones, the trace still fails to load. Zeroing the header flags byte doesn't help. Changing the page header timestamp to a few ns before the first kept trace and using a single pad to delete all the unwanted entries gives me a trace file that loads.
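For reference, the rewrite that works looks roughly like this. It is a sketch rather than my actual program, and the layout is my reading of the kernel's ring-buffer code for a 64-bit little-endian machine: an 8-byte page timestamp, an 8-byte commit field, then 4-byte-aligned events, where a padding entry with a nonzero time delta stores its length, less the 4-byte header word, in the following word.

    package rewrite

    import "encoding/binary"

    const (
        pageHdrSize = 16 // u64 page timestamp + u64 commit count (assumed)
        typePadding = 29 // RINGBUF_TYPE_PADDING
        typeLenBits = 5  // low 5 header bits: type; high 27: time delta
    )

    // padOutPrefix overwrites the first n bytes of event data on a raw
    // trace page with one padding event, and backdates the page
    // timestamp to just before the first event being kept.
    func padOutPrefix(page []byte, n int, firstKeptTS uint64) {
        binary.LittleEndian.PutUint64(page[0:8], firstKeptTS-1)
        // Header word: a nonzero time delta (here 1) selects the
        // padding form whose length lives in the following word.
        binary.LittleEndian.PutUint32(page[pageHdrSize:], 1<<typeLenBits|typePadding)
        // Length of the padding after the 4-byte header word.
        binary.LittleEndian.PutUint32(page[pageHdrSize+4:], uint32(n-4))
    }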

So I'm fairly certain that something is wrong with the logic that timestamps the trace entries and then merges them into a single trace.

I can probably debug it, but the instructions I have for building and running it aren't very informative about what they are actually doing. What do I have to rerun to rebuild the code that runs when the browser uploads a file? I'm guessing I don't need to restart the webserver after such a change.

ilhamster commented 4 years ago

Did you try the code change I suggested at the start of my response? That works for me.


david-laight commented 4 years ago

Is the foo.go code interpreted? I've never looked at any before.

If that change just allows the process state transition through, it is just papering over the cracks in the code that parses the traces. There are no missing (or mis-sequenced) traces in the files I'm using when I decode them with a little program I wrote.

ilhamster commented 4 years ago

It is compiled. You'll need to rebuild and rerun.

I don't believe it is 'papering over the cracks'. I believe its results are correct, and that it's dropping only those events that conflict with the preponderance of evidence from all the other events.

If you'd like to use your own trace interpreter to generate processed traces to view in SV, this is quite possible; look at how we load traces from ebpf for an example.


david-laight commented 4 years ago

I've managed to add some tracing and found one of the problems. When I try to delete unwanted events by converting them to 'pad' entries, their time deltas aren't added in. The library code in linux/tools/lib/traceevent/kbuffer-parse.c (which is what I based my code on) assumes that the high 27 bits are a time delta for all trace types. The schedviz code, on the other hand, has some strange comments about the time delta of pads (zero vs non-zero means something), and it fails to add in the time delta for pad entries. Always adding in the time delta would simplify the code - especially since I'm 99.999% sure type 31 entries haven't been used for a long, long, long time.
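To make the point concrete, here is a sketch of that kbuffer-parse.c-style interpretation. This is my reading of the raw format, not schedviz code: every header is a u32 whose low 5 bits are the type and whose high 27 bits are a time delta, and the delta is accumulated for every entry, pads included; a 'time extend' (type 30) carries 32 more delta bits in the word after the header.

    package decode

    const (
        typeLenBits = 5
        typeLenMask = 1<<typeLenBits - 1
        typeTimeExt = 30 // RINGBUF_TYPE_TIME_EXTEND
    )

    // advanceClock applies one event header to the running per-CPU
    // clock, treating the high 27 bits as a time delta for every entry
    // type, pads included. For a time extend, the word after the header
    // (extra) supplies 32 more delta bits above the 27 in the header.
    // (Type 31 would carry an absolute timestamp instead, but as noted
    // above it does not seem to appear in practice.)
    func advanceClock(clock uint64, header, extra uint32) uint64 {
        typeLen := header & typeLenMask
        delta := uint64(header >> typeLenBits)
        if typeLen == typeTimeExt {
            delta |= uint64(extra) << 27
        }
        return clock + delta
    }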

Anyway, I've now got a program that deletes the early events from traces that have wrapped.

An extra question: if I trace an extra event (and include its format in the tar file), will I be able to enable it as a marker on the 'events' page?

ilhamster commented 4 years ago

You should be able to include non-sched events and visualize them in the viewer.


sabarabc commented 4 years ago

The "strange comment" I think you are referring to comes from the kernel: https://github.com/torvalds/linux/blob/b07f636fca1c8fbba124b0082487c0b3890a0e0c/include/linux/ring_buffer.h#L24-L31

The time_delta doesn't appear to be used for padding events when iterating over events in the kernel (this is the code we looked at when writing the SchedViz parser): https://github.com/torvalds/linux/blob/b07f636fca1c8fbba124b0082487c0b3890a0e0c/kernel/trace/ring_buffer.c#L3641-L3649

david-laight commented 4 years ago

I was looking at this https://github.com/torvalds/linux/blob/master/tools/lib/traceevent/kbuffer-parse.c#L345 which is intended to be library code that applications can use to read the raw trace.

It doesn't bother to parse absolute time stamps - I suspect they are never used (it skips the wrong length!).

The kernel probably never writes 'pad' entries (except at the end of trace blocks), so it knows the time update is irrelevant.

If 'pad' can be used for deleted items then the timestamp probably needs to be processed.

david-laight commented 4 years ago

Just FYI, attached is a screenshot of a 'kworker/u8:3' process running without allowing pre-emption for 3.1ms. Unfortunately the name is badly truncated (actually it isn't, but it isn't useful). I think it is scheduled from 'Xorg'. Screenshot_2020-01-10 SchedViz

Since I'm running with threads pinned to cpus (maybe not such a good idea unless they are reserved at boot time), everything is delayed.