Infinited loop detected when debugging tainting with llvm debug build

Sisyph commented 6 years ago

principal failure: abort, infinite loop detected

changing llvm build config to release+asserts from debug+asserts removes failure

only clue, in main loop of cpu-exec.c, num_instr_before_next_interrupt = 0, but on next iteration cpu_handle_interrupt does nothing. Looks like interrupt number is 2

loop threshold of 20 instruction repetitions not really relevant, it goes at least 100K

llvm commit a1e338368f5df797f3a1da81adac794340e30d11 debug with asserts build

panda master HEAD as of 7/24/18 debug build

taint2 makefile -DTAINT2_DEBUG

Recordings and logs available as 73MB zip on request analysisscript.txt machconf.txt

m000 commented 6 years ago

I'm also getting an infinite loop message. In my case the message appears even when no plugins are run. It looks like this:

$ ./build-panda/i386-softmmu/qemu-system-i386 -monitor stdio -m 512 -os linux-32-ubuntu:4.4.0-130-generic -replay ubuntu16-test

PANDA[core]: os_familyno=2 bits=32 os_details=ubuntu:4.4.0-130-generic
QEMU 2.8.50 monitor - type 'help' for more information
(qemu) loading snapshot
... done.
opening nondet log for read :   ./ubuntu16-test-rr-nondet.log
rr_guest_instr_count = 0
qemu-system-i386: /home/mstamat/spammer/panda/cpu-exec.c:695: detect_infinite_loops: Assertion `false' failed.

Are you getting a similar message, or did we hit different bugs?

The replay I am using can be found here: https://surfdrive.surf.nl/files/index.php/s/Nk204ShtxQ3Q5FX

The replay completed fine in another branch of my private repo. I did a git log panda/src/rr in the working and non-working branches. I've included the output in rr-log-working.txt and rr-log-notworking.txt.

@moyix , were the changes introduced by the delta of the commits meant to break older replays?

@nathanjackson , any obvious fixes?

moyix commented 6 years ago

I did notice that there was a breaking change with older replays but I haven't bisected it yet – if you have a second and feel like trying that it should be pretty quick (and automatable).

My quick guess is it has to do with the VAPIC bugfix, though that might be too long ago to be relevant: https://github.com/panda-re/panda/pull/284

m000 commented 6 years ago

Copying my description of the problem from #329 .

It seems that the changes introduced to cpu-exec.c by ea8afa841f4228c91cbb2bdcd00bf3c36046259e are responsible for the problem.

I am using the LLVM deb packages from @phulin 's PPA. However, I doubt that this has to do anything with the bug (cpu-exec.c is not touched by LLVM).

I can reproduce the case with the trace I have uploaded here: https://surfdrive.surf.nl/files/index.php/s/Nk204ShtxQ3Q5FX

I have added these two print statements at the top of int cpu_exec(CPUState *cpu):

fprintf(stderr, "rr_guest_instr_count = %lu\n", rr_get_guest_instr_count());
fprintf(stderr, "rr_num_instr_before_next_interrupt = %lu\n", rr_num_instr_before_next_interrupt());

They both print 0. As a result, cpu_loop_exec_tb() which progresses execution will never be executed:

if (!rr_in_replay() || until_interrupt > 0) {
...
cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit, &sc);
...
}

Replaying works fine with an older branch I have that doesn't include these modifications to cpu-exec.c

m000 commented 6 years ago

@moyix Indeed, ea8afa841f4228c91cbb2bdcd00bf3c36046259e is part of #284. But I'm not sure what would be an appropriate fix.

@Sisyph, if I understand correctly, for you the infinite loop manifests halfway through the replay?

For my case replay jumps right into the infinite loop (rr_get_guest_instr_count() = 0).

nathanjackson commented 6 years ago

Unfortunately nothing about this will be obvious\easy due to the nature of the problems that I fixed (or at least attempted to fix... based on the comments above that remains to be seen). Even coming up with the changes that I did took quite a bit of time using Mozilla RR and stepping through the code.

@m000 are your recordings pre-#284?

I'm going from memory here as I'm not in the office this week so I don't have my notes in front of me. For #284 I had tested against Windows 2000 and XP pretty much exclusively during development. Before I did the PR I tested Linux, various versions of Windows, and an ARM variant of Alpine Linux and had success recording\replaying, but I did not exhaustively test older recordings of the same\similar variants. The older recordings I tested were Windows 2000 and XP recordings that I had made and found that the recordings themselves had IO events out of order due to the race condition. The guest instruction count would change during a DMA write if I remember correctly. I thought that this issue was limited to Windows 2000\XP since that's where we saw the race condition, but maybe it was happening with other operating systems too.

Since some of the QEMU commits that were brought in made changes to the CPU loop, I had to adapt cpu-exec.c for record\replay. Its possible that something might be slightly out of order in the new loop, but it should be close.

If I remember correctly @Sisyph's issue could be taken care of by removing the usb-tablet device? I'm not sure if that's a red herring or if that could be a clue to fix @m000's issue.

On Monday, I'll see if I can find anything on my end of things with @m000's recording.

m000 commented 6 years ago

Thanks for taking the time to comment @nathanjackson. #284 is indeed pretty complex, so some rough edges may have gone unnoticed.

I've been looking on cpu_exec() function this afternoon and the adaptation looks alright. The problem seems to be with the cpu_handle_interrupt(cpu, &last_tb) call inside cpu_exec().

In my older branch rr_num_instr_before_next_interrupt() returns 0 before the call and 8927. This allows the execution to progress later because it will be until_interrupt > 0. In master however, rr_num_instr_before_next_interrupt() returns 0 before and after the call, thus the infinite loop.

My recording is pre-#284 and is only a test recording. So I can probably dodge this by re-recording. The bug still hurts the project as it impacts the share-ability of recordings, so I believe it is important to squash it.

I haven't touched the input devices, so unless usb-tablet is a default, I am not using it. The command line for the recording was something like this:

./build-panda/i386-softmmu/qemu-system-i386 -monitor stdio -m 512 -hda ./panda-qcow/ubuntu16-planb.qcow2 -netdev user,id=unet0,hostfwd=tcp::10022-:22 -device e1000,netdev=unet0

I'll be AFK as well for the next few days, so I'll resume on Monday.

m000 commented 6 years ago

Minor update. I checked cpu_handle_interrupt() and not much have changed. In my trace if (unlikely(interrupt_request)) {} is skipped because interrupt_request is 0. So rr_interrupt_request() is next to be checked. But now I have to catch some sleep.

nathanjackson commented 6 years ago

@m000 What were you actually doing when you made the recording? It'd be helpful to know so I can create a similar recording to compare before\after #284.

I'm looking at the non-deterministic log and I'm seeing what I think is the race condition manifesting itself when the recording was created. This was similar to the issues I was seeing when working on 2000\XP support. Here's a snippet:

{guest_instr_count=0}
        RR_INTERRUPT_REQUEST_2 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_POLL
{guest_instr_count=0}
        RR_INPUT_4 239 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_BEFORE
{guest_instr_count=0}
        RR_INTERRUPT_REQUEST_0 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_INTNO
{guest_instr_count=140}
        RR_INPUT_8 2800241490624 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=300}
        RR_INPUT_8 2800243498756 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=365}
        RR_INPUT_8 2800243761532 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=444}
        RR_INPUT_8 2800244687964 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=823}
        RR_INPUT_8 2800249552860 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=1796}
        RR_INPUT_8 2800260635724 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=2278}
        RR_INPUT_8 2800262216496 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=5725}
        RR_INPUT_8 2800267644258 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=6107}
        RR_INPUT_8 2800269143440 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=6384}
        RR_INPUT_8 2800269192924 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=7951}
        RR_INPUT_8 2800270504676 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=8749}
        RR_INPUT_8 2800272751476 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER
{guest_instr_count=8927}
        RR_INTERRUPT_REQUEST_16 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_POLL
{guest_instr_count=8927}
        RR_INTERRUPT_REQUEST_2 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_INTNO
{guest_instr_count=8935}
        RR_INPUT_4 239 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_BEFORE
{guest_instr_count=8935}
        RR_INTERRUPT_REQUEST_0 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_INTNO
{guest_instr_count=9188}
        RR_INPUT_8 2800273477360 from RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER

Notice the multiple RR_CALLSITE_CPU_HANDLE_INTERRUPT_AFTER entries. When an interrupt is replayed, I would expect before interrupt, handle interrupt, and after interrupt in that order.

Also notice that RR_INPUT_8 is in the log during the numerous CPU_HANDLE_INTERRUPT_AFTER entries but rr_interrupt_request(...) I think should have written RR_INTERRUPT_REQUEST instead.

This isn't dismiss the fact that this recording plays back in an older revision, but rather to provide some context to the changes in #284. I suspect that there will have to be some work around, but I'm not sure what it is at this point.

@moyix thoughts on the above?

Sisyph commented 6 years ago

@m000 Sounds like we are having the same problem. The only differences I can see is mine comes midway through the replay.

I checked cpu_handle_interrupt() and not much have changed. In my trace if (unlikely(interrupt_request)) {} is skipped because interrupt_request is 0

This is exactly what I found in trying to debug my replay. And I traced through about that far without finding any obvious problem.

nathanjackson commented 6 years ago

This issue might be a lot simpler than I thought. @moyix I just found out about the panda-users mailing list and read a message you sent back in May.

Is this just because RR_CALLSITE_CPU_HANDLE_INTERRUPT_POLL was added to the start of the list of callsite IDs?

nathanjackson commented 6 years ago

@m000 can you try my branch issue-322? It gets me farther into the recording, but I think you're non-deterministic log still has the issue where the VAPIC interrupt isn't recorded.

This is by no means a permanent solution. This issue-322 branch would break replay of recordings that have been made since early April.

moyix commented 6 years ago

Yeah. Unfortunately we don't have versioning for the file format; this might be a good thing to think about adding.

We should also definitely add a comment near those enums warning that changing their values will break backward compatibility (and, actually – maybe we want to make the values explicit using the NAME = value syntax?).

If we change the ordering back to what it was, will this make older replays (pre-April) work again? Or will the VAPIC interrupt recording changes still cause problems?

m000 commented 6 years ago

@nathanjackson thanks for taking the time to check this.

I've tried updating rr_log_all.h to nathanjackson/panda@6edfbbe874aa19b358e02f185ae0d9473f6e070f. The replay ran a little longer, only to stop after 8927 instructions with:

rr_assertion `current_item' failed at /home/mstamat/spammer/panda/panda/src/rr/rr_log.c:871

I'm not sure if ACTION(RR_CALLSITE_CPU_HANDLE_INTERRUPT_POLL) was deliberately placed after ACTION(RR_CALLSITE_LAST) in nathanjackson/panda@6edfbbe874aa19b358e02f185ae0d9473f6e070f. At any rate, I've also tried moving it before ACTION(RR_CALLSITE_LAST) but the result was the same.

I've also tried with a variant moving ACTION(RR_CALLSITE_IDE_*) block. This has been added after PANDA version used to record my trace. It should also affect the replay of the trace, as the trace is using the e1000 NIC. Again, the result was a crash after 8297 instructions.

nathanjackson commented 6 years ago

@m000 I saw the same results, stopping at 8927. RR_CALLSITE_CPU_HANDLE_INTERRUPT_POLL was placed after RR_CALLSITE_LAST because I wasn't sure if LAST's position in the enum list had any significance in terms of the non-deterministic log. I tried both ways as well and what was currently on that branch was just a result of my last attempt.

I've got a fix on the issue-322 branch and now I can replay your recording. I removed all of the APIC poll event but that made me question whether or not we actually need that fix, even for XP. From my notes, its pretty clear that I was seeing divergence on replay in XP without the VAPIC fix but for the life of me I can't reproduce that issue. I'm wondering if what actually fixed the issue was reverting the removal of the locks in ab58106013f2c462d0eac8a8ac63526caac46ed1 and VAPIC specific issues were a distraction. The VAPIC commit was made before the revert, so this seems like a good possibility and XP still works without it now. What I'm thinking is that the VAPIC fix was a workaround for the divergent behavior but the "real" fix which was to keep the global locks that QEMU fixed.

If I can remove the VAPIC record, then that simplifies things a bit. We could just have a script to fix April-present recordings. Intuitively, it seems more important to maintain compatibility with older recordings than relatively recent ones.

In general though versioned logs and/or some way to extend the non-deterministic log would be nice. I've been working on a feature that adds a few call site types and while it shouldn't affect any old recordings the way its currently written, it would not surprise me if future changes have a similar result.

Let me know what happens with the branch.

m000 commented 6 years ago

@nathanjackson Yes, with the new temporary (?) fix I was able to run through the replay and finish my work on #332. Thanks again for taking the time to check this out. There was still a minor glitch: PANDA didn't exit after the end of the replay message.

What's weird is that I'm fairly sure that I've also tried reverting e6e5c34388dd4117a2e19483415ae29d04aad18f but it didn't work. Maybe I reverted more commits at the same time.

I'm not sure what the best strategy would be for adding versioning to the traces. Maybe a combination of capture date + some internal numbering?

Bit in addition to that, we should also consider a repository for collecting traces to help us do some rudimentary regression testing. This would have helped us e.g. quickly test if the XP divergence still manifests with the issue-322 branch. The repository could be as simple as a google drive folder or a web server folder + an upload form.

m000 commented 6 years ago

@Sisyph Can you check if the changes from issue-322 also fix your trace?

nathanjackson commented 6 years ago

@m000 I don't recall seeing PANDA not exiting. We certainly have to make these changes, but I'm not sure if @moyix wants to keep the VAPIC type and then we could just ignore it in recordings or if we should get rid of it and provide a way to fix recordings made with a version of PANDA since April. I tend to want to just bite the bullet and get rid of it the new type so as not to cause confusion in the future. The impact should be minimized in this case.

When you were trying to make it work by reverting e6e5c34, did the error change? Another possibility, did you happen to remove my merge commit? There are some fixes that have to be in there to make other operating systems work with the slightly newer QEMU baseline.

In regards to versioning, this may be better addressed in another issue but here's my two cents once we get past this infinite loop problem. Admittedly, I didn't realize that PANDA 2's format wasn't suppose to change, but I also didn't realize that I was breaking anything by adding the event type. My understanding was that PANDA 2 was in development. That being said, I think there should be some sort of release cycle. That way there can be a "stable" PANDA and an in development version. I would also suggest maybe changing the log format in the long run to be based on protocol buffers or something similar. The difference in size should be minimal, but it could alleviate our versioning issues pretty easily. Another alternative would be to change the log format to be written out in BSON or message pack. The advantage here is that the log format is schema-less, so the non-deterministic log could be easily extended. We have multiple use cases where it would be great for plugins to be able to run at the time of recording to record some extra data and a schemaless format would be great. This of course would have the added expense of size, but I suspect that adding compression would alleviate some of those issues.

+1 for a repo to store recordings for regression testing. We all have our own use cases with various operating systems\architectures and its hard to know if you've broken something for the community at large.

moyix commented 6 years ago

I agree that versioning and a more structured format (that can tolerate extra messages inserted into the log without going off the rails) would be fantastic. Among other things, it would be nice to be able to add extra metadata as we think of it (for example, recording the command line args used at record-time). I'm not too concerned about size; I doubt there will be much overhead.

I'm slightly worried about serialization overhead at record time, since the overhead of recording is already a bit high, and the current scheme is pretty much just fwriteing raw structs. But perhaps with a bit of buffering this won't be too bad.

Periodically tagging "stable": what would the expectation be here? Would (non replay-breaking) bugfixes get backported to stable? It would be really nice to have a regular interval when we pull from QEMU master.

nathanjackson commented 6 years ago

I would say that we'd want to version PANDA itself. When PANDA reaches "stable" it would be called 2.0.0 since this is PANDA 2 after all. The version scheme would be based on semantic versioning. A major bump (from 2.x.x to 3.0.0) indicates a breaking change between replays. This should be extremely rare. Minor and bugfix bumps should be compatible with any previous recording within the 2.x.x series. For example, 2.0.6 and 2.1.0 should be able to replay 2.0.0 recordings without any problem. However, for minor bumps (like 2.0.0 -> 2.1.0), your recording might replay just fine, but you may not be able to use the new features that depend on new event types or added metadata. On the other hand, if you take a recording with 2.1.0 or 2.0.6, you should would not expect to be able to playback in 2.0.0 (it would be unsupported but not necessarily breaking).

We'd also want to consider the initial snapshot of memory too. If for some reason the snapshot becomes incompatible due to pulling from QEMU master, I would say that's a major version change.

Sorry if that's a mouthful.

nathanjackson commented 6 years ago

Oh @moyix, how do you want me to handle the fix for the infinite loop issue related to VAPIC? We could ignore the VAPIC event in recordings or we could get rid of it and try to provide a script to fix April-present recordings. I tend to want to just bite the bullet and get rid of it the new type so as not to cause confusion in the future. The impact should be minimized in this case.

moyix commented 6 years ago

Yes, I'm inclined to take out the new type. It shouldn't be too bad to modify rr_print to fix up the numbering for logs from April to now, and it's a one-time thing.

Sisyph commented 6 years ago

@m000 Unfortunately, these changes do not resolve my issue. I still get an infinite loop partway through the replay only when using debug llvm.

m000 commented 6 years ago

I like the implicit/semantic versioning proposed by @nathanjackson. If any incompatibilities can be contained in the PANDA core, that would work nicely. However, if incompatibilities also "leak" to plugins code, this may cause more headaches than it solves.

Maybe providing "migration" scripts each time the compatibility breaks is less effort. We could even build a small corpus of traces that PANDA should be able to process, now and in the future. This will enable early detection for many inadvertent compatibility breaks.

@moyix Is the current trace format documented outside the source code?

Given the opportunity, does anyone have experience with bson or msgpack? Would either be suitable for either PANDA trace or PANDA logging (instead of protobufs)?

(PS. sorry for the radio silence in the past days. I experienced some unscheduled downtime.)

nathanjackson commented 6 years ago

I have a PR to remove the poll interrupt callsite. Unfortunately, this did not solve @Sisyph's issue. But I do think its interesting that he only has this issue when debugging the LLVM instructions?

I'm creating an issue continue discussion about the log format.

github-actions[bot] commented 3 years ago

Stale issue message

panda-re / panda

Infinited loop detected when debugging tainting with llvm debug build #322