ufrisk / pcileech

Direct Memory Access (DMA) Attack Software

[USB3380] Incorrect dumps #36

Closed dev-zzo closed 5 years ago

dev-zzo commented 6 years ago

I am trying to use the USB3380 board to dump memory from a few test laptops. While dumping itself works, I have never managed to produce two identical dumps. Each one is different, with the differences grouped in blocks. Example: https://imgur.com/qF1Od0v

Same thing with testmemread -- it never completes properly. This points to the issue being in or below the DeviceReadDMA() function. Printing the addresses passed to Device3380_ReadDMA() doesn't reveal anything suspicious -- they are aligned and sequential.

Host: Win10 x64; targets: various. All exhibit similar symptoms.

I have also tried a Win7 x64 host, with the same result. In all cases I used the more reliable -usb2 option.
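
For reference, the runs look roughly like this (illustrative invocations only; exact option syntax is whatever the PCILeech readme for the installed version says):

```
pcileech.exe testmemread -usb2
pcileech.exe dump -usb2 -out memdump.raw
```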

How can I debug this further?

ufrisk commented 6 years ago

This issue is the main reason why I implemented the testmemread and testmemreadwrite functionality.

Unfortunately, it's not a bug in PCILeech. It's a bug in how the USB3380 interacts with some target computers. I've observed this behavior on two distinct older systems I tested on.

In your dump it looks like two 128-byte blocks are swapped. I suspect this is due to the USB3380 hardware handling received PCIe TLPs incorrectly (i.e. assuming memory read completions arrive in order when there is no absolute guarantee of that). This is just speculation on my part; I don't know for sure.
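
Concretely, the symptom would look something like this (offsets illustrative only):

```
expected:  [block A @ 0x000-0x07F][block B @ 0x080-0x0FF]
observed:  [block B @ 0x080-0x0FF][block A @ 0x000-0x07F]
```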

Unfortunately I never managed to resolve the problem. Since it only affected one very old test computer badly, and another test computer just a little bit, I moved on to other target systems that never had this problem.

I'll see if I can get hold of that old system I had trouble with in the past to check whether this is indeed the case. I'll probably be able to observe it using the FPGA hardware that I also support. Also, the FPGA hardware wouldn't have an issue with this, if the error really is what I believe it to be.

dev-zzo commented 6 years ago

I see. I observe that the corruption is in fact granular on a 0x40-byte boundary...

I suspect there might be an issue with how the FIFOs are controlled, e.g. the read and write pointers getting out of sync for whatever reason. This would explain the weird behaviour I am seeing with my two systems, but I'm not sure how to test or fix that. IIRC there was a way to reset all the FIFOs; I will give it a go tomorrow.

If the root cause is the ordering of completions, then setting iosize to e.g. 0x40 should make the problem disappear. I tried that and the problem persisted, so it may be something else instead.
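
The test was roughly this (illustrative invocation only):

```
pcileech.exe dump -usb2 -iosize 0x40 -out test.raw
```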

For the record, the two systems I tested on are a Lenovo S10-2 and an Acer Aspire 4730Z. Not so new indeed. :-)

I will also test against newer systems during the coming days. I have a Lenovo B50 handy, and will have access to more on Monday.

ufrisk commented 6 years ago

The -iosize parameter unfortunately only has a granularity of 0x1000 bytes.

I'll see if I can find that old test computer I had so much trouble with next week so that I can look into this again. Last time I never came up with an acceptable solution though. If one needs to dump memory in 0x40-byte chunks, transfer speed will drop dramatically (probably below 1MB/s).

dev-zzo commented 6 years ago

Didn't know there was a granularity limit on that parameter. Then my results are inconclusive. :-D

Could you suggest any tests for me to run in the meanwhile?

dev-zzo commented 6 years ago

I've run a few tests against other laptops we have in the house, the newest one being a Dell Precision 7510, a rather modern system. Unfortunately, the issue can be reproduced there as well, for example if I boot into the BIOS setup screen.

At times it also starts failing every page read after reading X pages successfully, which I suspect to be an unrelated issue.

ufrisk commented 6 years ago

Thanks for the extra info. I'll try to see if I can find a system that has this problem, but the known-bad one I had earlier is no more.

If you know how to code C you could alter the read size in https://github.com/ufrisk/pcileech/blob/master/pcileech/device3380.c#L144 (loop over 0x40-byte chunks). Or I could look into this during the weekend as well. It will become super slow though; I'm guessing around 1MB/s.
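
Something along these lines, as a rough sketch only -- the function names, context type and error handling here are placeholders, not the actual device3380.c API:

```c
// Sketch: split one DMA read into 0x40-byte chunks so each chunk maps to a
// single small PCIe read request. Device3380_ReadDMA_Chunk() is a placeholder
// for whatever single-chunk read primitive the real code would use.
// Windows-style types (BOOL/DWORD/QWORD/PBYTE) as used elsewhere in pcileech.
#define CHUNK_SIZE 0x40

BOOL Device3380_ReadDMA_Small(PDEVICE_CONTEXT ctx, QWORD qwAddr, PBYTE pb, DWORD cb)
{
    DWORD o, cbChunk;
    for(o = 0; o < cb; o += CHUNK_SIZE) {
        cbChunk = (cb - o < CHUNK_SIZE) ? (cb - o) : CHUNK_SIZE;
        if(!Device3380_ReadDMA_Chunk(ctx, qwAddr + o, pb + o, cbChunk)) {
            return FALSE;   // give up on the first failed chunk
        }
    }
    return TRUE;
}
```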

The other error is also well known. The USB3380 freezes if it tries to read non-existing memory. If your read encounters non-existing memory in a memory hole or memory-mapped PCIe devices (around 3-4GB usually) it will freeze and stop working until power-cycled. I have not been able to clear the freeze programmatically in any way, unfortunately. I'm assuming your read errors start occurring in high 32-bit memory, around 3-4 GB?
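
One workaround, assuming the hole sits in the usual PCIe/MMIO window, is simply to stop the dump below it, roughly like this (the 0xC0000000 boundary is illustrative only; the real layout is machine-specific):

```
pcileech.exe dump -usb2 -min 0x0 -max 0xC0000000 -out low.raw
```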

dev-zzo commented 6 years ago

I am able to code and build the project from the sources, so please feel free to suggest any potential test cases for me to try out.

On the matter of speed, it is somewhat secondary at the moment, as getting correct memory access is more desirable. The device is of little use if it is fast but broken. :-D

Regarding the hole: the failures actually don't occur that high in memory -- the location seems somewhat random. For example, I was able to read the same range out after restarting the target system without triggering the issue, so I'd assume there are no holes or mapped devices in that memory range, unless they are mapped completely randomly in the middle of RAM, which sounds somewhat unlikely to me.

ufrisk commented 6 years ago

With regards to test cases, testmemread and testmemreadwrite will be a good start. Unless you get there before me, I could code the 0x40 read loop sometime over the weekend.

About the freeze, three things come to mind. Apart from these, I haven't seen the behavior you report before.

1) I had a similar problem when connecting the USB3380 to an M.2 slot via an adapter and a rather long flat cable. The problem went away when using a really short flat cable in that case. I guess the long extension cable simply degraded the connection too much.

2) Some PCs have memory holes, which will freeze the USB3380, at lower locations in memory as well. An Intel NUC Skull Canyon I had showed this problem. I don't think this is your problem since those locations are static ...

3) Are you running Hyper-V/Credential Guard/Device Guard or similar on your target system? Windows will then protect the hypervisor and secure kernel from DMA reads, which will mess with the USB3380 and make it freeze at random locations.

dev-zzo commented 6 years ago

I'll get to patching in arbitrary iosize values then.

I believe the M.2 adapter is to blame here, then, as that is the one I used, albeit with the shortest cable provided with it. It seems those cables are not of high enough quality to actually do the job right.

dev-zzo commented 6 years ago

Here are testing results: https://gist.github.com/dev-zzo/b300ab7807bce8ccb51bae77b8ee020b

I managed to get testmemread working -- it completes 5000 iterations successfully, but not without hiccups.

I see multiple timeouts when executing reads -- I've implemented automatic retry to work around that, but it'd be nice to figure out what causes those. They are marked with !!! in the gist above.
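
The retry is roughly the following, as a sketch only (names are placeholders; the real workaround sits in my local changes):

```c
// Sketch of the automatic-retry workaround wrapped around the read primitive.
// Device3380_ReadDMA() here stands in for whatever read call actually times out.
#define READ_RETRIES 3

BOOL Device3380_ReadDMA_Retry(PDEVICE_CONTEXT ctx, QWORD qwAddr, PBYTE pb, DWORD cb)
{
    DWORD i;
    for(i = 0; i < READ_RETRIES; i++) {
        if(Device3380_ReadDMA(ctx, qwAddr, pb, cb)) {
            return TRUE;
        }
        // timed out -> log it (the !!! markers in the gist) and try again
    }
    return FALSE;
}
```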

Following this success, I've tried to execute memory dumps, with quite interesting results: https://gist.github.com/dev-zzo/e0a74cb55f7d466ffe92eedd0e595990

You can see that each time the read operation times out, the content is duplicated in the resulting dump. Rerunning the test (this time without timeouts) returns content from past the requested end address of 0x25000 (for the expected 0x1C0 bytes) and then continues with the expected data! I am not sure how this can be explained at all, since data past 0x25000 was never requested by the application.

dev-zzo commented 6 years ago

OK, I found the root cause, or at least one of them. I've implemented aborting any outstanding DMA transfer before starting a new one, and this fixed both the timeout issues and the dump corruption. Now I can reliably dump the first 16M of memory with iosize set to 0x1000. Without that explicit setting things still break; this may be due to partial reads from the USB endpoint, as IIRC the FIFO buffer is around 2k or 4k. Will look further into it.
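
Roughly, the change amounts to the following (a sketch only -- the register name, abort bit and helper functions here are placeholders; the real change lives in my device3380.c):

```c
// Sketch of the abort-before-read idea. REG_DMASTAT / DMASTAT_ABORT and the
// helper functions are placeholders; the actual values have to come from the
// USB3380 datasheet and the existing device3380.c register definitions.
BOOL Device3380_ReadDMA(PDEVICE_CONTEXT ctx, QWORD qwAddr, PBYTE pb, DWORD cb)
{
    // Abort any DMA transfer still outstanding from a previous (timed-out)
    // request, so stale data cannot end up in the next read.
    Device3380_CsrWrite(ctx, REG_DMASTAT, DMASTAT_ABORT);

    // ... then set up and start the new transfer exactly as before ...
    return Device3380_ReadDMA_Start(ctx, qwAddr, pb, cb);
}
```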

Yes, with block sizes greater than 0x1000 I get ERROR_SEM_TIMEOUT or ERROR_GEN_FAILURE errors. When the code backs off to 0x1000 automatically from the default of 0x10000, dumping resumes until it has dumped all 16 pages in that region.

ufrisk commented 6 years ago

If I understand things correctly, it starts working flawlessly (no read errors anymore) if you always abort any previously ongoing DMA transfer first, as per https://github.com/dev-zzo/pcileech/blob/master/pcileech/device3380.c#L158 ? No need to lower the transfer size to 0x40-byte chunks?

Also, is it still behaving badly if the transfer size is higher than 0x1000?

If so this would be a very nice thing to add :)

dev-zzo commented 6 years ago

You are correct in your understanding. There is no need to lower the chunk size below 0x1000, and it works flawlessly when abort is called before each DMA transfer. The code you referenced is what I ended up with yesterday. I will also see whether moving the abort to e.g. the open function speeds things up somewhat -- it should, in theory.

Testing with large chunk sizes is still ahead...

I will make proper PRs when the fixes are ready to be integrated, of course.

ufrisk commented 6 years ago

Awesome, looking forward to that PR :) I'll have a bit more time to look into things this weekend as well.

dev-zzo commented 6 years ago

Today I managed to make the dump work against the Lenovo S10-2 laptop I had; this required dropping iosize to 0x800 -- after that the dumps were reproducible. But small iosize values break the page read statistics.

dev-zzo commented 6 years ago

OK, I'm back from a business trip... the PCIe extender set has finally made it to my location. Testing with the longest cable (about 30 cm I think) was all right, surprisingly, so the length might not be a problem.

ufrisk commented 6 years ago

Welcome back :) unfortunately I haven't found my old system that had a similar problem.

If we can make changes to only the device3380 code that don't lower performance too much for non-affected systems, we're good to go :)

Lowering the possible iosize to sub-0x1000 might work as well, but I suspect there would be quite a lot of breakage elsewhere, not just in the statistics. I really have to look into that -- also for the FPGA devices. Using the -device-opt0 parameter might be a way to handle sub-0x1000 reads/writes as well, if it were handled transparently within the device3380 code.
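
Something like the following, as a sketch only (the option-lookup helper and option id are placeholders for whatever the real -device-opt0 plumbing turns out to be):

```c
// Sketch: let -device-opt0 shrink the internal DMA chunk size while the rest
// of PCILeech keeps issuing page-sized (0x1000) reads. DeviceGetOption() and
// DEVICE_OPT0 are placeholders, not the real option API.
DWORD cbChunk = 0x1000;                     // default: one page per DMA transfer
QWORD qwOpt0 = 0;
if(DeviceGetOption(ctx, DEVICE_OPT0, &qwOpt0) && qwOpt0 && (qwOpt0 < 0x1000)) {
    cbChunk = (DWORD)qwOpt0;                // e.g. 0x800 or 0x40 on affected systems
}
// Device3380_ReadDMA() would then loop internally in cbChunk-sized pieces,
// so callers still see whole-page transfers and the statistics stay per page.
```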