Open mame82 opened 7 years ago
@RoganDawes I thought opening a new issue is a good idea before going too far off topic. I don't have much time for coding right now, but I wanted to test the idea discussed.
So here are the new descriptors for the two Composite HID functions:
Input to host:
0x06, 0x00, 0xFF, // Usage Page (Vendor Defined 0xFF00)
0x09, 0x01, // Usage (0x01)
0xA1, 0x01, // Collection (Application)
0x09, 0x01, // Usage (0x01)
0x15, 0x00, // Logical Minimum (0)
0x26, 0xFF, 0x00, // Logical Maximum (255)
0x75, 0x08, // Report Size (8)
0x95, 0x40, // Report Count (64)
0x81, 0x02, // Input (Data,Var,Abs,No Wrap,Linear,Preferred State,No Null Position)
0xC0, // End Collection
Output from host:
0x06, 0x00, 0xFF, // Usage Page (Vendor Defined 0xFF00)
0x09, 0x01, // Usage (0x01)
0xA1, 0x01, // Collection (Application)
0x09, 0x02, // Usage (0x02)
0x15, 0x00, // Logical Minimum (0)
0x26, 0xFF, 0x00, // Logical Maximum (255)
0x75, 0x08, // Report Size (8)
0x95, 0x40, // Report Count (64)
0x91, 0x02, // Output (Data,Var,Abs,No Wrap,Linear,Preferred State,No Null Position,Non-volatile)
0xC0, // End Collection
Not sure if using a collection is still necessary, and if the usage on the second descriptor really needs to be changed to two, but they should work.
And here comes the first problem, thanks to PowerShell: I need to find a way to distinguish between the in and the out interface, and I'm not sure if this is possible with WMI. Test code:
function GetDevicePath($USB_VID, $USB_PID)
{
    $HIDGuid = "{4d1e55b2-f16f-11cf-88cb-001111000030}"
    foreach ($wmidev in gwmi Win32_USBControllerDevice | %{ [wmi]($_.Dependent) }) {
        #[System.Console]::WriteLine($wmidev.PNPClass)
        if ($wmidev.DeviceID -match ("$USB_VID" + '&PID_' + "$USB_PID") -and $wmidev.DeviceID -match ('HID') -and -not $wmidev.Service) {
            $devpath = "\\?\" + $wmidev.PNPDeviceID.Replace('\','#') + "#" + $HIDGuid
            "Matching device found $wmidev"
        }
    }
    #$devpath
}

$USB_VID = "1D6B"
$USB_PID = "fdde" # full duplex device ;-)
GetDevicePath $USB_VID $USB_PID
Result:
Matching device found \\WUNDERLAND-PC\root\cimv2:Win32_PnPEntity.DeviceID="HID\\VID_1D6B&PID_FDDE&MI_03\\8&B609427&0&0000"
Matching device found \\WUNDERLAND-PC\root\cimv2:Win32_PnPEntity.DeviceID="HID\\VID_1D6B&PID_FDDE&MI_04\\8&2F37D1E9&0&0000"
Please ignore my nice hostname ;-)
If this were Linux, I guess I would be able to check which device file is readable and which is writable. As this unfortunately is Windows, I have to provide this information to CreateFile. I'm going to check the Win32_USBControllerDevice attributes for useful information on this tomorrow. Worst case: using the HidD* Win32 methods for enumeration would be needed.
Let me know if you have any ideas on this
I honestly don't think it is necessary to have two raw hid interfaces, although technically it may be possible to double your throughput as a result. I think the real problems are shitty powershell and lack of "streaming". If you get 1000 packets per second, each packet has to go within 1 ms of each other. However, I measured latencies of 10-20 ms just changing from writing to reading in Powershell, which kills your throughput right there.
Making the protocol less chatty, i.e. having the sender continue to send until the receiver indicates to slow down seems like the way to go!
Hi @RoganDawes
As promised, I've done some tests on synchronous transfers using two separate interfaces.
technically it may be possible to double your throughput as a result
You're absolutely right on this.
I've done 4 tests from PowerShell:
1) Writing out 1000 64-byte reports on a dedicated HID out interface. Result: about 8 seconds (= 8 KByte/s)
2) Writing out 1000 64-byte reports on a dedicated HID out interface, echoing them back and reading them from a dedicated HID in interface via a separate thread. Result: again about 8 seconds (= 8 KByte/s); reading back input data while writing data out has no speed impact
Tests 3 and 4 have been the same, but I was hoping that the FileStream could write up to 8 concurrent reports, as it is created with FILE_FLAG_OVERLAPPED. The results are still disappointing: transferring 64 KByte takes about 100 ms less time. Here's the test output of tests 3 and 4:
Path: \\?\hid#vid_1d6b&pid_fdde&mi_02#8&2324206c&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_fdde&mi_03#8&b609427&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 0
Path: \\?\hid#vid_1d6b&pid_fdde&mi_04#8&2f37d1e9&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 0, Output: 65
Writing 1000 reports with synchronous 'Write'
Hello World
Hello World
.. snip ... (1000 Hello World from output thread, echoed back by bash via cat /dev/hidg2 > /dev/hidg1)
Hello World
HID out thread finfished, time taken 8,1890945 seconds
Writing 1000 reports with async 'BeginWrite', 8 concurrent writes
Hello World
.. snip ... (1000 Hello World from output thread, echoed back by bash via cat /dev/hidg2 > /dev/hidg1)
Hello World
HID concurrent output thread finfished, time taken 7,9576403 seconds
Killing remaining threads
Hello World
Godbye
To sum up: it seems the FileStream methods of .NET aren't able to reach the maximum transfer rate (1000 64-byte reports per second on USB 2.0), no matter how hard I try. So I give up on synchronous transfer, as the benefit is low while the effort is high (given that both of us have working implementations with multiple protocol layers).
i.e. having the sender continue to send until the receiver indicates to slow down
Considering my tests, I doubt that there would be a speed increase with this (at least not for P4wnP1, as HID communication always runs at maximum speed, while the upper thread-based layers work on demand). Here's the output code, which has no console IO or array creation overhead, but reaches 8 KB/s max:
$outbytes = New-Object Byte[] (65)
$msg=[system.Text.Encoding]::ASCII.GetBytes("Hello World")
for ($i=0; $i -lt $msg.Length; $i++) { $outbytes[$i + 1] = $msg[$i] }
for ($i=0; $i -lt 1000; $i++)
{
$HIDout.Write($outbytes,0,65)
}
And here's my test script; use it as you need to. I solved the problem of enumerating device interfaces based on HID report descriptors, which took a ton of shitty C# code. This is another reason to leave this path. The only useful thing about this code is that I'm able to spot my composite device based on the serial + manufacturer string, which isn't possible with WMI enumeration (the strings for the interface drivers are different). That's nice because, as said, I often change VID/PID, but again, creating a temporary C# file for inline compilation renders this useless.
So I guess I'm stuck at ~4 KByte/s maximum synchronous transfer, or could achieve ~8 KByte/s at the cost of shitty .NET code (while consuming additional USB EPs). Maybe I'll use dedicated input/output reports for faster file transfer later on, and I'll still be slower than my first modem ;-).
Best regards and thanks for the exchange on this.
P.S. Excuse typos, my spellcheck is fighting against English language
Hi,
It is certainly NOT the case that Windows/Powershell cannot achieve higher than what you are currently able to get. I have been able to get up to 48kBytes/second using powershell code (however, it was unreliable/inconsistent, and didn't include all the "refinements" required for multiple concurrent connections).
It did maintain that rate for at least several seconds, so I don't think that this is an unreachable goal.
Rogan
Could you provide a snippet of code reaching this rate? The snippet I provided above has no (obvious) room for improvement (it writes data out in a minimal loop).
Here is some real sample code, and some numbers to go with it:
On Windows:
speedtest.ps1
$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
public class w {
[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);
public static FileStream o(string fn) {
return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
}
}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs
& {
$devs = gwmi Win32_USBControllerDevice
foreach ($dev in $devs) {
$wmidev = [wmi]$dev.Dependent
if ($wmidev.GetPropertyValue('DeviceID') -match ('1209&PID_6667') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
}
}
try {
$f = [n.w]::o($fn)
$d = New-Object IO.MemoryStream
$c = 0
$b = New-Object Byte[]($M+1)
$sw = $null
while($c -lt 1024 * 1024) {
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b,1, $M)
$c += $M
}
$sw.Stop()
$sw.Elapsed
$d.Length
([Text.Encoding]::ASCII).GetString($d.ToArray())
} catch {
echo $_.Exception|format-list -force
}
exit
}
This waits for the first successful read from the RAW HID interface and starts the stopwatch, then exits after 16384 iterations (1024*1024/64==16384)
I run it like so:
> powershell -exec bypass -file speedtest.ps1 > log.txt
On the Pi:
# time seq -f "%063g" 1 17000 > /dev/hidg1
This writes the numbers 1-17000, formatted into a 63 character zero-padded string (with a CR added to make 64 bytes) to the hid interface. I run longer than 16384 to account for any packets getting lost.
The results (with ^M's removed):
Days : 0
Hours : 0
Minutes : 0
Seconds : 16
Milliseconds : 402
Ticks : 164026490
TotalDays : 0.000189845474537037
TotalHours : 0.00455629138888889
TotalMinutes : 0.273377483333333
TotalSeconds : 16.402649
TotalMilliseconds : 16402.649
1048576
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
...
000000000000000000000000000000000000000000000000000000000016383
000000000000000000000000000000000000000000000000000000000016384
In other words, 1MB successfully transferred (with no lost packets, since the last number is indeed 16384) in 16.4 seconds, a total rate of 63 937 bytes/second.
What gets interesting is adding a Write statement into the powershell after $d.Write(), that writes the received bytes back to the device, then updating the command on the pi to:
time seq -f "%063g" 1 16384 | socat - /dev/hidg1 > t
In theory, seq would generate 16384 lines, the powershell would echo them one at a time, and "t" would contain those 16384 lines.
In practice, I ended up with only around 2000 lines in "t", and the powershell waiting to receive its full 1MB. I even upped the number of lines to 65536, and the powershell still didn't terminate, indicating that even though I sent 4 times more data than required, it still didn't successfully receive a full 1MB :-(
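For reference, the echo modification described above amounts to a sketch like this inside the read loop of speedtest.ps1 (assumed placement, not the exact code used):
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b, 1, $M)
$f.Write($b, 0, $M+1)   # echo the received report straight back to the device
$c += $M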
So, this seems to be a fairly fundamental limitation of USB raw hid (perhaps only on Windows, and perhaps only with Powershell. More testing required!)
In fact, this might justify using two endpoints, one for reading, and one for writing, which I otherwise thought was a bad idea. ;-)
And this code, while not beautiful by any means, gets reasonable performance, while not losing any data!
$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
public class w {
[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);
public static FileStream o(string fn) {
return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
}
}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs
$readloop = {
Param($M, $f, $q)
try {
$d = New-Object IO.MemoryStream
$c = 0
$sw = $null
while($c -lt 1024*1024) {
$b = New-Object Byte[]($M+1)
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b,1, $M)
$c += $M
# [System.Threading.Monitor]::Enter($q)
$q.Enqueue($b)
# [System.Threading.Monitor]::Pulse($q)
# [System.Threading.Monitor]::Exit($q)
}
$sw.Stop()
$sw.Elapsed
$d.Length
([Text.Encoding]::ASCII).GetString($d.ToArray())
} catch {
$_.Exception|format-list -force
}
exit
}
$writeloop = {
Param($M, $f, $q)
try {
while ($true) {
# [System.Threading.Monitor]::Enter($q)
# [System.Threading.Monitor]::Wait($q)
[System.Console]::Write("!")
if ($q.Count -gt 0) {
[System.Console]::WriteLine($q.Count)
while ($q.Count -gt 0) {
$b = $q.Dequeue()
$f.Write($b, 0, $M+1)
}
}
Start-Sleep -m 10
# [System.Threading.Monitor]::Exit($q)
}
} catch {
[System.Console]::WriteLine("Write Thread Done!")
$_.Exception
}
exit
}
$Q = New-Object System.Collections.Queue
$Q = [System.Collections.Queue]::Synchronized($Q)
$devs = gwmi Win32_USBControllerDevice
foreach ($dev in $devs) {
$wmidev = [wmi]$dev.Dependent
if ($wmidev.GetPropertyValue('DeviceID') -match ('1209&PID_6667') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
}
}
$f = [n.w]::o($fn)
$readThread = [PowerShell]::Create()
[void] $readThread.AddScript($readloop)
[void] $readThread.AddParameter("M", $M)
[void] $readThread.AddParameter("f", $f)
[void] $readThread.AddParameter("q", $q)
$writeThread = [PowerShell]::Create()
[void] $writeThread.AddScript($writeloop)
[void] $writeThread.AddParameter("M", $M)
[void] $writeThread.AddParameter("f", $f)
[void] $writeThread.AddParameter("q", $q)
[System.IAsyncResult]$AsyncReadJobResult = $null
[System.IAsyncResult]$AsyncWriteJobResult = $null
try {
$AsyncWriteJobResult = $writeThread.BeginInvoke()
Sleep 1 # Wait 1 second to give some time for the write thread to be ready
$AsyncReadJobResult = $readThread.BeginInvoke()
Write-Host "Ready"
} catch {
$ErrorMessage = $_.Exception.Message
Write-Host $ErrorMessage
} finally {
if ($readThread -ne $null -and $AsyncReadJobResult -ne $null) {
$readThread.EndInvoke($AsyncReadJobResult)
$readThread.Dispose()
}
if ($writeThread -ne $null -and $AsyncWriteJobResult -ne $null) {
$writeThread.EndInvoke($AsyncWriteJobResult)
$writeThread.Dispose()
}
exit
}
On Linux:
pi@raspberrypi:~ $ time seq -f "%063g" 1 16384 | socat - /dev/hidg1 > t
real 0m40.080s
user 0m1.030s
sys 0m3.360s
pi@raspberrypi:~ $ wc -l t
16384 t
pi@raspberrypi:~ $
So, 40 seconds elapsed, to send 2MB back and forth (1MB in each direction), with no errors or lost packets. That's pretty good, I think!
I was hoping to use the Monitor class to allow one thread to notify the other, but I ended up with a deadlock, where the reader had already added an item and pulsed $q, while the writer had just ended the while loop and was getting around to calling Wait($q). Since the reader had read the last packet, there were no more "pulse"s sent, and the writer waited forever, even though there was actually one last packet in the queue.
I'll test this with the two-endpoint script provided above. It's likely that I missed the transfer interruption which occurred in your test, because I terminated the loops after 1000 reports. Additionally, I have to test the code provided by you and try to work out what causes the transfer rate difference. I'm not able to work on this during the weekend, but will report back on both next week.
Although I haven't had time to fully dive into your code, it seems you could change it: the $q object is already thread-safe, I think (I use it in my stage 2 without manual synchronization, without issues). As you applied a 10 ms sleep in the write loop, you could move this sleep into an else branch of the condition if ($q.Count -gt 0). Thus you would react to enqueued data with at most a 10 ms delay (only if $q.Count has fallen to 0, not if there's continuous data in the queue). After doing this it should work without the Monitor lock (triggering based on the synchronized $q.Count).
I've done this here (lines 233 and 310) but with a 50 ms delay. CPU consumption goes below 1 percent with the sleep. The sleep has no impact on the throughput of continuous data, but again this is on an upper layer.
So I'm going to report back next week...promise
Ignore the last comment. I missed that the inner while loop empties the queue before the sleep is called. You've been looking for a way to avoid polling the queue count, via Monitor, as far as I understand.
Although I should do other things right now, I couldn't stop thinking about your examples. So I tested the following code (I only changed the PID/VID and added an early out to the WMI device enumeration):
$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
public class w {
[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);
public static FileStream o(string fn) {
return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
}
}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs
& {
$devs = gwmi Win32_USBControllerDevice
foreach ($dev in $devs) {
$wmidev = [wmi]$dev.Dependent
if ($wmidev.GetPropertyValue('DeviceID') -match ('1D6B&PID_0137') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
break # second dev string invalid handle
}
}
try {
$f = [n.w]::o($fn)
$d = New-Object IO.MemoryStream
$c = 0
$b = New-Object Byte[]($M+1)
$sw = $null
while($c -lt 1024 * 1024) {
#[Console]::WriteLine("$c")
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b,1, $M)
$c += $M
}
$sw.Stop()
$sw.Elapsed
$d.Length
([Text.Encoding]::ASCII).GetString($d.ToArray())
} catch {
echo $_.Exception|format-list -force
}
exit
}
On my first attempt I thought it wasn't working (that's why I added the printout of $c). Leaving the code running for more than a minute, I realized that my issue isn't PowerShell:
PS D:\P4wnP1> D:\P4wnP1\powershell\tests\fastread_usabuse.ps1
Days : 0
Hours : 0
Minutes : 2
Seconds : 11
Milliseconds : 64
Ticks : 1310641048
TotalDays : 0,00151694565740741
TotalHours : 0,0364066957777778
TotalMinutes : 2,18440174666667
TotalSeconds : 131,0641048
TotalMilliseconds : 131064,1048
1048576
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
... snip ...
000000000000000000000000000000000000000000000000000000000016381
000000000000000000000000000000000000000000000000000000000016382
000000000000000000000000000000000000000000000000000000000016383
I'm still stuck at ~7.8 KBytes/s.
I'm using usb_f_hid.ko with libcomposite.ko, which is needed for the P4wnP1 features. Could it be that you're using g_hid.ko?
I don't get it - why is my USB communication so slow :-(
Okay, you're using the same.
So I'm not sure where to go now; I have to think about what causes the speed drop.
I'm exactly at 1/8th of your speed (8 ms per report read/write). This explains my 3.2 KBytes/s upper bound: I use alternating read/write, which consumes 16 ms per 64-byte fragment, leaving me at an effective maximum of 4 KBytes/s for one direction. This drops to 3.2 KBytes/s because of my naive implementation of fragment reassembly.
Could we compare Raspberry Pi specs:
root@p4wnp1:~# ls /sys/class/udc
20980000.usb
root@p4wnp1:~# uname -r
4.4.50+
root@p4wnp1:~# lsmod | grep hid
usb_f_hid 10837 6
libcomposite 49479 15 usb_f_ecm,usb_f_hid,usb_f_rndis
I hope the UDC is always the same ... still don't get it
I've found the root cause (wouldn't be able to sleep otherwise). New results:
PS C:\Users\XMG-U705> D:\P4wnP1\powershell\tests\fastread_usabuse.ps1
Days : 0
Hours : 0
Minutes : 0
Seconds : 16
Milliseconds : 400
Ticks : 164008304
TotalDays : 0,000189824425925926
TotalHours : 0,00455578622222222
TotalMinutes : 0,273347173333333
TotalSeconds : 16,4008304
TotalMilliseconds : 16400,8304
1048576
000000000000000000000000000000000000000000000000000000000000001
... snip ...
000000000000000000000000000000000000000000000000000000000016383
000000000000000000000000000000000000000000000000000000000016384
It was a really silly mistake. I used this code for my HID device:
# create RAW HID function
# =======================================================
if $USE_RAWHID; then
mkdir -p functions/hid.g2
echo 1 > functions/hid.g2/protocol
echo 1 > functions/hid.g2/subclass
echo 8 > functions/hid.g2/report_length
cat $wdir/conf/raw_report_desc > functions/hid.g2/report_desc
fi
The report length I used was exactly 1/8th of the size it should be (8 instead of 64; a copy-paste leftover from the HID keyboard code).
So I'm at the same speed as you now. I have a minor issue in my code, because I reassemble the reports by manually creating new static arrays. This has a huge performance impact on large transfers (a static array with size = old_size + 64 is created before concatenating the new data) and hinders me in doing meaningful transfer tests. I'll do some new speed tests on file transfer after changing the report assembler to work with a MemoryStream instead.
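As a rough illustration of that change (a minimal sketch, assuming the blocking $HIDin FileStream and 65-byte reports from the test scripts above), the reassembler could simply append each payload to a MemoryStream:
$assembled = New-Object IO.MemoryStream
$report = New-Object Byte[] (65)
for ($i = 0; $i -lt 1000; $i++)
{
    $cr = $HIDin.Read($report, 0, 65)   # blocking read of one report
    $assembled.Write($report, 1, 64)    # skip the report ID byte, append the payload
}
$data = $assembled.ToArray()            # single allocation at the end instead of one per report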
I've already implemented an application-layer function which loads remote files into a dynamically generated PowerShell variable in the host process; this could be used for file transfer testing. I'm going to revisit the full duplex approach with two EPs only if this function works too slowly (less than half of the maximum transfer rate). Anything else should be doable with more code optimization. Continuous alternating read/write seems to be okay for me.
@RoganDawes I want to thank you very much for discussing these points. If you'd like me to do additional tests, ask at any time (but please avoid forcing me to use an AVR ;-)).
Additionally I want to mention a new problem which could affect both of our projects. See here.
Nice work! As you can see, I am implementing the same idea both on AVR/ESP and Linux, so we can continue to collaborate ;-)
For the moment, I'm happy with the default Raspbian kernel, not having run into the problems that you have yet. I'll keep it in mind should I encounter them, though! Thanks!
The continuous alternating read/write effectively halves your throughput, which you can recover by doubling the number of end points. An alternative is to simply reduce the number of acknowledgements required. My thinking is to eliminate the ACK's from my code entirely, and assume that the packet was received unless a NAK packet is received. This would be triggered by an out of sequence sequence number being received (1,2,4, for example). If this happens, the sender could retransmit the missing packet, and continue from that point.
Of course, this means that the sender needs to keep a number of past packets available for retransmission if necessary. And, given that my sequence numbers are only 4 bits, there is a chance that the wrong packet gets retransmitted. Hmmm!
I wonder how well "packing" a number of ACK's into a single packet would work. i.e. Set a "P-ACK" flag, then pack a whole lot of ACK's into the data portion of the packet. At 4 bits, and 60 data bytes, one could pack up to 120 ACKs into a single packet, significantly reducing the number of ACK packets required, and boosting the one-way data rate. Instead of 1:1 (data:ack), you could get 120:1, and the one way traffic would essentially approach the maximum. Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data.
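A rough sketch of the packing idea (my own illustration, assuming 4-bit sequence numbers packed two per byte into a 60-byte data portion):
function Pack-Acks([byte[]]$seqNumbers)
{
    $data = New-Object Byte[] (60)          # data portion of the report
    $idx = 0
    for ($i = 0; $i -lt $seqNumbers.Length -and $i -lt 120; $i += 2)
    {
        $hi = ($seqNumbers[$i] -band 0x0F) * 16   # upper nibble
        $lo = 0
        if ($i + 1 -lt $seqNumbers.Length) { $lo = $seqNumbers[$i + 1] -band 0x0F }
        $data[$idx] = [byte]($hi + $lo)
        $idx++
    }
    ,$data    # up to 120 ACKs packed into 60 bytes
}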
Regardless, implementing the blocking read thread, and updating the rest of the code in Proxy.ps1 should result in significant improvements! Let's see if I can actually achieve 32000 bytes per second?!
I'm not sure why, but I'm at 48KBps with alternating read/write (should be <32KB).
https://www.youtube.com/watch?v=MI8DFlKLHBk&yt:cc=on
Used code is here (stage1_mini.ps1, stage2.ps1 and hidsrv9.py): https://github.com/mame82/P4wnP1/tree/devel/hidtools
I have to recap your suggestions and the remaining possibilities: MemoryStream instead of Byte[]. The performance improvement seems to be influenced by disabling the other USB functions (RNDIS/ECM), and a further speed increase was achieved by using WriteAsync on the MemoryStreams. So for me, the alternating read/write approach seems the way to go.
Maybe the report ID could be used to improve things further (destination port, or in your case the channel, could be moved there). I don't know how much the report descriptor grows if one defines 255 possible report IDs (0 seems to be reserved for "no ID in use").
Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data.
Why? As far as I understand your idea, payload data would be decoupled from logical header data (SYN, ACK, P-ACK, SEQ). The scenario of changing the send direction is like PUSH: instead of sending the P-ACK packet the old sender is waiting for, the new sender sends its data (with an urgent or push flag), and the old sender (which is now the receiver) knows that it has to reassemble an incoming stream before it continues to send its own data (and receive the P-ACK). A peer which wants to send data while still receiving an incoming stream (in the middle of the 120 packets) could decide on its own whether to send its data with the URGENT flag (PUSH) or whether the pending outbound data should be cached in a send queue until the incoming transmission is fully received. If the outbound buffer grows too large due to permanent incoming transmissions, data could be forced out with the PUSH flag set.
But from my understanding, no matter how far you optimize, it ends up as a more optimized HALF-DUPLEX mode.
The alternating read/write I'm using assures that a FULL DUPLEX channel is available at any time (but at half rate ... still not sure why I achieved more than 32 KB/s), as every packet can carry a payload in the direction needed (on demand).
Yes, indeed, you are absolutely right, that there is no need to flush the P-ACKs.
As you have observed, I am modelling my protocol on TCP, so I'm not sure where you get half-duplex from. Using the P-ACK approach, each side can get up to 120/121 of the available bandwidth for its own transmissions if the other side is not transmitting any actual data. If both sides need to transmit, it should balance appropriately depending on the ratios that each side requires, up to a 50:50 split.
What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing.
Of course, this is all very specific to using RAW HID as a transport, using something else like a text-based printer, or CDC or ACM would have less of an issue, I suspect, simply because the available data rate would be that much higher.
up to a 50:50 split
That's why I stated one ends up with half-duplex if both sides are sending (although your idea scales very well if only one side is sending).
What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing.
I'm still trying to get this. I was planning to extend your read-loop example to a test with concurrent but independent read and write on both ends (separate threads). As I still don't get why I could reach a transfer rate >32 KBps, I think it could be possible that the device file allows writing and reading at the same time (shared access), which essentially would mean FULL DUPLEX is possible with a single HID device interface.
I have to delay this test, as it turns out that the WriteAsync method of my MemoryStream objects is only available with .NET 4.5, so I lost backward compatibility with Windows 7, which I want to fix first.
Of course, this is all very specific to using RAW HID as a transport, using something else like a text-based printer, or CDC or ACM would have less of an issue, I suspect, simply because the available data rate would be that much higher.
Sure, but moving away from HID would mean getting louder (triggering endpoint protection on the target). I like the HID approach very much, and using ACM or other communication classes would be too easy ;-).
News:
Replacing WriteAsync on my background MemoryStreams with BeginWrite/EndWrite dropped my transfer rate into the expected range:
End receiving /tmp/test received 1.048.517 Byte in 41,2182 seconds (24,84 KB/s)
Every call to BeginWrite assures that EndWrite is called for the previous write, which I hadn't done like this while using WriteAsync. Inspecting the source of the WriteAsync implementation in MemoryStream.cs, it seems every WriteAsync is done in a separate async task. I'm still not sure how my transfer rate could grow above 32 KB/s with WriteAsync, as I'm waiting for an answer to every packet sent before sending more data. Anyway, I updated my code and will move on to preparing a concurrent read/write test.
Going to ping back with the results.
I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.
I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.
You're right again; the benefit couldn't be measured, but I was wondering why I achieved this high rate with WriteAsync?!
Anyway, meanwhile I've found the answer. Good news... believe it or not: the device file is FULL DUPLEX. You can send data in both directions at about 64 KB/s in parallel.
Test output on the PowerShell side (reading and writing at the same time):
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
16384 reports have been read in 16.5860813 seconds (61.7385132436316 KB/s)
16385 reports have been written in 16.61491 seconds (61.635151800401 KB/s)
Killing remaining threads
Godbye
And the other end (Python; the first read is the trigger to start the threads):
Count 1 reports read in 0.000141143798828 seconds (0.0 KB/s)
Count 16384 reports written in 16.5881521702 seconds (61.730805788 KB/s)
Count 16383 reports read in 16.6130959988 seconds (61.5779262382 KB/s)
Although I didn't measure the overall time, reading and writing have been done concurrently. The difference is that I fully decoupled inbound and outbound data (no echo server).
I haven't implemented any tests for packet loss, but the sending and receiving threads have to have matching report counts in order to allow the threads to terminate. So it is very likely that there is no packet loss. Anyway, packet content has to be checked on both ends (I'm sure I ran into an issue where writing a report before reading the pending input report cleared the input report).
Testing this is easy, as I included routines to print out inbound data on both sides (disabled to reduce the influence on time measurement).
Test code will be up in some minutes.
Next interesting observation: starting only the PowerShell side of the communication (no Python endpoint on the Pi Zero), it turns out that write() is blocking too. If no reports are read on the RPi's end, a maximum of 4 reports can be written before write() blocks.
This means there should never be any packet (= report) loss, which again means there's no need to use ACKs on a per-report basis. One could simply reassemble reports into a bigger stream. No problem on Linux, but PS is still missing something like a FIFO MemoryStream (which in fact could be implemented easily on your side, as you use inline C#).
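A minimal sketch of such a FIFO on the PS side (an assumption of mine, reusing the synchronized Queue from the scripts above, with one report payload per entry):
$fifo = [System.Collections.Queue]::Synchronized((New-Object System.Collections.Queue))
$stream = New-Object IO.MemoryStream

# read thread: enqueue the 64-byte payload of every report it receives
$payload = New-Object Byte[] (64)
$fifo.Enqueue($payload)

# upper layer: drain whatever has arrived so far and reassemble it into the stream
while ($fifo.Count -gt 0) {
    $chunk = [byte[]]$fifo.Dequeue()
    $stream.Write($chunk, 0, $chunk.Length)
}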
Test from PS printing out the report count on write, without a listener running on the RPi:
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
reports written 0
reports written 1
reports written 2
reports written 3
Note: none of the pending reports is lost if the Python side is started some minutes later.
@RoganDawes here are the test files https://github.com/mame82/tests/tree/master/fullduplex
I guess I have to reimplement everything, as alternating read/write is the worst approach.
Have you found a replacement for named pipes to interface with upper protocol layers (some sort of FIFO stream available in .NET 3.5)?
Very interesting results! I guess my implementation was naive, as an echo server, introducing delays!
I never bother reassembling the stream, I simply write the data portion of the packets to their destination as I receive them. So I have no need for a FIFO memory stream at all.
The main issue then is failure to read packets on the Powershell side, resulting in lost data. This is easily seen by introducing a console write in the read loop, I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!
The main issue then is failure to read packets on the Powershell side, resulting in lost data. This is easily seen by introducing a console write in the read loop, I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!
I haven't had packet loss at any time. This seems to be clear now, as write() is blocking if data isn't read on the other end. Read() was blocking, too.
I never bother reassembling the stream, I simply write the data portion of the packets to their destination as I receive them. So I have no need for a FIFO memory stream at all.
I was thinking about a more common interface. You worked with /dev/hidg1 and socat directly, which is the natural way on Linux - I was looking for the same on Windows (a common interface to pipe into, for example, a SOCKS5 proxy). But I guess this was more dreaming than real thinking, at least in the Microsoft world of things ;-)
Very interesting results! I guess my implementation was naive, as an echo server, introducing delays!
This in fact leads to half the transmit rate, as a full report has to be received before something gets written (same as my now absolutely useless read-then-write).
mmm. try putting a:
[System.Console]::Write("R")
in your read loop, and see what happens.
Better yet, try doing a
seq -f "%063g" 1 1000 > /dev/hidg1
on the Pi, when there is nothing running on the other end. If it finishes, the packets got lost . . . .
I had put in this (Console::Write() doesn't work in PS ISE with threads):
# normal script block, should be packed into thread later on
$HIDinThread = {
$hostui.WriteLine("Reading up to $in_count reports, with blocking read")
$inbytes = New-Object Byte[] (65)
$sw = New-Object Diagnostics.Stopwatch
for ($i=0; $i -lt $in_count; $i++)
{
$cr = $HIDin.Read($inbytes,0,65)
if ($i -eq 0) { $sw.Start() }
$utf8 = [System.Text.Encoding]::UTF8.GetString($inbytes)
$hostui.WriteLine($utf8)
}
$sw.Stop()
$timetaken = $sw.Elapsed.TotalSeconds
$KBps = $in_count * 64 / 1024 / $timetaken
$hostui.WriteLine("$in_count reports have been read in $timetaken seconds ($KBps KB/s)")
}
Result:
____________________________________________________________________________________________________________________________________________________________________________________
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Writing 16385 reports with synchronous 'Write'
Reading up to 16384 reports, with blocking read
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
... snip... (no loss)
000000000000000000000000000000000000000000000000000000000000998
000000000000000000000000000000000000000000000000000000000000999
000000000000000000000000000000000000000000000000000000000001000
As said, the receive count was capped by the for loops, which only terminate once exactly the number of packets sent before has been received.
One more note: writing to /dev/hidg1 of course isn't blocking. If no reader is in place on the Windows end, the data is lost.
The only case of report loss I could imagine would be if reports are written on the Linux end (slow RPi) and the Windows end reads back too slowly (unlikely, but possible).
But you're right again: no listener on Windows = data loss.
@RoganDawes I guess the solution is here:
I started the PowerShell threads first, but deployed a delay in the read thread to force packet loss:
# normal script block, should be packed into thread later on
$HIDinThread = {
$hostui.WriteLine("Reading up to $in_count reports, with blocking read")
$inbytes = New-Object Byte[] (65)
$sw = New-Object Diagnostics.Stopwatch
for ($i=0; $i -lt $in_count; $i++)
{
$cr = $HIDin.Read($inbytes,0,65)
if ($i -eq 0) { $sw.Start() }
Start-Sleep -m 100 # try to miss reports
$utf8 = [System.Text.Encoding]::UTF8.GetString($inbytes)
$hostui.WriteLine($utf8)
}
$sw.Stop()
$timetaken = $sw.Elapsed.TotalSeconds
$KBps = $in_count * 64 / 1024 / $timetaken
$hostui.WriteLine("$in_count reports have been read in $timetaken seconds ($KBps KB/s)")
}
Additionally I added console output before the first report is sent from the PS out thread:
for ($i=0; $i -lt $out_count; $i++)
{
    if ($i -eq 0) { $hostui.WriteLine("Sending first report out on send thread") } # output is blocked by the other thread if interacting with $hostui, so this line couldn't be placed exactly
    $HIDout.Write($outbytes,0,65)
    if ($i -eq 0) { $sw.Start() }
    #$hostui.WriteLine("reports written $i") # test how many reports are needed till write() blocks if no receiver is on the other end
}
Starting the PS Process first and running:
seq -f "%063g" 1 100 > /dev/hidg1
I get the following interesting result:
____________________________________________________________________________________________________________________________________________________________________________________
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
Sending first report out on send thread
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000069
000000000000000000000000000000000000000000000000000000000000070
000000000000000000000000000000000000000000000000000000000000071
000000000000000000000000000000000000000000000000000000000000072
000000000000000000000000000000000000000000000000000000000000073
000000000000000000000000000000000000000000000000000000000000074
000000000000000000000000000000000000000000000000000000000000075
000000000000000000000000000000000000000000000000000000000000076
000000000000000000000000000000000000000000000000000000000000077
000000000000000000000000000000000000000000000000000000000000078
000000000000000000000000000000000000000000000000000000000000079
000000000000000000000000000000000000000000000000000000000000080
000000000000000000000000000000000000000000000000000000000000081
000000000000000000000000000000000000000000000000000000000000082
000000000000000000000000000000000000000000000000000000000000083
000000000000000000000000000000000000000000000000000000000000084
000000000000000000000000000000000000000000000000000000000000085
000000000000000000000000000000000000000000000000000000000000086
000000000000000000000000000000000000000000000000000000000000087
000000000000000000000000000000000000000000000000000000000000088
000000000000000000000000000000000000000000000000000000000000089
000000000000000000000000000000000000000000000000000000000000090
000000000000000000000000000000000000000000000000000000000000091
000000000000000000000000000000000000000000000000000000000000092
000000000000000000000000000000000000000000000000000000000000093
000000000000000000000000000000000000000000000000000000000000094
000000000000000000000000000000000000000000000000000000000000095
000000000000000000000000000000000000000000000000000000000000096
000000000000000000000000000000000000000000000000000000000000097
000000000000000000000000000000000000000000000000000000000000098
000000000000000000000000000000000000000000000000000000000000099
000000000000000000000000000000000000000000000000000000000000100
Seems there's no report loss if the first send has taken place from the Windows side.
The last assumption is wrong:
000000000000000000000000000000000000000000000000000000000000081
000000000000000000000000000000000000000000000000000000000000183
000000000000000000000000000000000000000000000000000000000000290
000000000000000000000000000000000000000000000000000000000000393
000000000000000000000000000000000000000000000000000000000000495
So ACKs have to be sent from the Pi to Windows :-(
Assuring that the Windows end reads fast enough couldn't be done reliably otherwise, I guess.
Looks like you lost 68 reports to me?
Looks like you lost 68 reports to me?
Indeed, until the first report was sent from Windows.
But raising the simulated read delay produces more report loss (so the assumption that a first send from Windows is needed was wrong).
So again I struggle on Windows, as I know Linux isn't able to send if nothing is received (remember the crash on the unresponsive IRQ when sending data to /dev/hidg1 before Windows is able to read).
So one has to assure one read per millisecond on Windows, or use ACKs from the RPi to Windows.
Seems your P-ACK idea is the best way to do this; you're right again.
So, as discussed, let Windows be the first to send a packet. Once the Linux side has received a packet, you know that the Windows side is ready to receive, and communications can begin.
Alternatively, you can monitor the dmesg output to see when the relevant USB configuration has been selected by the Windows host to know that the endpoint has been "activated". However, it still doesn't mean that the powershell is running yet - this you can "discover" by waiting for the powershell to send you the first packet!
Yes, I'm already doing this, both in the production code and in the full duplex example provided above. Anyway, I was wrong assuming that reports aren't missed when sending is started from Windows. It is simple... Linux writes to the HID device non-blocking, and reading from the FileStream misses reports if done too slowly. This isn't the case for sending from Windows to Linux. While searching for a solution to replace FileStream.Read() with something that gives access to the underlying buffer (to block writing on the other end if the buffer is full), I stumbled across feature reports. Not only do they not rely on the FileStream, they are handled with control transfers instead of interrupt transfers. So if I'm right, the 1000 reports per second boundary doesn't apply to feature reports.
I'm thinking about new tests, changing the underlying mechanics away from input/output reports, but I'm running out of time.
Well, go with the 32kBps option in the meantime, version 2 can get higher speed ;-)
Thumbs up for this comment... I'm already suffering from tunnel vision, trying to optimize low-level HID communications while losing focus on other things I wanted to implement in P4wnP1. But it doesn't get boring; another funny thing is that I'm faking RNDIS to run at 20 GBit/s, which involves different issues. If you're interested in this, here's the link https://github.com/mame82/ratepatch (applies to Raspbian with kernel 4.4.50+).
Not surprisingly, I'm still thinking about the report loss occurring when writing to /dev/hidg on Linux and reading back from PowerShell too slowly.
So please excuse the next large paste of test output. I observed the following: running the read loop on Windows with a 500 ms delay, report loss is assured. I started such a read loop in PS and sent large chunks of output reports from Linux. The report content is "Number xxx" - xxx represents the report number written. The read loop prints the report content and a number representing the count of the read loop.
New observation: if sending is aborted on the Linux side, the last 32 reports can be read back (with the 500 ms delay) without any loss. This means the FileStream is backed by a 2048-byte buffer. If it were possible to access this buffer directly from PowerShell, a notification could be sent back to block writing on the other side (including the last seq number received). Unfortunately, I haven't found a way to access the underlying buffer... FileStream.Position and FileStream.Length are both unset.
So if you're going to implement your P-ACK idea, it seems 32 reports is the magic number to track and send ACKs for. So your sequence number misses exactly one bit to cope with that (16 values from 4 bits vs. the 32-report buffer).
Here's the test output showing the described behaviour. The parts where large amounts of reports are missed have been caused by unlimited sending. The parts with continuous report reception are the result of manually aborting sending from the Linux side (the last 32 reports are read back despite the 500 ms delay).
Number 0
0
Number 327
1
Number 594
2
Number 595
3
Number 596
4
Number 597
5
Number 598
6
Number 599
7
Number 600
8
Number 601
9
Number 602
10
Number 603
11
Number 604
12
Number 605
13
Number 606
14
Number 607
15
Number 239
16
Number 364
17
Number 365
18
Number 366
19
Number 367
20
Number 368
21
Number 369
22
Number 370
23
Number 371
24
Number 372
25
Number 373
26
Number 374
27
Number 375
28
Number 376
29
Number 377
30
Number 378
31
Number 379
32
Number 380
33
Number 381
34
Number 382
35
Number 383
36
Number 384
37
Number 385
38
Number 386
39
Number 387
40
Number 388
41
Number 389
42
Number 390
43
Number 391
44
Number 392
45
Number 393
46
Number 394
47
Number 395
48
Number 0
49
Number 306
50
Number 651
51
Number 994
52
Number 1045
53
Number 1046
54
Number 1047
55
Number 1048
56
Number 1049
57
Number 1050
58
Number 1051
59
Number 1052
60
Number 1053
61
Number 1054
62
Number 1055
63
Number 1056
64
Number 1057
65
Number 1058
66
Number 1059
67
Number 1060
68
Number 1061
69
Number 1062
70
Number 1063
71
Number 1064
72
Number 1065
73
Number 1066
74
Number 1067
75
Number 1068
76
Number 1069
77
Number 1070
78
Number 1071
79
Number 1072
80
Number 1073
81
Number 1074
82
Number 1075
83
Number 1076
84
Number 0
85
Number 1
86
Number 2
87
Number 3
88
Number 4
89
Number 5
90
Number 6
91
Number 7
92
Number 8
93
Number 9
94
Number 10
95
Number 11
96
Number 12
97
Number 13
98
Number 14
99
Number 15
100
Number 16
101
Number 17
102
Number 18
103
Number 19
104
Number 20
105
Number 21
106
Number 22
107
Number 23
108
Number 24
109
Number 25
110
Number 26
111
Number 27
112
Number 28
113
Number 29
114
Number 30
115
Number 31
116
Number 0
117
Number 127
118
Number 128
119
Number 129
120
Number 130
121
Number 131
122
Number 132
123
Number 133
124
Number 134
125
Number 135
126
Number 136
127
Number 137
128
Number 138
129
Number 139
130
Number 140
131
Number 141
132
Number 142
133
Number 143
134
Number 144
135
Number 145
136
Number 146
137
Number 147
138
Number 148
139
Number 149
140
Number 150
141
Number 151
142
Number 152
143
Number 153
144
Number 154
145
Number 155
146
Number 156
147
Number 157
148
Number 158
149
Here's the example output
Interesting! So, if I limited my "packets in flight without ACK" to 16 (max of my sequence numbers), I could be sure that there would be no packet loss. Funnily enough, I instrumented my "echo loop" to indicate how many reports there were in the queue at the beginning of the while loop. Not once did I get more than 16 reports, with a 1ms sleep once the queue was drained.
The unfortunate part is that my sequence numbers are per "connection", of which I can have up to 255 at once (in theory). So I'd have to track the unacknowledged packets at a different level. Which unfortunately, is a bit of a layering violation, I think.
I think the "solution" is going to be making sure that the read loop just reads as fast as possible, and if any packets are observed to be missing, to send a RST on that channel, and let it start again. Not particularly robust, but should work, I hope!
FWIW, by simply substituting the $device.BeginRead/EndRead pairs with dequeueing packets from the readloop/queue, I managed to get 13kBps throughput with a cmd.exe doing "dir /s". Strangely, when writing the packets out to a socket, the throughput dropped to about 8kBps.
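The substitution described here is roughly this (a sketch of mine, assuming the synchronized $Q filled by the read loop from the earlier script):
while ($true) {
    if ($Q.Count -gt 0) {
        $report = [byte[]]$Q.Dequeue()
        # ... handle the 64-byte data portion of $report for the matching channel ...
    } else {
        Start-Sleep -m 1   # avoid a busy loop while the read thread fills the queue
    }
}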
@RoganDawes
While starting to implement a new lower-layer communication scheme based on our observations, I put some comments (design ideas) into the source of the concurrent read/write test case (the one with 64000 bytes/s full duplex on a single device file), to avoid report loss.
As this isn't implemented in 5 minutes, I'd like to kindly ask you to review these comments before I start coding (and maybe throw it all away in the end).
Idea (from the Linux point of view = USB device, not host):
# Writing out reports doesn't mean that the receiver is able to read
# them back (if it reads too slowly, writing from this side isn't blocking)
#
# If the receiver is Windows via a FileStream object, it was observed that
# exactly 2048 bytes = 32 reports are cached in a ring buffer, which gets
# overwritten if more reports are sent before they are read back
#
# To assure every report is read, report loss detection is applied
# only for reports written to the host (HID input reports), as it has been
# observed that reports read from the host (OUTPUT reports) don't get lost
# (the write call to the HID device FileStream blocks after writing 4 reports
# without reading them back on this end)
#
# So outgoing sequence numbers are deployed, reaching from 0 to 31 to
# match the FileStream Buffer on windows.
# Outgoing report format is (INPUT REPORT for host):
# 0: length (effective payload length in report, excluding header)
# 1: seq (outgoing sequence number)
# 2: src (like source port, but 0..255 - should maybe moved to an upper layer)
# 3: dst (like destination port, but 0..255 - should maybe moved to an upper layer)
# 4..63: payload, padded with zeroes if needed
# Note: report ID isn't needed at gadget side, thus report size is 64 bytes
# Incoming report format is (OUTPUT REPORT from host):
# 0: length (effective payload length in report, excluding header)
# 1: ack (acknowledge number, holding the last SEQ number the host has read back)
# 2: src (like source port, but 0..255 - should maybe moved to an upper layer)
# 3: dst (like destination port, but 0..255 - should maybe moved to an upper layer)
# 4..63: payload, padded with zeroes if needed
# Note: report ID isn't needed at gadget side, thus report size is 64 bytes
# Packets are constantly read and written. This could be seen as a carrier.
# Both peers are able to send data at any time by putting it into the report payload (length > 0).
# If a peer has no data to send, an empty report (length = 0) is sent anyway, to assure
# continuous delivery of SEQ and ACK numbers.
#
# The USB client (this device) is only allowed to send up to 32 reports (MAX_SEND) without
# receiving an ACK. These up to 32 reports are cached in an outbound queue, to allow resending if needed.
# This allows handling of the following error cases:
# Error case 1:
# The last ACK received with an OUTPUT REPORT isn't the next one awaited. As no scheduling or
# prioritization functionality is introduced in this protocol layer, received ACKs always occur in a
# sequential manner (ACKs are never lost, as host-to-device communication is safe in terms of report loss).
# Examples for valid ACK sequences:
# 0, 1, 2, 3 ...
# 30, 31, 0, 1 ...
# Example for invalid ACK sequences:
# 0, 1, 2, 5
# 31, 1
# Note: The corner case that exactly 33 reports (or n%32+1 reports) are missed would lead to an ACK
# sequence like this: 29, 30 ... miss 33 ... , 2. This in fact can never happen, as writing reports
# without a received ACK is blocked once the outbound queue grows to 32.
#
# Error case 1 detection:
# awaited_ACK = (last_ACK + 1) % 32 # modulo could be replaced with: if (awaited_ACK > 31) awaited_ACK -= 32
# received_ACK != awaited_ACK
#
# Error case 1 cause:
# The USB host (receiver of input report) missed all reports sent, starting from the last valid
# ACK received
# Example:
# last_valid_ACK = 10
# received_ACK = 2
# last_seq_sent = 8
#
# Input reports have been sent to the host, up to SEQ number 8.
# The host received valid reports up to last_valid_ACK=10 (last valid SEQ number seen by the receiving host was 10).
# The next report received by the host is received_ACK=2 (last SEQ number seen by the receiving host was 2, the host
# is already aware of the fact that this wasn't the right SEQ number, and ignores these packets).
#
# At this point, it is obvious that the receiving host missed the following reports:
# 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 0, 1 (last_valid_ACK + 1 ... received_ACK - 1)
# At this point it isn't known whether the host missed reports 3 to 8, which have already been sent. Thus it is declared
# that THE RECEIVER HAS TO IGNORE REPORTS WITH AN OUT-OF-BOUND SEQ NUMBER. (Another approach would be to allow the receiver to cache out-of-bound
# reports, in order to only resend reports which have been lost. This would come at the cost of additional logic, and resending "old"
# reports would cause out-of-bound sequence numbers itself, resulting in more complex error detection)
#
# To cope with that report loss, reports 11 ... 1 have to be sent, followed by reports 2 ... 10 as these have been ignored by the receiver.
#
# This isn't the most efficient approach, as in the worst case missing a single report on the receiver side could lead to resending up to
# 30 reports which had already been sent (and ignored by the receiver). But as the first missing report is always sent first and ACKs
# are received from a parallel thread, this comes down to a "1 report sent / 1 report acknowledged" case (hopefully; has to be tested)
#
# Error case 1, action to take:
# If out-of-bound ACK is received, all reports from last_valid_ACK+1 to last_seq_sent have to be retransmitted by the sender (USB device).
# If the receiver (USB host) recognizes an out-of-bound SEQ number, the report is ignored
#
# Additional note:
# The design ideas described apply to the write thread of the USB device (writing INPUT reports) and the read thread of the USB host (reading INPUT reports).
# Anyway, the ACKs are carried in output reports. As the read and write loops run in independent threads on both peers, states like "last_valid_ACK"
# and "last_SEQ_sent" are kept in synchronized global state objects, shared between the read and write thread of a peer.
# Otherwise the read and write threads are decoupled and independent; there's nothing like write-after-read (to hopefully achieve maximum throughput, at least
# for output reports). This again means that if a SEQ number is sent, it could take several incoming packets until the ACK is received (up to 32 before
# sending from device to host is blocked).
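A hypothetical sketch of the "error case 1" detection described above (names and structure are mine, not the actual implementation):
$MAX_SEND = 32

function Test-AckInSequence([int]$lastAck, [int]$receivedAck)
{
    # the only valid next ACK is last_ACK + 1, modulo the 32-report window
    $awaitedAck = ($lastAck + 1) % $MAX_SEND
    return ($receivedAck -eq $awaitedAck)
}

# examples: 30, 31, 0, 1 ... is valid; 31 followed by 1 signals report loss
Test-AckInSequence 31 0   # True
Test-AckInSequence 31 1   # False -> resend everything from SEQ 0 up to last_seq_sent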
Good news: I implemented FULL DUPLEX similar to the suggestion above, with some improvements.
Results:
PS D:\P4wnP1\powershell> D:\P4wnP1\powershell\fullduplex\fullduplex4.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#7&27da95e8&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Starting thread to continuously read HID input reports
Starting write loop continously sending HID ouput report
Global seq number readen 31
MainThread: received report: port nr. 0
MainThread: received report: port nr. 1
MainThread: received report: port nr. 2
... snip ... (no report loss in between)
MainThread: received report: port nr. 17405
MainThread: received report: port nr. 17406
MainThread: received report: port nr. 17407
Total time in seconds 22.2611788
Throughput in 45731,6303483444 bytes/s netto payload date (excluding report loss and resends)
Throughput out in the same time 53357,2822298161 bytes/s netto output (19158 reports)
Killing remaining threads
Godbye
So I'm at ~45500 bytes/s from the Pi to PowerShell (real net data, without protocol headers) and at ~53000 bytes/s from PowerShell to the Pi (concurrent full duplex read/write on a single HID device file).
Report loss detection (including blocking once the output buffer reaches 32 reports which haven't been read, and resending of unacknowledged reports) is only done from the Pi to Windows (HID input reports), as we know that in the other direction writes are blocked if no data is read back (the assumption still holds true in all tests: max. 4 report writes to the FileStream without a read on the Linux end).
Protocol overhead is reduced to 2 header bytes on the "link layer", so the payload size is 62 bytes per report.
If you're interested in the work-in-progress code, ping back.
Btw, I decided to interface to the upper layers with synchronized input/output queues consuming/holding pure reports... this still isn't fully implemented. Fragmentation/defragmentation of larger streams is going to be handled in upper layers (based on a FIN bit in reports). The DST/SOURCE fields (or channel, in your case) will be moved to upper layers too. I don't need this information on the link layer anymore, the reason being that the endpoints of this layer are well defined and pre-known: USB host <--> USB device (or PowerShell client <--> Python server).
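For illustration, a hypothetical helper matching that reduced 2-byte link-layer header (byte 0 = payload length, byte 1 = SEQ/ACK, up to 62 payload bytes; the extra leading byte is the report ID needed on the Windows FileStream side; names are mine, not the P4wnP1 code):
function New-LinkReport([byte]$seq, [byte[]]$payload)
{
    if ($payload.Length -gt 62) { throw "payload must be 62 bytes or less" }
    $report = New-Object Byte[] (65)       # report ID byte + 64-byte report
    $report[1] = [byte]$payload.Length     # link-layer header byte 0: payload length
    $report[2] = $seq                      # link-layer header byte 1: SEQ (or ACK) number
    [Array]::Copy($payload, 0, $report, 3, $payload.Length)
    ,$report
}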
Forgot to mention: the measurement was on Win 10 64-bit. On Win 7 32-bit the throughput is far lower (going to test tomorrow; the code is PS 2.0 and .NET 3.5 compatible, not sure if this is the bottleneck on Win 7).
I think the basic idea is solid. My approach is to have a single "channel identifier", resulting in a max of 256 (concurrent!) channels, rather than 65536, which I think is reasonable in the circumstances.
Packet format in my case (following some of your ideas above, so not implemented yet) would look like:
I'm still not 100% convinced about acking every packet - i.e. making a continuous stream of comms at full rate, as this will require fairly significant resources to keep up, possibly resulting in suspicious activity on the victim being detected.
The ongoing discussion continues from here.