Open mame82 opened 7 years ago
@RoganDawes I thought opening a new issue is a good idea before going too far off topic. I don't have much time for coding right now, but I wanted to test the idea discussed.
So here are the new descriptors for the two Composite HID functions:
Input to host:
0x06, 0x00, 0xFF, // Usage Page (Vendor Defined 0xFF00)
0x09, 0x01, // Usage (0x01)
0xA1, 0x01, // Collection (Application)
0x09, 0x01, // Usage (0x01)
0x15, 0x00, // Logical Minimum (0)
0x26, 0xFF, 0x00, // Logical Maximum (255)
0x75, 0x08, // Report Size (8)
0x95, 0x40, // Report Count (64)
0x81, 0x02, // Input (Data,Var,Abs,No Wrap,Linear,Preferred State,No Null Position)
0xC0, // End Collection
Output from host:
0x06, 0x00, 0xFF, // Usage Page (Vendor Defined 0xFF00)
0x09, 0x01, // Usage (0x01)
0xA1, 0x01, // Collection (Application)
0x09, 0x02, // Usage (0x02)
0x15, 0x00, // Logical Minimum (0)
0x26, 0xFF, 0x00, // Logical Maximum (255)
0x75, 0x08, // Report Size (8)
0x95, 0x40, // Report Count (64)
0x91, 0x02, // Output (Data,Var,Abs,No Wrap,Linear,Preferred State,No Null Position,Non-volatile)
0xC0, // End Collection
Not sure if using a collection is still necessary, and if the usage on the second descriptor really needs to be changed to two, but they should work.
And here comes the first problem, thanks to PowerShell: I need to find a way to distinguish between the in and the out interface, and I'm not sure if this is possible with WMI. Test code:
function GetDevicePath($USB_VID, $USB_PID)
{
    $HIDGuid = "{4d1e55b2-f16f-11cf-88cb-001111000030}"
    foreach ($wmidev in gwmi Win32_USBControllerDevice | %{ [wmi]($_.Dependent) }) {
        #[System.Console]::WriteLine($wmidev.PNPClass)
        if ($wmidev.DeviceID -match ("$USB_VID" + '&PID_' + "$USB_PID") -and $wmidev.DeviceID -match ('HID') -and -not $wmidev.Service) {
            $devpath = "\\?\" + $wmidev.PNPDeviceID.Replace('\','#') + "#" + $HIDGuid
            "Matching device found $wmidev"
        }
    }
    #$devpath
}

$USB_VID = "1D6B"
$USB_PID = "fdde" # full duplex device ;-)
GetDevicePath $USB_VID $USB_PID
Result:
Matching device found \\WUNDERLAND-PC\root\cimv2:Win32_PnPEntity.DeviceID="HID\\VID_1D6B&PID_FDDE&MI_03\\8&B609427&0&0000"
Matching device found \\WUNDERLAND-PC\root\cimv2:Win32_PnPEntity.DeviceID="HID\\VID_1D6B&PID_FDDE&MI_04\\8&2F37D1E9&0&0000"
Please ignore my nice hostname ;-)
If this were Linux, I guess I would be able to check which device file is readable and which is writable. As this unfortunately is Windows, I have to provide this information to CreateFile. I'm going to check the Win32_USBControllerDevice attributes for useful information on this tomorrow. Worst case: using the HidD* Win32 methods for enumeration would be needed.
Let me know if you have any ideas on this
I honestly don't think it is necessary to have two raw hid interfaces, although technically it may be possible to double your throughput as a result. I think the real problems are shitty powershell and lack of "streaming". If you get 1000 packets per second, each packet has to go within 1 ms of each other. However, I measured latencies of 10-20 ms just changing from writing to reading in Powershell, which kills your throughput right there.
Making the protocol less chatty, i.e. having the sender continue to send until the receiver indicates to slow down seems like the way to go!
Hi @RoganDawes
As promised, I've done some tests on synchronous transfers using two separate interfaces.
technically it may be possible to double your throughput as a result
You're absolutely right on this.
I've done 4 tests from PowerShell:
1) Writing out 1000 64-byte reports on a dedicated HID out interface. Result: about 8 seconds (= 8 KByte/s)
2) Writing out 1000 64-byte reports on a dedicated HID out interface, echoing them back and reading them from a dedicated HID in interface via a separate thread. Result: again about 8 seconds (= 8 KByte/s); reading back input data while writing data out has no speed impact
Tests 3 and 4 have been the same, but I was hoping that the FileStream could write up to 8 concurrent reports, as it is created with FILE_FLAG_OVERLAPPED. The results are still disappointing: transferring 64 KByte takes about 100 ms less time. Here's the test output of tests 3 and 4:
Path: \\?\hid#vid_1d6b&pid_fdde&mi_02#8&2324206c&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_fdde&mi_03#8&b609427&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 0
Path: \\?\hid#vid_1d6b&pid_fdde&mi_04#8&2f37d1e9&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 0, Output: 65
Writing 1000 reports with synchronous 'Write'
Hello World
Hello World
.. snip ... (1000 Hello World from output thread, echoed back by bash via cat /dev/hidg2 > /dev/hidg1)
Hello World
HID out thread finfished, time taken 8,1890945 seconds
Writing 1000 reports with async 'BeginWrite', 8 concurrent writes
Hello World
.. snip ... (1000 Hello World from output thread, echoed back by bash via cat /dev/hidg2 > /dev/hidg1)
Hello World
HID concurrent output thread finfished, time taken 7,9576403 seconds
Killing remaining threads
Hello World
Godbye
To sum up: it seems the FileStream methods of .NET aren't able to reach the maximum transfer rate (1000 64-byte reports per second on USB 2.0), no matter how hard I try. So I give up on synchronous transfer, as the benefit is low while the effort is high (given that both of us have working implementations with multiple protocol layers).
i.e. having the sender continue to send until the receiver indicates to slow down
Considering my tests, I doubt that there would be a speed increase with this (at least not for P4wnP1, as HID communication always runs at maximum speed, while the upper thread-based layers work on demand). Here's the output code, which has no console IO or array creation overhead, but reaches 8 KB/s max:
$outbytes = New-Object Byte[] (65)
$msg=[system.Text.Encoding]::ASCII.GetBytes("Hello World")
for ($i=0; $i -lt $msg.Length; $i++) { $outbytes[$i + 1] = $msg[$i] }
for ($i=0; $i -lt 1000; $i++)
{
$HIDout.Write($outbytes,0,65)
}
And here's my test script; use it as you need to. I solved the problem of enumerating device interfaces based on HID report descriptors, which took a ton of shitty C# code. This is another reason to leave this path. The only useful thing about this code is that I'm able to spot my composite device based on the serial + manufacturer string, which isn't possible with WMI enumeration (the strings for the interface drivers are different). That's nice because, as said, I often change VID/PID, but again, creating a temporary C# file for inline compilation renders this useless.
So I guess I'm stuck at ~4 KByte/s maximum synchronous transfer, or could achieve ~8 KByte/s at the cost of shitty .NET code (while consuming additional USB EPs). Maybe I'll use dedicated input/output reports for faster file transfer later on, and I'll still be slower than my first modem ;-).
Best regards and thanks for the exchange on this.
P.S. Excuse typos, my spellcheck is fighting against English language
Hi,
It is certainly NOT the case that Windows/Powershell cannot achieve higher than what you are currently able to get. I have been able to get up to 48kBytes/second using powershell code (however, it was unreliable/inconsistent, and didn't include all the "refinements" required for multiple concurrent connections).
It did maintain that rate for at least several seconds, so I don't think that this is an unreachable goal.
Rogan
Could you provide a snippet of code reaching this rate? The snippet I provided above has no (obvious) room for improvement (it writes data out in a minimal loop).
Here is some real sample code, and some numbers to go with it:
On Windows:
speedtest.ps1
$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
public class w {
[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);
public static FileStream o(string fn) {
return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
}
}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs
& {
$devs = gwmi Win32_USBControllerDevice
foreach ($dev in $devs) {
$wmidev = [wmi]$dev.Dependent
if ($wmidev.GetPropertyValue('DeviceID') -match ('1209&PID_6667') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
}
}
try {
$f = [n.w]::o($fn)
$d = New-Object IO.MemoryStream
$c = 0
$b = New-Object Byte[]($M+1)
$sw = $null
while($c -lt 1024 * 1024) {
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b,1, $M)
$c += $M
}
$sw.Stop()
$sw.Elapsed
$d.Length
([Text.Encoding]::ASCII).GetString($d.ToArray())
} catch {
echo $_.Exception|format-list -force
}
exit
}
This waits for the first successful read from the RAW HID interface and starts the stopwatch, then exits after 16384 iterations (1024*1024/64==16384)
I run it like so:
> powershell -exec bypass -file speedtest.ps1 > log.txt
On the Pi:
# time seq -f "%063g" 1 17000 > /dev/hidg1
This writes the numbers 1-17000, formatted into a 63 character zero-padded string (with a CR added to make 64 bytes) to the hid interface. I run longer than 16384 to account for any packets getting lost.
The results (with ^M's removed):
Days : 0
Hours : 0
Minutes : 0
Seconds : 16
Milliseconds : 402
Ticks : 164026490
TotalDays : 0.000189845474537037
TotalHours : 0.00455629138888889
TotalMinutes : 0.273377483333333
TotalSeconds : 16.402649
TotalMilliseconds : 16402.649
1048576
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
...
000000000000000000000000000000000000000000000000000000000016383
000000000000000000000000000000000000000000000000000000000016384
In other words, 1MB successfully transferred (with no lost packets, since the last number is indeed 16384) in 16.4 seconds, a total rate of 63 937 bytes/second.
What gets interesting is adding a Write statement into the powershell after $d.Write(), that writes the received bytes back to the device, then updating the command on the pi to:
time seq -f "%063g" 1 16384 | socat - /dev/hidg1 > t
In theory, seq would generate 16384 lines, the powershell would echo them one at a time, and "t" would contain those 16384 lines.
In practice, I ended up with only around 2000 lines in "t", and the powershell waiting to receive its full 1MB. I even upped the number of lines to 65536, and the powershell still didn't terminate, indicating that even though I sent 4 times more data than required, it still didn't successfully receive a full 1MB :-(
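For reference, the echo modification described above amounts to a sketch like this inside the read loop of speedtest.ps1 (assumed placement, not the exact code used):
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b, 1, $M)
$f.Write($b, 0, $M+1)   # echo the received report straight back to the device
$c += $M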
So, this seems to be a fairly fundamental limitation of USB raw hid (perhaps only on Windows, and perhaps only with Powershell. More testing required!)
In fact, this might justify using two endpoints, one for reading, and one for writing, which I otherwise thought was a bad idea. ;-)
And this code, while not beautiful by any means, gets reasonable performance, while not losing any data!
$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
public class w {
[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);
public static FileStream o(string fn) {
return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
}
}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs
$readloop = {
Param($M, $f, $q)
try {
$d = New-Object IO.MemoryStream
$c = 0
$sw = $null
while($c -lt 1024*1024) {
$b = New-Object Byte[]($M+1)
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b,1, $M)
$c += $M
# [System.Threading.Monitor]::Enter($q)
$q.Enqueue($b)
# [System.Threading.Monitor]::Pulse($q)
# [System.Threading.Monitor]::Exit($q)
}
$sw.Stop()
$sw.Elapsed
$d.Length
([Text.Encoding]::ASCII).GetString($d.ToArray())
} catch {
$_.Exception|format-list -force
}
exit
}
$writeloop = {
Param($M, $f, $q)
try {
while ($true) {
# [System.Threading.Monitor]::Enter($q)
# [System.Threading.Monitor]::Wait($q)
[System.Console]::Write("!")
if ($q.Count -gt 0) {
[System.Console]::WriteLine($q.Count)
while ($q.Count -gt 0) {
$b = $q.Dequeue()
$f.Write($b, 0, $M+1)
}
}
Start-Sleep -m 10
# [System.Threading.Monitor]::Exit($q)
}
} catch {
[System.Console]::WriteLine("Write Thread Done!")
$_.Exception
}
exit
}
$Q = New-Object System.Collections.Queue
$Q = [System.Collections.Queue]::Synchronized($Q)
$devs = gwmi Win32_USBControllerDevice
foreach ($dev in $devs) {
$wmidev = [wmi]$dev.Dependent
if ($wmidev.GetPropertyValue('DeviceID') -match ('1209&PID_6667') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
}
}
$f = [n.w]::o($fn)
$readThread = [PowerShell]::Create()
[void] $readThread.AddScript($readloop)
[void] $readThread.AddParameter("M", $M)
[void] $readThread.AddParameter("f", $f)
[void] $readThread.AddParameter("q", $q)
$writeThread = [PowerShell]::Create()
[void] $writeThread.AddScript($writeloop)
[void] $writeThread.AddParameter("M", $M)
[void] $writeThread.AddParameter("f", $f)
[void] $writeThread.AddParameter("q", $q)
[System.IAsyncResult]$AsyncReadJobResult = $null
[System.IAsyncResult]$AsyncWriteJobResult = $null
try {
$AsyncWriteJobResult = $writeThread.BeginInvoke()
Sleep 1 # Wait 1 second to give some time for the write thread to be ready
$AsyncReadJobResult = $readThread.BeginInvoke()
Write-Host "Ready"
} catch {
$ErrorMessage = $_.Exception.Message
Write-Host $ErrorMessage
} finally {
if ($readThread -ne $null -and $AsyncReadJobResult -ne $null) {
$readThread.EndInvoke($AsyncReadJobResult)
$readThread.Dispose()
}
if ($writeThread -ne $null -and $AsyncWriteJobResult -ne $null) {
$writeThread.EndInvoke($AsyncWriteJobResult)
$writeThread.Dispose()
}
exit
}
On Linux:
pi@raspberrypi:~ $ time seq -f "%063g" 1 16384 | socat - /dev/hidg1 > t
real 0m40.080s
user 0m1.030s
sys 0m3.360s
pi@raspberrypi:~ $ wc -l t
16384 t
pi@raspberrypi:~ $
So, 40 seconds elapsed, to send 2MB back and forth (1MB in each direction), with no errors or lost packets. That's pretty good, I think!
I was hoping to use the Monitor class to allow one thread to notify the other, but I ended up with a deadlock, where the reader had already added an item and pulsed $q, while the writer had just ended the while loop and was getting around to calling Wait($q). Since the reader had read the last packet, there were no more "pulse"s sent, and the writer waited forever, even though there was actually one last packet in the queue.
I'll test this with the two-endpoint script provided above. It's likely that I missed the transfer interruption which occurred in your test, because I terminated the loops after 1000 reports. Additionally, I have to test the code provided by you and try to work out what causes the transfer rate difference. I'm not able to work on this during the weekend, but will report back on both next week.
Although I haven't had time to fully dive into your code, it seems you could change it: the $q object is already thread-safe, I think (I use it in my stage 2 without manual synchronization, without issues). As you applied a 10 ms sleep in the write loop, you could move this sleep into an else branch of the condition if ($q.Count -gt 0). Thus you would react to enqueued data with at most a 10 ms delay (only if $q.Count has fallen to 0, not if there's continuous data in the queue). After doing this it should work without the Monitor lock (triggering based on the synchronized $q.Count).
I've done this here (lines 233 and 310) but with a 50 ms delay. CPU consumption goes below 1 percent with the sleep. The sleep has no impact on the throughput of continuous data, but again this is on an upper layer.
So I'm going to report back next week...promise
Ignore the last comment. I missed that the inner while loop empties the queue before the sleep is called. You've been looking for a way to avoid polling the queue count, via Monitor, as far as I understand.
Although I should do other things right now, I couldn't stop thinking about your examples. So I tested the following code (I only changed the PID/VID and added an early out to the WMI device enumeration):
$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
public class w {
[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);
public static FileStream o(string fn) {
return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
}
}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs
& {
$devs = gwmi Win32_USBControllerDevice
foreach ($dev in $devs) {
$wmidev = [wmi]$dev.Dependent
if ($wmidev.GetPropertyValue('DeviceID') -match ('1D6B&PID_0137') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
break # second dev string invalid handle
}
}
try {
$f = [n.w]::o($fn)
$d = New-Object IO.MemoryStream
$c = 0
$b = New-Object Byte[]($M+1)
$sw = $null
while($c -lt 1024 * 1024) {
#[Console]::WriteLine("$c")
$r = $f.Read($b, 0, $M+1)
if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
$d.Write($b,1, $M)
$c += $M
}
$sw.Stop()
$sw.Elapsed
$d.Length
([Text.Encoding]::ASCII).GetString($d.ToArray())
} catch {
echo $_.Exception|format-list -force
}
exit
}
On my first attempt I thought it wasn't working (that's why I added the printout of $c). Leaving the code running for more than a minute, I realized that my issue isn't PowerShell:
PS D:\P4wnP1> D:\P4wnP1\powershell\tests\fastread_usabuse.ps1
Days : 0
Hours : 0
Minutes : 2
Seconds : 11
Milliseconds : 64
Ticks : 1310641048
TotalDays : 0,00151694565740741
TotalHours : 0,0364066957777778
TotalMinutes : 2,18440174666667
TotalSeconds : 131,0641048
TotalMilliseconds : 131064,1048
1048576
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
... snip ...
000000000000000000000000000000000000000000000000000000000016381
000000000000000000000000000000000000000000000000000000000016382
000000000000000000000000000000000000000000000000000000000016383
I'm still stuck at ~7.8 KBytes/s.
I'm using usb_f_hid.ko with libcomposite.ko, which is needed for the P4wnP1 features. Could it be that you're using g_hid.ko?
I don't get it - why is my USB communication so slow :-(
Okay, you're using the same.
So I'm not sure where to go now; I have to think about what causes the speed drop.
I'm exactly at 1/8th of your speed (8 ms per report read/write). This explains my 3.2 KBytes/s upper bound: I use alternating read/write, which consumes 16 ms per 64-byte fragment, leaving me at an effective maximum of 4 KBytes/s for one direction. This drops to 3.2 KBytes/s because of my naive implementation of fragment reassembly.
Could we compare Raspberry Pi specs:
root@p4wnp1:~# ls /sys/class/udc
20980000.usb
root@p4wnp1:~# uname -r
4.4.50+
root@p4wnp1:~# lsmod | grep hid
usb_f_hid 10837 6
libcomposite 49479 15 usb_f_ecm,usb_f_hid,usb_f_rndis
I hope the UDC is always the same ... still don't get it
I've found the root cause (wouldn't be able to sleep otherwise). New results:
PS C:\Users\XMG-U705> D:\P4wnP1\powershell\tests\fastread_usabuse.ps1
Days : 0
Hours : 0
Minutes : 0
Seconds : 16
Milliseconds : 400
Ticks : 164008304
TotalDays : 0,000189824425925926
TotalHours : 0,00455578622222222
TotalMinutes : 0,273347173333333
TotalSeconds : 16,4008304
TotalMilliseconds : 16400,8304
1048576
000000000000000000000000000000000000000000000000000000000000001
... snip ...
000000000000000000000000000000000000000000000000000000000016383
000000000000000000000000000000000000000000000000000000000016384
It was a really silly mistake. I used this code for my HID device:
# create RAW HID function
# =======================================================
if $USE_RAWHID; then
mkdir -p functions/hid.g2
echo 1 > functions/hid.g2/protocol
echo 1 > functions/hid.g2/subclass
echo 8 > functions/hid.g2/report_length
cat $wdir/conf/raw_report_desc > functions/hid.g2/report_desc
fi
The report length I used was exactly 1/8th of the size it should be (8 instead of 64; a copy-paste leftover from the HID keyboard code).
So I'm at the same speed as you now. I have a minor issue in my code, because I reassemble the reports by manually creating new static arrays. This has a huge performance impact on large transfers (a static array with size = old_size + 64 is created before concatenating the new data) and hinders me in doing meaningful transfer tests. I'll do some new speed tests on file transfer after changing the report assembler to work with a MemoryStream instead.
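As a rough illustration of that change (a minimal sketch, assuming the blocking $HIDin FileStream and 65-byte reports from the test scripts above), the reassembler could simply append each payload to a MemoryStream:
$assembled = New-Object IO.MemoryStream
$report = New-Object Byte[] (65)
for ($i = 0; $i -lt 1000; $i++)
{
    $cr = $HIDin.Read($report, 0, 65)   # blocking read of one report
    $assembled.Write($report, 1, 64)    # skip the report ID byte, append the payload
}
$data = $assembled.ToArray()            # single allocation at the end instead of one per report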
I've already implemented an application-layer function which loads remote files into a dynamically generated PowerShell variable in the host process; this could be used for file transfer testing. I'm going to revisit the full duplex approach with two EPs only if this function works too slowly (less than half of the maximum transfer rate). Anything else should be doable with more code optimization. Continuous alternating read/write seems to be okay for me.
@RoganDawes I want to thank you very much for discussing these points. If you'd like me to do additional tests, ask at any time (but please avoid forcing me to use an AVR ;-)).
Additionally I want to mention a new problem which could affect both of our projects. See here.
Nice work! As you can see, I am implementing the same idea both on AVR/ESP and Linux, so we can continue to collaborate ;-)
For the moment, I'm happy with the default Raspbian kernel, not having run into the problems that you have yet. I'll keep it in mind should I encounter them, though! Thanks!
The continuous alternating read/write effectively halves your throughput, which you can recover by doubling the number of end points. An alternative is to simply reduce the number of acknowledgements required. My thinking is to eliminate the ACK's from my code entirely, and assume that the packet was received unless a NAK packet is received. This would be triggered by an out of sequence sequence number being received (1,2,4, for example). If this happens, the sender could retransmit the missing packet, and continue from that point.
Of course, this means that the sender needs to keep a number of past packets available for retransmission if necessary. And, given that my sequence numbers are only 4 bits, there is a chance that the wrong packet gets retransmitted. Hmmm!
I wonder how well "packing" a number of ACK's into a single packet would work. i.e. Set a "P-ACK" flag, then pack a whole lot of ACK's into the data portion of the packet. At 4 bits, and 60 data bytes, one could pack up to 120 ACKs into a single packet, significantly reducing the number of ACK packets required, and boosting the one-way data rate. Instead of 1:1 (data:ack), you could get 120:1, and the one way traffic would essentially approach the maximum. Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data.
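A rough sketch of the packing idea (my own illustration, assuming 4-bit sequence numbers packed two per byte into a 60-byte data portion):
function Pack-Acks([byte[]]$seqNumbers)
{
    $data = New-Object Byte[] (60)          # data portion of the report
    $idx = 0
    for ($i = 0; $i -lt $seqNumbers.Length -and $i -lt 120; $i += 2)
    {
        $hi = ($seqNumbers[$i] -band 0x0F) * 16   # upper nibble
        $lo = 0
        if ($i + 1 -lt $seqNumbers.Length) { $lo = $seqNumbers[$i + 1] -band 0x0F }
        $data[$idx] = [byte]($hi + $lo)
        $idx++
    }
    ,$data    # up to 120 ACKs packed into 60 bytes
}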
Regardless, implementing the blocking read thread, and updating the rest of the code in Proxy.ps1 should result in significant improvements! Let's see if I can actually achieve 32000 bytes per second?!
I'm not sure why, but I'm at 48KBps with alternating read/write (should be <32KB).
https://www.youtube.com/watch?v=MI8DFlKLHBk&yt:cc=on
Used code is here (stage1_mini.ps1, stage2.ps1 and hidsrv9.py): https://github.com/mame82/P4wnP1/tree/devel/hidtools
I have to recap your suggestions and the remaining possibilities: MemoryStream instead of Byte[]. The performance improvement seems to be influenced by disabling the other USB functions (RNDIS/ECM), and a further speed increase was achieved by using WriteAsync on the MemoryStreams. So for me, the alternating read/write approach seems the way to go.
Maybe the report ID could be used to improve things further (destination port, or in your case the channel, could be moved there). I don't know how much the report descriptor grows if one defines 255 possible report IDs (0 seems to be reserved for "no ID in use").
Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data.
Why? As far as I understand your idea, payload data would be decoupled from logical header data (SYN, ACK, P-ACK, SEQ). The scenario of changing the send direction is like PUSH: instead of sending the P-ACK packet the old sender is waiting for, the new sender sends its data (with an urgent or push flag), and the old sender (which is now the receiver) knows that it has to reassemble an incoming stream before it continues to send its own data (and receive the P-ACK). A peer which wants to send data while still receiving an incoming stream (in the middle of the 120 packets) could decide on its own whether to send its data with the URGENT flag (PUSH) or whether the pending outbound data should be cached in a send queue until the incoming transmission is fully received. If the outbound buffer grows too large due to permanent incoming transmissions, data could be forced out with the PUSH flag set.
But from my understanding, no matter how far you optimize, it ends up as a more optimized HALF-DUPLEX mode.
The alternating read/write I'm using assures that a FULL DUPLEX channel is available at any time (but at half rate ... still not sure why I achieved more than 32 KB/s), as every packet can carry a payload in the direction needed (on demand).
Yes, indeed, you are absolutely right, that there is no need to flush the P-ACKs.
As you have observed, I am modelling my protocol on TCP, so I'm not sure where you get half-duplex from. Using the P-ACK approach, each side can get up to 120/121 of the available bandwidth for its own transmissions if the other side is not transmitting any actual data. If both sides need to transmit, it should balance appropriately depending on the ratios that each side requires, up to a 50:50 split.
What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing.
Of course, this is all very specific to using RAW HID as a transport, using something else like a text-based printer, or CDC or ACM would have less of an issue, I suspect, simply because the available data rate would be that much higher.
up to a 50:50 split
That's why I stated one ends up with half-duplex if both sides are sending (although your idea scales very well if only one side is sending).
What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing.
I'm still trying to get this. I was planning to extend your read-loop example to a test with concurrent but independent read and write on both ends (separate threads). As I still don't get why I could reach a transfer rate >32 KBps, I think it could be possible that the device file allows writing and reading at the same time (shared access), which essentially would mean FULL DUPLEX is possible with a single HID device interface.
I have to delay this test, as it turns out that the WriteAsync method of my MemoryStream objects is only available with .NET 4.5, so I lost backward compatibility with Windows 7, which I want to fix first.
Of course, this is all very specific to using RAW HID as a transport, using something else like a text-based printer, or CDC or ACM would have less of an issue, I suspect, simply because the available data rate would be that much higher.
Sure, but moving away from HID would mean getting louder (triggering endpoint protection on the target). I like the HID approach very much, and using ACM or other communication classes would be too easy ;-).
News:
Replacing WriteAsync on my background MemoryStreams with BeginWrite/EndWrite dropped my transfer rate into the expected range:
End receiving /tmp/test received 1.048.517 Byte in 41,2182 seconds (24,84 KB/s)
Every call to BeginWrite assures that EndWrite is called for the previous write, which I hadn't done like this while using WriteAsync. Inspecting the source of the WriteAsync implementation in MemoryStream.cs, it seems every WriteAsync is done in a separate async task. I'm still not sure how my transfer rate could grow above 32 KB/s with WriteAsync, as I'm waiting for an answer to every packet sent before sending more data. Anyway, I updated my code and will move on to preparing a concurrent read/write test.
Going to ping back with the results.
I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.
I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.
You're right again; the benefit couldn't be measured, but I was wondering why I achieved this high rate with WriteAsync?!
Anyway, meanwhile I've found the answer. Good news... believe it or not: the device file is FULL DUPLEX. You can send data in both directions at about 64 KB/s in parallel.
Test output on the PowerShell side (reading and writing at the same time):
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
16384 reports have been read in 16.5860813 seconds (61.7385132436316 KB/s)
16385 reports have been written in 16.61491 seconds (61.635151800401 KB/s)
Killing remaining threads
Godbye
And the other end (Python; the first read is the trigger to start the threads):
Count 1 reports read in 0.000141143798828 seconds (0.0 KB/s)
Count 16384 reports written in 16.5881521702 seconds (61.730805788 KB/s)
Count 16383 reports read in 16.6130959988 seconds (61.5779262382 KB/s)
Although I didn't measure the overall time, reading and writing have been done concurrently. The difference is that I fully decoupled inbound and outbound data (no echo server).
I haven't implemented any tests for packet loss, but the sending and receiving threads have to have matching report counts in order to allow the threads to terminate. So it is very likely that there is no packet loss. Anyway, packet content has to be checked on both ends (I'm sure I ran into an issue where writing a report before reading the pending input report cleared the input report).
Testing this is easy, as I included routines to print out inbound data on both sides (disabled to reduce the influence on time measurement).
Test code will be up in some minutes.
Next interesting observation: starting only the PowerShell side of the communication (no Python endpoint on the Pi Zero), it turns out that write() is blocking too. If no reports are read on the RPi's end, a maximum of 4 reports can be written before write() blocks.
This means there should never be any packet (= report) loss, which again means there's no need to use ACKs on a per-report basis. One could simply reassemble reports into a bigger stream. No problem on Linux, but PS is still missing something like a FIFO MemoryStream (which in fact could be implemented easily on your side, as you use inline C#).
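A minimal sketch of such a FIFO on the PS side (an assumption of mine, reusing the synchronized Queue from the scripts above, with one report payload per entry):
$fifo = [System.Collections.Queue]::Synchronized((New-Object System.Collections.Queue))
$stream = New-Object IO.MemoryStream

# read thread: enqueue the 64-byte payload of every report it receives
$payload = New-Object Byte[] (64)
$fifo.Enqueue($payload)

# upper layer: drain whatever has arrived so far and reassemble it into the stream
while ($fifo.Count -gt 0) {
    $chunk = [byte[]]$fifo.Dequeue()
    $stream.Write($chunk, 0, $chunk.Length)
}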
Test from PS printing out the report count on write, without a listener running on the RPi:
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
reports written 0
reports written 1
reports written 2
reports written 3
Note: none of the pending reports is lost if the Python side is started some minutes later.
@RoganDawes here are the test files https://github.com/mame82/tests/tree/master/fullduplex
I guess I have to reimplement everything, as alternating read/write is the worst approach.
Have you found a replacement for named pipes to interface with upper protocol layers (some sort of FIFO stream available in .NET 3.5)?
Very interesting results! I guess my implementation was naive, as an echo server, introducing delays!
I never bother reassembling the stream, I simply write the data portion of the packets to their destination as I receive them. So I have no need for a FIFO memory stream at all.
The main issue then is failure to read packets on the Powershell side, resulting in lost data. This is easily seen by introducing a console write in the read loop, I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!
The main issue then is failure to read packets on the Powershell side, resulting in lost data. This is easily seen by introducing a console write in the read loop, I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!
I haven't had packet loss at any time. This seems to be clear now, as write() is blocking if data isn't read on the other end. Read() was blocking, too.
I never bother reassembling the stream, I simply write the data portion of the packets to their destination as I receive them. So I have no need for a FIFO memory stream at all.
I was thinking about a more common interface. You worked with /dev/hidg1 and socat directly, which is the natural way on Linux - I was looking for the same on Windows (a common interface to pipe into, for example, a SOCKS5 proxy). But I guess this was more dreaming than real thinking, at least in the Microsoft world of things ;-)
Very interesting results! I guess my implementation was naive, as an echo server, introducing delays!
This in fact leads to half the transmit rate, as a full report has to be received before something gets written (same as my now absolutely useless read-then-write).
mmm. try putting a:
[System.Console]::Write("R")
in your read loop, and see what happens.
Better yet, try doing a
seq -f "%063g" 1 1000 > /dev/hidg1
on the Pi, when there is nothing running on the other end. If it finishes, the packets got lost . . . .
I had put in this (Console::Write() doesn't work in PS ISE with threads):
# normal script block, should be packed into thread later on
$HIDinThread = {
$hostui.WriteLine("Reading up to $in_count reports, with blocking read")
$inbytes = New-Object Byte[] (65)
$sw = New-Object Diagnostics.Stopwatch
for ($i=0; $i -lt $in_count; $i++)
{
$cr = $HIDin.Read($inbytes,0,65)
if ($i -eq 0) { $sw.Start() }
$utf8 = [System.Text.Encoding]::UTF8.GetString($inbytes)
$hostui.WriteLine($utf8)
}
$sw.Stop()
$timetaken = $sw.Elapsed.TotalSeconds
$KBps = $in_count * 64 / 1024 / $timetaken
$hostui.WriteLine("$in_count reports have been read in $timetaken seconds ($KBps KB/s)")
}
Result:
____________________________________________________________________________________________________________________________________________________________________________________
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Writing 16385 reports with synchronous 'Write'
Reading up to 16384 reports, with blocking read
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
... snip... (no loss)
000000000000000000000000000000000000000000000000000000000000998
000000000000000000000000000000000000000000000000000000000000999
000000000000000000000000000000000000000000000000000000000001000
As said, the receive count was capped by the for loops, which only terminate once exactly the number of packets sent before has been received.
One more note: writing to /dev/hidg1 of course isn't blocking. If no reader is in place on the Windows end, the data is lost.
The only case of report loss I could imagine would be if reports are written on the Linux end (slow RPi) and the Windows end reads back too slowly (unlikely, but possible).
But you're right again: no listener on Windows = data loss.
@RoganDawes I guess the solution is here:
I started the PowerShell threads first, but deployed a delay in the read thread to force packet loss:
# normal script block, should be packed into thread later on
$HIDinThread = {
$hostui.WriteLine("Reading up to $in_count reports, with blocking read")
$inbytes = New-Object Byte[] (65)
$sw = New-Object Diagnostics.Stopwatch
for ($i=0; $i -lt $in_count; $i++)
{
$cr = $HIDin.Read($inbytes,0,65)
if ($i -eq 0) { $sw.Start() }
Start-Sleep -m 100 # try to miss reports
$utf8 = [System.Text.Encoding]::UTF8.GetString($inbytes)
$hostui.WriteLine($utf8)
}
$sw.Stop()
$timetaken = $sw.Elapsed.TotalSeconds
$KBps = $in_count * 64 / 1024 / $timetaken
$hostui.WriteLine("$in_count reports have been read in $timetaken seconds ($KBps KB/s)")
}
Additionally I added console output before the first report is sent from the PS out thread:
for ($i=0; $i -lt $out_count; $i++)
{
    if ($i -eq 0) { $hostui.WriteLine("Sending first report out on send thread") } # output is blocked by the other thread if interacting with $hostui, so this line couldn't be placed exactly
    $HIDout.Write($outbytes,0,65)
    if ($i -eq 0) { $sw.Start() }
    #$hostui.WriteLine("reports written $i") # test how many reports are needed till write() blocks if no receiver is on the other end
}
Starting the PS Process first and running:
seq -f "%063g" 1 100 > /dev/hidg1
I get the following interesting result:
____________________________________________________________________________________________________________________________________________________________________________________
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
Sending first report out on send thread
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000069
000000000000000000000000000000000000000000000000000000000000070
000000000000000000000000000000000000000000000000000000000000071
000000000000000000000000000000000000000000000000000000000000072
000000000000000000000000000000000000000000000000000000000000073
000000000000000000000000000000000000000000000000000000000000074
000000000000000000000000000000000000000000000000000000000000075
000000000000000000000000000000000000000000000000000000000000076
000000000000000000000000000000000000000000000000000000000000077
000000000000000000000000000000000000000000000000000000000000078
000000000000000000000000000000000000000000000000000000000000079
000000000000000000000000000000000000000000000000000000000000080
000000000000000000000000000000000000000000000000000000000000081
000000000000000000000000000000000000000000000000000000000000082
000000000000000000000000000000000000000000000000000000000000083
000000000000000000000000000000000000000000000000000000000000084
000000000000000000000000000000000000000000000000000000000000085
000000000000000000000000000000000000000000000000000000000000086
000000000000000000000000000000000000000000000000000000000000087
000000000000000000000000000000000000000000000000000000000000088
000000000000000000000000000000000000000000000000000000000000089
000000000000000000000000000000000000000000000000000000000000090
000000000000000000000000000000000000000000000000000000000000091
000000000000000000000000000000000000000000000000000000000000092
000000000000000000000000000000000000000000000000000000000000093
000000000000000000000000000000000000000000000000000000000000094
000000000000000000000000000000000000000000000000000000000000095
000000000000000000000000000000000000000000000000000000000000096
000000000000000000000000000000000000000000000000000000000000097
000000000000000000000000000000000000000000000000000000000000098
000000000000000000000000000000000000000000000000000000000000099
000000000000000000000000000000000000000000000000000000000000100
Seems there's no report loss if the first send has taken place from the Windows side.
The last assumption is wrong:
000000000000000000000000000000000000000000000000000000000000081
000000000000000000000000000000000000000000000000000000000000183
000000000000000000000000000000000000000000000000000000000000290
000000000000000000000000000000000000000000000000000000000000393
000000000000000000000000000000000000000000000000000000000000495
So ACKs have to be sent from the Pi to Windows :-(
Assuring that the Windows end reads fast enough couldn't be done reliably otherwise, I guess.
Looks like you lost 68 reports to me?
Looks like you lost 68 reports to me?
Indeed, until the first report was sent from Windows.
But raising the simulated read delay produces more report loss (so the assumption that a first send from Windows is needed was wrong).
So again I struggle on Windows, as I know Linux isn't able to send if nothing is received (remember the crash on the unresponsive IRQ when sending data to /dev/hidg1 before Windows is able to read).
So one has to assure one read per millisecond on Windows, or use ACKs from the RPi to Windows.
Seems your P-ACK idea is the best way to do this; you're right again.
So, as discussed, let Windows be the first to send a packet. Once the Linux side has received a packet, you know that the Windows side is ready to receive, and communications can begin.
Alternatively, you can monitor the dmesg output to see when the relevant USB configuration has been selected by the Windows host to know that the endpoint has been "activated". However, it still doesn't mean that the powershell is running yet - this you can "discover" by waiting for the powershell to send you the first packet!
Yes, I'm already doing this, both in the production code and in the full duplex example provided above. Anyway, I was wrong assuming that reports aren't missed when sending is started from Windows. It is simple... Linux writes to the HID device non-blocking, and reading from the FileStream misses reports if done too slowly. This isn't the case for sending from Windows to Linux. While searching for a solution to replace FileStream.Read() with something that gives access to the underlying buffer (to block writing on the other end if the buffer is full), I stumbled across feature reports. Not only do they not rely on the FileStream, they are handled with control transfers instead of interrupt transfers. So if I'm right, the 1000 reports per second boundary doesn't apply to feature reports.
I'm thinking about new tests, changing the underlying mechanics away from input/output reports, but I'm running out of time.
Well, go with the 32kBps option in the meantime, version 2 can get higher speed ;-)
Thumbs up for this comment... I'm already suffering from tunnel vision, trying to optimize low-level HID communications while losing focus on other things I wanted to implement in P4wnP1. But it doesn't get boring; another funny thing is that I'm faking RNDIS to run at 20 GBit/s, which involves different issues. If you're interested in this, here's the link https://github.com/mame82/ratepatch (applies to Raspbian with kernel 4.4.50+).
Not surprisingly, I'm still thinking about the report loss occurring when writing to /dev/hidg on Linux and reading back from PowerShell too slowly.
So please excuse the next large paste of test output. I observed the following: running the read loop on Windows with a 500 ms delay, report loss is assured. I started such a read loop in PS and sent large chunks of output reports from Linux. The report content is "Number xxx" - xxx represents the report number written. The read loop prints the report content and a number representing the count of the read loop.
New observation: if sending is aborted on the Linux side, the last 32 reports can be read back (with the 500 ms delay) without any loss. This means the FileStream is backed by a 2048-byte buffer. If it were possible to access this buffer directly from PowerShell, a notification could be sent back to block writing on the other side (including the last seq number received). Unfortunately, I haven't found a way to access the underlying buffer... FileStream.Position and FileStream.Length are both unset.
So if you're going to implement your P-ACK idea, it seems 32 reports is the magic number to track and send ACKs for. So your sequence number misses exactly one bit to cope with that (16 values from 4 bits vs. the 32-report buffer).
Here's the test output showing the described behaviour. The parts where large amounts of reports are missed have been caused by unlimited sending. The parts with continuous report reception are the result of manually aborting sending from the Linux side (the last 32 reports are read back despite the 500 ms delay).
Number 0
0
Number 327
1
Number 594
2
Number 595
3
Number 596
4
Number 597
5
Number 598
6
Number 599
7
Number 600
8
Number 601
9
Number 602
10
Number 603
11
Number 604
12
Number 605
13
Number 606
14
Number 607
15
Number 239
16
Number 364
17
Number 365
18
Number 366
19
Number 367
20
Number 368
21
Number 369
22
Number 370
23
Number 371
24
Number 372
25
Number 373
26
Number 374
27
Number 375
28
Number 376
29
Number 377
30
Number 378
31
Number 379
32
Number 380
33
Number 381
34
Number 382
35
Number 383
36
Number 384
37
Number 385
38
Number 386
39
Number 387
40
Number 388
41
Number 389
42
Number 390
43
Number 391
44
Number 392
45
Number 393
46
Number 394
47
Number 395
48
Number 0
49
Number 306
50
Number 651
51
Number 994
52
Number 1045
53
Number 1046
54
Number 1047
55
Number 1048
56
Number 1049
57
Number 1050
58
Number 1051
59
Number 1052
60
Number 1053
61
Number 1054
62
Number 1055
63
Number 1056
64
Number 1057
65
Number 1058
66
Number 1059
67
Number 1060
68
Number 1061
69
Number 1062
70
Number 1063
71
Number 1064
72
Number 1065
73
Number 1066
74
Number 1067
75
Number 1068
76
Number 1069
77
Number 1070
78
Number 1071
79
Number 1072
80
Number 1073
81
Number 1074
82
Number 1075
83
Number 1076
84
Number 0
85
Number 1
86
Number 2
87
Number 3
88
Number 4
89
Number 5
90
Number 6
91
Number 7
92
Number 8
93
Number 9
94
Number 10
95
Number 11
96
Number 12
97
Number 13
98
Number 14
99
Number 15
100
Number 16
101
Number 17
102
Number 18
103
Number 19
104
Number 20
105
Number 21
106
Number 22
107
Number 23
108
Number 24
109
Number 25
110
Number 26
111
Number 27
112
Number 28
113
Number 29
114
Number 30
115
Number 31
116
Number 0
117
Number 127
118
Number 128
119
Number 129
120
Number 130
121
Number 131
122
Number 132
123
Number 133
124
Number 134
125
Number 135
126
Number 136
127
Number 137
128
Number 138
129
Number 139
130
Number 140
131
Number 141
132
Number 142
133
Number 143
134
Number 144
135
Number 145
136
Number 146
137
Number 147
138
Number 148
139
Number 149
140
Number 150
141
Number 151
142
Number 152
143
Number 153
144
Number 154
145
Number 155
146
Number 156
147
Number 157
148
Number 158
149
Here's the example output
Interesting! So, if I limited my "packets in flight without ACK" to 16 (max of my sequence numbers), I could be sure that there would be no packet loss. Funnily enough, I instrumented my "echo loop" to indicate how many reports there were in the queue at the beginning of the while loop. Not once did I get more than 16 reports, with a 1ms sleep once the queue was drained.
The unfortunate part is that my sequence numbers are per "connection", of which I can have up to 255 at once (in theory). So I'd have to track the unacknowledged packets at a different level. Which unfortunately, is a bit of a layering violation, I think.
I think the "solution" is going to be making sure that the read loop just reads as fast as possible, and if any packets are observed to be missing, to send a RST on that channel, and let it start again. Not particularly robust, but should work, I hope!
FWIW, by simply substituting the $device.BeginRead/EndRead pairs with dequeueing packets from the readloop/queue, I managed to get 13kBps throughput with a cmd.exe doing "dir /s". Strangely, when writing the packets out to a socket, the throughput dropped to about 8kBps.
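The substitution described here is roughly this (a sketch of mine, assuming the synchronized $Q filled by the read loop from the earlier script):
while ($true) {
    if ($Q.Count -gt 0) {
        $report = [byte[]]$Q.Dequeue()
        # ... handle the 64-byte data portion of $report for the matching channel ...
    } else {
        Start-Sleep -m 1   # avoid a busy loop while the read thread fills the queue
    }
}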
@RoganDawes
While starting to implement a new lower-layer communication scheme based on our observations, I put some comments (design ideas) into the source of the concurrent read/write test case (the one with 64000 bytes/s full duplex on a single device file), to avoid report loss.
As this isn't implemented in 5 minutes, I'd like to kindly ask you to review these comments before I start coding (and maybe throw it all away in the end).
Idea (from the Linux point of view = USB device, not host):
# Writing out reports doesn't mean that the receiver is able to read
# them back (if it reads too slowly, writing from this side isn't blocking)
#
# If the receiver is Windows via a FileStream object, it was observed that
# exactly 2048 bytes = 32 reports are cached in a ring buffer, which gets
# overwritten if more reports are sent before they are read back
#
# To assure every report is read, report loss detection is applied
# only for reports written to the host (HID input reports), as it has been
# observed that reports read from the host (OUTPUT reports) don't get lost
# (the write call to the HID device FileStream blocks after writing 4 reports
# without reading them back on this end)
#
# So outgoing sequence numbers are deployed, reaching from 0 to 31 to
# match the FileStream Buffer on windows.
# Outgoing report format is (INPUT REPORT for host):
# 0: length (effective payload length in report, excluding header)
# 1: seq (outgoing sequence number)
# 2: src (like source port, but 0..255 - should maybe moved to an upper layer)
# 3: dst (like destination port, but 0..255 - should maybe moved to an upper layer)
# 4..63: payload, padded with zeroes if needed
# Note: report ID isn't needed at gadget side, thus report size is 64 bytes
# Incoming report format is (OUTPUT REPORT from host):
# 0: length (effective payload length in report, excluding header)
# 1: ack (acknowledge number, holding the last SEQ number the host has read back)
# 2: src (like source port, but 0..255 - should maybe moved to an upper layer)
# 3: dst (like destination port, but 0..255 - should maybe moved to an upper layer)
# 4..63: payload, padded with zeroes if needed
# Note: report ID isn't needed at gadget side, thus report size is 64 bytes
# Packets are constantly read and written. This could be seen as a carrier.
# Both peers are able to send data at any time by putting it into the report payload (length > 0).
# If a peer has no data to send, an empty report (length = 0) is sent anyway, to assure
# continuous delivery of SEQ and ACK numbers.
#
# The USB client (this device) is only allowed to send up to 32 reports (MAX_SEND) without
# receiving an ACK. These up to 32 reports are cached in an outbound queue, to allow resending if needed.
# This allows handling of the following error cases:
# Error case 1:
# The last ACK received with an OUTPUT REPORT isn't the next one awaited. As no scheduling or
# prioritization functionality is introduced in this protocol layer, received ACKs always occur in a
# sequential manner (ACKs are never lost, as host-to-device communication is safe in terms of report loss).
# Examples for valid ACK sequences:
# 0, 1, 2, 3 ...
# 30, 31, 0, 1 ...
# Example for invalid ACK sequences:
# 0, 1, 2, 5
# 31, 1
# Note: The corner case that exactly 33 reports (or n%32+1 reports) are missed would lead to an ACK
# sequence like this: 29, 30 ... miss 33 ... , 2. This in fact can never happen, as writing reports
# without a received ACK is blocked once the outbound queue grows to 32.
#
# Error case 1 detection:
# awaited_ACK = (last_ACK + 1) % 32 # modulo could be replaced with: if (awaited_ACK > 31) awaited_ACK -= 32
# received_ACK != awaited_ACK
#
# Error case 1 cause:
# The USB host (receiver of input report) missed all reports sent, starting from the last valid
# ACK received
# Example:
# last_valid_ACK = 10
# received_ACK = 2
# last_seq_sent = 8
#
# Input reports have been sent to the host, up to SEQ number 8.
# The host received valid reports up to last_valid_ACK=10 (last valid SEQ number seen by the receiving host was 10).
# The next report received by the host is received_ACK=2 (last SEQ number seen by the receiving host was 2, the host
# is already aware of the fact that this wasn't the right SEQ number, and ignores these packets).
#
# At this point, it is obvious that the receiving host missed the following reports:
# 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 0, 1 (last_valid_ACK + 1 ... received_ACK - 1)
# At this point it isn't known whether the host missed reports 3 to 8, which have already been sent. Thus it is declared
# that THE RECEIVER HAS TO IGNORE REPORTS WITH AN OUT-OF-BOUND SEQ NUMBER. (Another approach would be to allow the receiver to cache out-of-bound
# reports, in order to only resend reports which have been lost. This would come at the cost of additional logic, and resending "old"
# reports would cause out-of-bound sequence numbers itself, resulting in more complex error detection)
#
# To cope with that report loss, reports 11 ... 1 have to be sent, followed by reports 2 ... 10 as these have been ignored by the receiver.
#
# This isn't the most efficient approach, as in the worst case missing a single report on the receiver side could lead to resending up to
# 30 reports which had already been sent (and ignored by the receiver). But as the first missing report is always sent first and ACKs
# are received from a parallel thread, this comes down to a "1 report sent / 1 report acknowledged" case (hopefully; has to be tested)
#
# Error case 1, action to take:
# If out-of-bound ACK is received, all reports from last_valid_ACK+1 to last_seq_sent have to be retransmitted by the sender (USB device).
# If the receiver (USB host) recognizes an out-of-bound SEQ number, the report is ignored
#
# Additional note:
# The design ideas described apply to the write thread of the USB device (writing INPUT reports) and the read thread of the USB host (reading INPUT reports).
# Anyway, the ACKs are carried in output reports. As the read and write loops run in independent threads on both peers, states like "last_valid_ACK"
# and "last_SEQ_sent" are kept in synchronized global state objects, shared between the read and write thread of a peer.
# Otherwise the read and write threads are decoupled and independent; there's nothing like write-after-read (to hopefully achieve maximum throughput, at least
# for output reports). This again means that if a SEQ number is sent, it could take several incoming packets until the ACK is received (up to 32 before
# sending from device to host is blocked).
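A hypothetical sketch of the "error case 1" detection described above (names and structure are mine, not the actual implementation):
$MAX_SEND = 32

function Test-AckInSequence([int]$lastAck, [int]$receivedAck)
{
    # the only valid next ACK is last_ACK + 1, modulo the 32-report window
    $awaitedAck = ($lastAck + 1) % $MAX_SEND
    return ($receivedAck -eq $awaitedAck)
}

# examples: 30, 31, 0, 1 ... is valid; 31 followed by 1 signals report loss
Test-AckInSequence 31 0   # True
Test-AckInSequence 31 1   # False -> resend everything from SEQ 0 up to last_seq_sent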
Good news: I implemented FULL DUPLEX similar to the suggestion above, with some improvements.
Results:
PS D:\P4wnP1\powershell> D:\P4wnP1\powershell\fullduplex\fullduplex4.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#7&27da95e8&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Starting thread to continuously read HID input reports
Starting write loop continously sending HID ouput report
Global seq number readen 31
MainThread: received report: port nr. 0
MainThread: received report: port nr. 1
MainThread: received report: port nr. 2
... snip ... (no report loss in between)
MainThread: received report: port nr. 17405
MainThread: received report: port nr. 17406
MainThread: received report: port nr. 17407
Total time in seconds 22.2611788
Throughput in 45731,6303483444 bytes/s netto payload date (excluding report loss and resends)
Throughput out in the same time 53357,2822298161 bytes/s netto output (19158 reports)
Killing remaining threads
Godbye
So I'm at ~45500 bytes/s from the Pi to PowerShell (real net data, without protocol headers) and at ~53000 bytes/s from PowerShell to the Pi (concurrent full duplex read/write on a single HID device file).
Report loss detection (including blocking once the output buffer reaches 32 reports which haven't been read, and resending of unacknowledged reports) is only done from the Pi to Windows (HID input reports), as we know that in the other direction writes are blocked if no data is read back (the assumption still holds true in all tests: max. 4 report writes to the FileStream without a read on the Linux end).
Protocol overhead is reduced to 2 header bytes on the "link layer", so the payload size is 62 bytes per report.
If you're interested in the work-in-progress code, ping back.
Btw, I decided to interface to the upper layers with synchronized input/output queues consuming/holding pure reports... this still isn't fully implemented. Fragmentation/defragmentation of larger streams is going to be handled in upper layers (based on a FIN bit in reports). The DST/SOURCE fields (or channel, in your case) will be moved to upper layers too. I don't need this information on the link layer anymore, the reason being that the endpoints of this layer are well defined and pre-known: USB host <--> USB device (or PowerShell client <--> Python server).
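For illustration, a hypothetical helper matching that reduced 2-byte link-layer header (byte 0 = payload length, byte 1 = SEQ/ACK, up to 62 payload bytes; the extra leading byte is the report ID needed on the Windows FileStream side; names are mine, not the P4wnP1 code):
function New-LinkReport([byte]$seq, [byte[]]$payload)
{
    if ($payload.Length -gt 62) { throw "payload must be 62 bytes or less" }
    $report = New-Object Byte[] (65)       # report ID byte + 64-byte report
    $report[1] = [byte]$payload.Length     # link-layer header byte 0: payload length
    $report[2] = $seq                      # link-layer header byte 1: SEQ (or ACK) number
    [Array]::Copy($payload, 0, $report, 3, $payload.Length)
    ,$report
}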
Forgot to mention: the measurement was on Win 10 64-bit. On Win 7 32-bit the throughput is far lower (going to test tomorrow; the code is PS 2.0 and .NET 3.5 compatible, not sure if this is the bottleneck on Win 7).
I think the basic idea is solid. My approach is to have a single "channel identifier", resulting in a max of 256 (concurrent!) channels, rather than 65536, which I think is reasonable in the circumstances.
Packet format in my case (following some of your ideas above, so not implemented yet) would look like:
I'm still not 100% convinced about acking every packet - i.e. making a continuous stream of comms at full rate, as this will require fairly significant resources to keep up, possibly resulting in suspicious activity on the victim being detected.
The ongoing discussion continues from here.