Discussion - full duplex by splitting HID in / out into separate composite functions (no issue)

mame82 commented 7 years ago

Continuation of the ongoing discussion from here

RoganDawes commented 7 years ago

And indeed, now that you mention it, having the sequence number and packet length first, and leaving the channel and payload for a higher layer makes perfect sense, even in my implementation.

Nice work!

mame82 commented 7 years ago

So here's the test result on Windows 7 (a bit disappointing compared to win 10)

__________________________________________________________________________________________________________________________________________________________________________________
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\fullduplex4.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Starting thread to continuously read HID input reports
Starting write loop continously sending HID ouput report
Global seq number readen 31
MainThread: received report: port nr. 0                                                  
MainThread: received report: port nr. 1                                                  
MainThread: received report: port nr. 2                                                  
... snip ... (no reports lost)                       
MainThread: received report: port nr. 17404                                              
MainThread: received report: port nr. 17405                                              
MainThread: received report: port nr. 17406                                              
MainThread: received report: port nr. 17407                                              
Total time in seconds 34.1784225
Throughput in 29757,0199443816 bytes/s netto payload date (excluding report loss and resends)
Throughput out in the same time 31754,2449479639 bytes/s netto output (17505 reports)
Killing remaining threads
Godbye

____________________________________________________________________________________________________________________________________________________________________________________

I'm still not 100% convinced about acking every packet - i.e. making a continuous stream of comms at full rate, as this will require fairly significant resources to keep up, possibly resulting in suspicious activity on the victim being detected.

I still read/write reports in both directions (input and output reports) at the maximum possible rate (limited by file I/O and CPU speed). Report loss for output reports is prevented by the underlying file stream (a write blocks if nothing is read), so I ignore this direction in terms of report loss prevention (based on SEQ numbers and ACKs). For input reports I use a SEQ number on writing, to detect loss on the PowerShell side. To inform the sender (Raspberry Pi) about report loss I use an ACK number on output reports (instead of the SEQ number used in input reports, saving me one header byte).

I've dropped the approach of ACK'ing every SEQ number received; instead I changed to cumulative ACKs. This raised the new question of how to detect report loss on the Pi, as ACK sequences aren't necessarily continuous (1, 4, 6 is a valid ACK sequence which could be received by the Pi, which means that the last valid SEQ received was 6). If INPUT reports 2, 3 and 5 had been lost, the ACK sequence would look the same (1, 4, 6), with the difference that the PowerShell side is aware that reports are missing, as they haven't arrived with continuous SEQ numbers. So my current approach is to introduce a RESEND REQUEST flag bit, which changes the purpose of the ACK field to serve as the SEQ number to start resending from.

So communication looks like this (reader and writer threads are decoupled, but share the current state of the last SEQ number received and the last ACK sent):

No report loss (PowerShell perspective; read and write threads run asynchronously with different loop times, in this example the write loop is slower):
READ thread reads: SEQ 0
READ thread reads: SEQ 1 (valid, successor)
READ thread reads: SEQ 2 (valid, successor)
WRITE thread writes: ACK 2
READ thread reads: SEQ 3
READ thread reads: SEQ 4 (valid, successor)
READ thread reads: SEQ 5 (valid, successor)
WRITE thread writes: ACK 5

Same example with report loss:
READ thread reads: SEQ 0
READ thread reads: SEQ 1 (valid, successor)
READ thread reads: SEQ 4 (invalid, last valid was SEQ 1)
WRITE thread writes: ACK 2 + RESEND REQUEST bit (advise the sender to resend all reports starting from SEQ 2)
READ thread reads: SEQ 2 (valid, successor)
READ thread reads: SEQ 3 (valid, successor)
READ thread reads: SEQ 4 (valid, successor)
READ thread reads: SEQ 5 (valid, successor)
WRITE thread writes: ACK 5
... and so on ...
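The two traces above can be replayed with a small Python sketch of the receiving side. This is only an illustration, not the code from the linked repository: the class name `ReceiverState` and its methods are hypothetical, the first report is assumed to carry SEQ 0, and the resend convention follows the example above (the ACK field carries the SEQ to resume from when the RESEND REQUEST bit is set).

```python
SEQ_MODULO = 32  # sequence numbers run from 0 to 31, as in the thread


class ReceiverState:
    """Hypothetical helper shared by the read and write threads."""

    def __init__(self):
        # Assumption: the first report carries SEQ 0, so start "one before".
        self.last_valid = SEQ_MODULO - 1
        self.resend_needed = False

    def _expected(self):
        # Successor of the last in-order SEQ, wrapping from 31 back to 0.
        return (self.last_valid + 1) % SEQ_MODULO

    def on_report(self, seq):
        """Read thread: accept the report only if it is the direct successor."""
        if seq == self._expected():
            self.last_valid = seq
            self.resend_needed = False
            return True
        self.resend_needed = True  # gap detected, the report is dropped
        return False

    def next_ack(self):
        """Write thread: return (ack_field, resend_flag) for the next output report."""
        if self.resend_needed:
            return self._expected(), True  # SEQ to resume sending from
        return self.last_valid, False      # cumulative ACK
```

With real threads, access to the shared state would additionally need a lock; that is left out here to keep the sketch short.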

My report format:

INPUT REPORTS

# works with incoming sequence number, as reports could be lost if this host is reading too slowly
# report layout for incoming reports
#    0: REPORT ID
#    1: LEN: BIT7 = fin flag, BIT6 = unused, BIT5...BIT0 = Payload Length (Payload length 0 to 62)
#    2: SEQ: BIT7 = unused, BIT6 = unused, BIT5...BIT0 = SEQ: Sequence number used by the sender (0..31)
#    3..64: Payload

OUTPUT REPORTS

# works with outgoing acknowledge number, as reports could be lost if this host is reading too slowly
# valid (in-order) reports are propagated back to the sender with an acknowledge number (ACK)
# Sender has to stop sending after a maximum of 32 reports if the corresponding ACK for the
# 32nd report isn't received
# ACKs are cumulative, this means if SEQ 0, 1, 2 are read by the HIDin thread, without writing an
# output report containing the needed ACKs (caused, for example, by too much processing overhead in the output
# loop), the next ACK written will be 2 (omitting 0 and 1).
# To allow the other peer to still detect report loss, without receiving an ACK for every single report,
# a flag is introduced to fire a resend request. If this flag is set, this informs the other peer to resend
# every report, beginning from the successor of the ACK number in the ACK field (this allows acknowledging additional
# reports while requesting missed ones).

# report layout for outgoing reports
#    0: REPORT ID
#    1: LEN: BIT7 = fin flag, BIT6 = unused, BIT5...BIT0 = Payload Length (Payload length 0 to 62)
#    2: SEQ: BIT7 = unused, BIT6 = RESEND REQUEST, BIT5...BIT0 = ACK: Acknowledge number holding last valid SEQ number received by reader thread
#    3..64: Payload
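As an illustration of the two layouts, here is a minimal Python sketch that packs and unpacks such a 65-byte report with plain bit operations. The function names `build_report` and `parse_report` and the constant names are hypothetical and not taken from the linked code; `number` stands for the SEQ in input reports and for the ACK in output reports.

```python
FIN_BIT = 0x80      # bit 7 of the LEN byte
RESEND_BIT = 0x40   # bit 6 of the SEQ/ACK byte (output reports only)
LOW6_MASK = 0x3F    # bits 5..0 of either header byte
MAX_PAYLOAD = 62    # bytes 3..64
REPORT_SIZE = 65    # report ID + 2 header bytes + 62 payload bytes


def build_report(payload, number, fin=False, resend=False, report_id=0):
    """Pack a report; unused payload bytes are zero padded."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 62 bytes")
    if not 0 <= number < 32:
        raise ValueError("SEQ/ACK must be in 0..31")
    len_byte = len(payload) | (FIN_BIT if fin else 0)
    seq_byte = number | (RESEND_BIT if resend else 0)
    report = bytes([report_id, len_byte, seq_byte]) + bytes(payload)
    return report.ljust(REPORT_SIZE, b"\x00")


def parse_report(report):
    """Unpack a report into its header fields and the used payload bytes."""
    len_byte, seq_byte = report[1], report[2]
    length = len_byte & LOW6_MASK
    return {
        "fin": bool(len_byte & FIN_BIT),
        "resend": bool(seq_byte & RESEND_BIT),
        "number": seq_byte & LOW6_MASK,
        "payload": report[3:3 + length],
    }
```

For example, `build_report(b"hello", 5, fin=True)` yields a LEN byte of 0x85 and a SEQ byte of 0x05, followed by the payload and zero padding.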

A resend request would be repeated until the state of the PowerShell peer changes (because reports are constantly flowing in both directions at the maximum possible speed). Thus the Pi side keeps track of "real state changes" of the Windows end, and its write thread (generating INPUT reports for Windows) only accounts for a new state if the relevant header fields of a received OUTPUT report changed between two reads (which means the state of the Windows side changed). So this "state tracking" ignores the LENGTH field, as it could change between two OUTPUT reports (different payloads) although the ACK state (RESEND bit, ACK number and FIN bit) is the same (no state change).
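A rough Python sketch of this state tracking could look as follows. `ack_state` and `StateTracker` are hypothetical names, and the byte indices assume the report layout given above with the report ID at byte 0 (on the Pi the raw read from the gadget device may not include that byte, in which case the indices shift by one).

```python
def ack_state(report):
    """Extract (fin, resend, ack) from an OUTPUT report, ignoring the length."""
    fin = bool(report[1] & 0x80)     # bit 7 of the LEN byte
    resend = bool(report[2] & 0x40)  # bit 6 of the ACK byte
    ack = report[2] & 0x3F           # bits 5..0 of the ACK byte
    return fin, resend, ack


class StateTracker:
    """Reports a change only when the peer's ACK state really differs."""

    def __init__(self):
        self._last = None  # no OUTPUT report seen yet

    def changed(self, report):
        state = ack_state(report)
        if state == self._last:
            return False  # same ACK state, possibly a different payload length
        self._last = state
        return True
```

A repeated resend request with the same ACK number is then seen only once by the write thread, no matter how many OUTPUT reports carry it.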

mame82 commented 7 years ago

fairly significant resources to keep up, possibly resulting in suspicious activity on the victim being detected.

Right, but my current link layer stack is focused on robust communication and can handle large delays in the read/write threads on both sides (with the impact of data processing overhead in mind). This could be used to lower CPU load by introducing (conditional) delays (at the cost of a decrease in maximum transfer speed). So yes, I'm not sending reports on demand, but continuously ... I see HID reports as the electrical medium in terms of the ISO/OSI model. This means power is continuously flowing, but not necessarily with useful payload. This in turn ensures that header data (or state data) is exchanged instantly.

mame82 commented 7 years ago

Looking at the Win 7 results again, it seems that FileStream.read is blocking FileStream.write somehow. Maybe these methods are synchronized on .NET 3.5?! I won't test whether this could be circumvented by accessing the File object directly (which should allow overlapped read/write), as this would involve too much additional CSharp code. Anyway, I'm happy with the current implementation and will refocus on polishing and constructing a clean interface to upper layers. I'm still thinking about putting a SOCKS5 proxy on the Python end to let PowerShell establish the requested TCP connections. This would be a scenario handling channel splitting on an upper layer, and therefore a clean and robust interface has to be provided to interface with my link layer. Creating a layer interface is easy with object-oriented Python, but again a mess with PS < 3.0, and I'm aiming at PS 2.0.

The design idea is to wrap up the link layer in a PSCustom object, which receives a device file in its constructor and provides a read and a write method for LARGE data streams (thread creation, fragmentation and report loss handling should be done internally). As this idea involves even more code, it has to be placed in stage2, and a simplified protocol will be used to deliver stage1. This involves even more code on the Raspberry side, as the server has to handle two different protocols, but I believe it is worth the effort.

mame82 commented 7 years ago

One more addition. As shown in report layout comments, I'm planning to use a FIN bit. Its purpose is to mark the end of fragmented streams. A start flag isn't needed as the first report with a payload (length field > 0) starts a stream and all succeeding reports are concatenated until FIN bit is set in the terminating report.

My former approach used an empty report to terminate fragmented streams, which drops transfer speed even more. In real-world usage many streams fit into a single report (for example a directory listing in a shell, where each output line is interpreted as a single stream). This led to sending an empty report after each line in my old implementation, producing way too much overhead.

Are you aware of a simple PS 2.0 compatible way to create objects with custom member functions (inline CSharp code is a no-go for me)? The only option I found so far is a PSCustom object.
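The FIN-based fragmentation can be sketched in a few lines of Python. `fragment` and `Reassembler` are hypothetical names, and treating reports with an empty payload as idle filler is an assumption based on the description above, not something taken from the linked code.

```python
MAX_PAYLOAD = 62  # payload bytes per report


def fragment(stream):
    """Split a byte stream into (payload, fin) pairs; only the last has fin set."""
    chunks = [stream[i:i + MAX_PAYLOAD] for i in range(0, len(stream), MAX_PAYLOAD)]
    return [(chunk, index == len(chunks) - 1) for index, chunk in enumerate(chunks)]


class Reassembler:
    """Concatenates payloads until a report with the FIN bit arrives."""

    def __init__(self):
        self._parts = []

    def feed(self, payload, fin):
        """Return the complete stream once FIN is seen, otherwise None."""
        if payload:
            self._parts.append(payload)  # empty payloads are idle reports
        if fin and self._parts:
            stream = b"".join(self._parts)
            self._parts = []
            return stream
        return None
```

A stream of up to 62 bytes becomes a single report with FIN set, so a short shell output line costs one report instead of two.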

mame82 commented 7 years ago

Last comment on the current implementation: the sequence number range in use is 0 to 31. This allows tracking 32 reports, which is the maximum size of my output buffer on the Pi. This in turn matches the maximum size of the Windows FileStream input buffer that I observed. As this is my maximum, SEQ/ACK never consume more than 5 bits, leaving room to use the remaining bits as flags. The flags are extracted with cheap binary and/or/nand operations.
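To illustrate how a 32-slot output buffer and a 5-bit sequence number fit together, here is a hedged Python sketch of a resend buffer for the sending side. `ResendBuffer` and its methods are hypothetical, not the structure used in the actual code, and one simplification is assumed: a request for the next unsent SEQ is treated as "nothing to resend".

```python
SEQ_MASK = 0x1F  # 5 bits, sequence numbers 0..31


class ResendBuffer:
    """Keeps the last 32 sent reports, indexed by their sequence number."""

    def __init__(self):
        self._slots = [None] * (SEQ_MASK + 1)
        self._next_seq = 0

    def store(self, report):
        """Remember a report that is about to be sent and return its SEQ."""
        seq = self._next_seq
        self._slots[seq] = report
        self._next_seq = (seq + 1) & SEQ_MASK  # wrap from 31 back to 0
        return seq

    def replay_from(self, seq):
        """Return the stored reports from seq up to the most recent one, in order."""
        reports = []
        seq &= SEQ_MASK
        while seq != self._next_seq:
            if self._slots[seq] is not None:  # skip slots never filled
                reports.append(self._slots[seq])
            seq = (seq + 1) & SEQ_MASK
        return reports
```

Because the buffer has exactly as many slots as there are sequence numbers, the masked SEQ doubles as the slot index and no separate lookup table is needed.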

mame82 commented 7 years ago

My current code achieves rates > 40 KByte/s in both directions (full duplex) on Win10 as well as on Win7 (I mistakenly included Win7 console I/O in the first speed tests).

I've started to migrate to an object-oriented approach to build a clean interface for my (now so-called) LinkLayer.

Here's the test code, feel free to use:

https://github.com/mame82/tests/tree/master/fullduplex_fast

mame82 commented 7 years ago

One really important fact (which I hadn't noticed earlier): one has to use two dedicated file descriptors on Linux (although it's the same file) for reading and writing, otherwise the speed halves due to synchronized file access.

This is at least true if python is used.
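In Python this boils down to opening the device file twice, once per direction. The following is a minimal sketch; `open_duplex` is a hypothetical helper name, and the path would be something like the /dev/hidg1 device file mentioned elsewhere in this thread.

```python
import os


def open_duplex(path):
    """Open the same file twice so that reads and writes do not share a descriptor."""
    read_fd = os.open(path, os.O_RDONLY)   # used only by the reader thread
    write_fd = os.open(path, os.O_WRONLY)  # used only by the writer thread
    return read_fd, write_fd
```

The reader thread would then call `os.read(read_fd, ...)` and the writer thread `os.write(write_fd, ...)`, each on its own descriptor.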

mame82 commented 7 years ago

@RoganDawes

I'm done with the FullDuplex LinkLayer implementation. It works nicely on Windows 10; Windows 7 is untested.

The code still needs polishing... Current features:

sync on initial connection from Windows to Linux (fit ACK to SEQ number)
report loss detection (only from Linux to Windows, not needed the other way around)
automatic resend of lost reports
keep up to 32 reports in flight (from Linux to Windows)
fast, automatic fragmentation / defragmentation of large streams to payload size
low header overhead (2 bytes, 62 bytes payload per report)

Performance on Win10 is fantastic:

Windows to Linux: Throughput 56244.6202298 Bytes/s
Linux to Windows: Throughput in 60447,0528938712 bytes/s netto payload data (excluding report loss and resends)

87 % of max from Windows to Pi and 93 % of max from Pi to Windows. That's insane.

Max CPU consumption on Windows (during test): 15 % (Core i7 @ 3,6 GHz)
Max CPU consumption on Pi (during test): 28 %
CPU consumption on Pi: 98 % if the PowerShell client disconnects (caused by this line of code)

The measurement shown above was done by sending from both peers at the same time. Conditions in the test were ideal:

Outbound data is enqueued before time measurement starts (no access to the synchronized queues during data transfer. Of course, in a real-world scenario this happens on a regular basis and would have an impact on speed)
Received data is processed after time measurement stops. Accessing the data during transfer would, again, have an impact on speed (the underlying queues are synchronized with the reader/writer threads). The real-world impact on speed depends on fragmentation. Sending a stream of 6200 bytes needs a single queue access to fetch (no big speed impact). Sending 100 streams of 62 bytes results in 100 queue accesses from the upper processing layer, which slows things down (the same goes for the reader thread itself, as it has to put data into the inbound queue more frequently, which drops the rate even further). Data processing itself has no impact, as it is done in a separate thread.
The stream size which is allowed to be put into the output queues has no hardcoded limit at the moment.

As promised, code is here: https://github.com/mame82/tests/tree/master/fullduplex_fast2

So I'm done with this topic but kept the issue open in case further questions arise. If there are no more questions, please close the issue.

Again, thanks for this intensive and valuable exchange of information on low-level HID development! I hope this can push both of our projects.

mame82 commented 7 years ago

Remarks on provided code at https://github.com/mame82/tests/tree/master/fullduplex_fast2:

Most parts of the source are the implementation of the LinkLayer interface (as a class in Python, as a PSCustomObject in PowerShell).

Start of main code, utilizing the LinkLayer Objects is marked with the following comment in both source files:

#########################
# test of link layer
#########################

So as shown, using the LinkLayer objects is relatively easy (not much code besides the implementation itself).

In order to test, you could replace the CSharp part (responsible for creating a FileStream to the HID device) with your own. In case you use the current CSharp code, you have to replace the Manufacturer and Serial strings (I don't enumerate the device based on VID and PID).

On the Linux end, my device file is /dev/hidg1, which has to be changed to fit your needs, too.

mame82 commented 7 years ago

Hi @RoganDawes, it has been a while and I've made great progress, although my project never seems to get done. Meanwhile I've developed a full-fledged multi-channel backdoor using HID only. It has several network layers and multiple communication channels, and the initial stage is still triggered via HID keyboard (targeting PowerShell). I've moved most of my Windows code to CSharp, because managing multiple threads in PowerShell (especially debugging them) while trying to stay compatible with PS 2.0 (no class inheritance, no classes at all) is a mess. Anyway, my early stage, which gets typed out via HID keyboard, is still PowerShell based, because even the smallest compressed .NET assembly exceeds 4000 chars in size. While working on my stage 1, your "move the window offscreen" idea came to my mind again. Because I still want to avoid inline compiling of CSharp code in the PowerShell part, I've re-implemented your "SetWindowPos" approach in pure PowerShell 2.0. As in the past, I want to share this with you:

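# Native handle of this PowerShell console window
$h = (Get-Process -Id $pid).MainWindowHandle
$ios = [System.Runtime.InteropServices.HandleRef]
$hwnd = New-Object $ios (1, $h)
# -2 = HWND_NOTOPMOST
$insertAfter = New-Object $ios (2, -2)
# Arguments: handle, insert-after handle, x, y, width, height, flags (4 = SWP_NOZORDER)
(([reflection.assembly]::LoadWithPartialName("WindowsBase")).GetType("MS.Win32.UnsafeNativeMethods"))::SetWindowPos($hwnd, $insertAfter, 200, 300, 10, 10, 4)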

Additionally I want to point out my Python helper class, which uses approaches discussed earlier to prepare stage1 code for typing out via HID. As you can see in this class, I'm still using base64-encoded GZip streams, which are converted to PowerShell code on the fly. Thus things like loading custom assemblies without touching disk, initializing variables with code or binary data, etc. became possible.

Currently I'm using my out_PS_IEX_Invoker method, which, as the name implies, relies on Invoke-Expression. I'm not happy with using this cmdlet, because it is one of the commands threat hunting focuses on. So if you have other ideas regarding code invocation from byte arrays or strings in PowerShell, please let me know.

Btw. I've chosen to implement a console-based approach as the frontend for my current HID backdoor (yes, it is a bit meterpreter'ish). One connects to P4wnP1 via SSH and the frontend is embedded into a screen session. The idea behind this approach is to implement a socks4a or socks5 server later on, which could be reached via the same SSH session and relay traffic through the target client. This would be a real airgap bridge.

mame82 commented 7 years ago

@RoganDawes I finished my HID backdoor and added you to the credits https://github.com/mame82/P4wnP1/blob/master/README.md. I hope you are fine with this. My final window hiding uses 'setWindowPos' instead of 'showWindowAsync'.

It can be used to make the window invisible while keeping focus. As P4wnP1 types chars very fast, they run into the STDIN buffer of the target window, which can't be interrupted by user interaction.

The final stage one needs four lines of code to hide the window, and the rest gets typed out in about 2 to 6 seconds (depending on the stage1 type; the pure PowerShell stage is more compact than the .NET assembly version). See my readme for reference.

RoganDawes commented 7 years ago

Very cool! Nice work! And thanks for the shoutout :-)

mame82 commented 7 years ago

Yeah, no problem. This was an inspiring conversation.

Seytonic demoed the payload https://youtu.be/Pft7voW5ui8

The final attack starts at about 5:30 in the video... look closely, the PowerShell window disappears really fast. Stage 2 download and execution has finished when the status changes to "client connected".

I still use your WMI approach to enumerate the HID device (at least in the default version of stage 1).

sensepost / USaBUSe

Discussion - full duplex by splitting HID in / out into separate composite functions (no issue) #15