malxau / yori

Yori is a CMD replacement shell that supports backquotes, job control, and improves tab completion, file matching, aliases, command history, and more.
http://www.malsmith.net/yori/
MIT License
1.23k stars 31 forks source link

Speed of execution degrades significantly using more than 10 pipes in script #85

Closed aleaksunder closed 3 years ago

aleaksunder commented 3 years ago

Hello!

I understand everything has it's limits and not insist to get this into primary goal... this is just a remark!

I am seeing a sharp drop in performance using decent amount of stream input and output.

This is only for example but let's assume we have a text file with such content: replace.txt

0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

So in real life we have a text file encoded with non-english ASCII symbols.

And we need to replace all those 36 values with text and going to do so with following script with a lot of pipes: replace.ys1

ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "0" "-" | REPL -- "1" "-" | REPL -- "2" "-" | REPL -- "3" "-" | REPL -- "4" "-" | REPL -- "5" "-" | REPL -- "6" "-" | REPL -- "7" "-" | REPL -- "8" "-" | REPL -- "9" "-" | REPL -- "A" "-" | REPL -- "B" "-" | REPL -- "C" "-" | REPL -- "D" "-" | REPL -- "E" "-" | REPL -- "F" "-" | REPL -- "G" "-" | REPL -- "H" "-" | REPL -- "I" "-" | REPL -- "J" "-" | REPL -- "K" "-" | REPL -- "L" "-" | REPL -- "M" "-" | REPL -- "N" "-" | REPL -- "O" "-" | REPL -- "P" "-" | REPL -- "Q" "-" | REPL -- "R" "-" | REPL -- "S" "-" | REPL -- "T" "-" | REPL -- "U" "-" | REPL -- "V" "-" | REPL -- "W" "-" | REPL -- "X" "-" | REPL -- "Y" "-" | REPL -- "Z" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"

All works... but: `timethis.exe yori -nouser -c YS "replace.ys1"

Elapsed time:      0:00:11.989
Child CPU time:    0:00:00.156
Child kernel time: 0:00:00.156
Child user time:   0:00:00.000
Tree CPU time:     0:00:00.562
Tree kernel time:  0:00:00.406
Tree user time:    0:00:00.156

12 seconds certainly not a big deal but if process comes to real text file with normal contents speed degrades drastically...

The thing here is if we split those 36 REPL pipes into 36 different lines:

ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "0" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "1" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "2" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "3" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "4" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "5" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "6" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "7" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "8" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "9" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "A" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "B" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "C" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "D" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "E" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "F" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "G" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "H" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "I" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "J" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "K" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "L" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "M" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "N" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "O" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "P" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "Q" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "R" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "S" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "T" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "U" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "V" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "W" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "X" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "Y" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"
ICONV -i ascii -e utf8 -w -- "replace.txt" | REPL -- "Z" "-" | ICONV -i utf8 -e ascii -w | SPONGE -- "replace.txt"

and do that again with: `timethis.exe yori -nouser -c YS "replace.ys1"

Elapsed time:      0:00:01.738
Child CPU time:    0:00:00.436
Child kernel time: 0:00:00.421
Child user time:   0:00:00.015
Tree CPU time:     0:00:02.281
Tree kernel time:  0:00:01.656
Tree user time:    0:00:00.625

So despite there was 36 disk reads 72 encoding conversions and 36 disk writes it is 6 times faster than 1st variant... one problem here is we need to check an ERRORLEVEL after each of 36 lines to ensure that all was correctly.

And by the way in real life with 30 lines in script we get speed something around 5 seconds per file and with one line in script with 30 pipes speed drops to 30 seconds per file!

And this is not about REPL only... I've tried a lot of sequiental pipes around multipe ICONV and it seems it degrades in the same way!

So it is basically 6 times faster everywhere I've seen, tried this with CMD YORI YORIMIN and ONEYORI... and the numbers always almost the same!

And just for example tried this in CMD with external findstr.exe:

CMD> echo "long long long long long long long long long long long long long long long long long long long text"  | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text" | findstr "text"

So there is 128 pipes and nothing seems to degrade significantly... I am basically want to say that as I see this is not due to Windows piping mechanism!

Or maybe you can advice how to speed things up?

P.S. I'm so sorry to be so annoying =) I believe I am doing useful things for this project

malxau commented 3 years ago

Well, the fastest way to speed it up:

C:\src\yoriperf>timethis oneyori.exe -nouser -c test.ys1
Elapsed time:      0:00:00.203
Child CPU time:    0:00:00.203
Child kernel time: 0:00:00.125
Child user time:   0:00:00.078
Tree CPU time:     0:00:00.203
Tree kernel time:  0:00:00.125
Tree user time:    0:00:00.078

(ie., using oneyori for this is 40x faster.)

I learned a few horrifying things about Windows while developing Yori, but I think this issue is hitting at the worst: in Windows, CreatePipe will create a synchronous pipe, and there is no way to know if a synchronous pipe has data available because the expectation of the named pipe system is to wait until there is a blocking write and a blocking read at the same time, then shuttling the data across, then completing the requests. This is bad for two reasons: a) performance is terrible due to context switching so the named pipe system needed to add a buffer which undermines the reasoning here and b) processes can't wait on multiple things simultaneously, which makes it impossible to have a cancelable operation, and threads don't help since a worker thread can't check for data on the pipe and check for any communication from another thread.

So Yori did the best thing it could do. It spins waiting to see if it should cancel or if there's data in the pipe. It can't spin on the CPU (which would destroy performance) so there's a sleep between each spin to let the other process make progress. The length of the sleep increases progressively, because if no data has arrived in several seconds it's unlikely that it will arrive a microsecond later.

But this sequence trips up this logic badly. The first couple of processes make progress; the later processes observe no data is arriving and start sleeping for longer and longer; and when data arrives, they are waiting for tens of milliseconds before noticing that any data is available. This magnifies down the chain, because if one process waits pointlessly for 10 milliseconds everything behind it starts increasing the duration of their sleeps, so the next one waits pointlessly for 20 milliseconds, etc. Amusingly, doing this with oneyori is fast because it won't launch 36 things in parallel - it's executing one command at a time, so there's never a time when a reader is waiting for data for a long time.

I have a tweak for this logic to increase the length of the sleep more slowly, which helps in this scenario, but note this is not a free lunch. Every time the thread wakes up and does nothing, it's slowing down the rest of the system, including the thread that's supposed to write data into the pipe.

Anyway, this algorithm makes it 10x faster, and about half the time compared to the 36 line version, but it's still 4x slower than the serial oneyori version:

diff --git a/lib/lineread.c b/lib/lineread.c
index e985595..2b5271f 100644
--- a/lib/lineread.c
+++ b/lib/lineread.c
@@ -506,7 +506,11 @@ YoriLibReadLineToStringEx(

             Sleep(DelayTime);
             CumulativeDelay += DelayTime;
-            DelayTime *= 2;
+            if (DelayTime < 10) {^M
+                DelayTime++;^M
+            } else {^M
+                DelayTime = DelayTime * 5 / 4;^M
+            }^M
             if (DelayTime > 500) {
                 DelayTime = 500;
             }
aleaksunder commented 3 years ago

Now I get it...

I have tested it with oneyori.exe but get the same results with 12 seconds but that was just because my oneyori.exe was in the same directory with iconv.exe and repl.exe ( due to it was compiled in that directory ) and it processes script with externals, not with builtins! Now I moved oneyori.exe away, tested and it is blazingly fast... and nice to see improvement for that "wait" logic

Let me explain why I use Yori at all and why I love it so much =)

It is simple... It is straightforward... It simply works! Everywhere!!! I use it just for that... You script one time and it works! So I can write script.cmd put it together with x86, x64, arm64 and even arm32 builds and in scenario with oneyori.exe all that setup will be 2 megabytes in size... it is perfect! Since I can be pretty sure that my script.cmd will use native tools on every platform and I simply can avoid all that stuff with folder redirection on x64 windows and x86 tools and much more, I just do not want to remember all that stuff that I can avoid with Yori! This will work even in Windows PE which by design do not support execution of other platform architecture tools! So I basically use Yori as a perfectly portable scripting tools

And absolutely no doubt oneyori.exe is a perfect tool for this case and every case I can imagine... but the one reason I avoid to use it is that oneyori.exe for x86 platform is fakely detected with antivirus software image

Obviously there is no virus at all and this is completely false alarm... but this stops me to use oneyori.exe and prefer to use yori.exe with externals since there is no false alarms on these and by the way only oneyori.exe for x86 is false alarmed, other builds for x64, arm64 and arm32 is clean by every antivirus on VirusTotal!

So basically I can not send my script along with oneyori.exe builds to someone and avoid situation where have to say that there it no virus and this is false alarm... this is simply confusing people!

I've dig to it a couple of weeks ago and find an opinion that DLL-injection may cause this false alarms. Sorry, I can not provide you a link, since this was long time ago! But obviously there is a reason why that false alarms exists... and I simply see yuishook.dll file in Yori's directory! I do not know maybe this DLL is injected... so anyway this is just a theory and assumption

So maybe we can add alternate build of Yori with all builtins that will not be false alarmed and maybe consist of two files: twoyori.exe and twoyori.dll =)

aleaksunder commented 3 years ago

And by the way I can help to determine what causes this false alarm... my logic here is simple: yori.exe is clean, but oneyori.exe is false alarmed... so basically there is a thing that oneyori.exe get compiled with and this causes a false alarm! So if you can tell me the file or maybe files where imports to oneyori.exe is defined I can do this step by step: remove one imported built-in command, compile and check by VirusTotal... this is not hard for may and may be can help

malxau commented 3 years ago

Oneyori is constructed in sh\yorifull.c and sh\yorifull.lst . The first file defines builtin command functions and has a table of strings mapping to them, and the second defines how to link the code.

That said, I'm not seeing virustotal flag anything in the 1.40 oneyori.exe binary, but it really looks like none of the scanners have seen it before: https://www.virustotal.com/gui/file/acf01971ac3813548ec4ded974522b972cae2adf6400196cb6c68f68d31d945c/detection

As far as I know, the most problematic parts for virus scanners is the update code (in ysetup, ypm, and yget.) This is very boring code that wraps wininet.dll in the simplest possible way to download updates. Unfortunately, virus downloaders also wrap wininet.dll in the simplest possible way to download code, so it's hard for virus scanners to tell the difference. They appear to do this with thumbprint heuristics that describe patterns in "good" and "bad" versions of what is essentially the same code. I wouldn't be surprised at all if using a different compiler results in different outcomes.

aleaksunder commented 3 years ago

Thaks, jus figured out how yorifull.c and yorifull.lst corresponds... and now I do not think that there problems only with code that uses wininet... so I've just removed all from yorifull.lst and corresponding lines from yorifull.c and left only this in yorifull.lst:

 ..\assoc\builtins.lib

Than compiled and uploaded oneyori.exe to VirusTotal and get no false positives at all... then I've added yattrib to yorifull.lst and yorifull.c, compiled uploaded and get false alarm... removed yatrrib and add cal... this time there was no false positive! So now I am going to add all commands one by one...

Adding attrib, cab, clip to yorifull.lst and yorifull.c results in .exe that gets false alarm... strange enough is that if for example I remove all and compile oneyori.exe with attrib only it gets no alarms, but if I add assoc and attrib then there is alarm...

At this time I have only this in my yorifull.lst:

..\assoc\builtins.lib
..\cal\builtins.lib
..\charmap\builtins.lib
..\cls\builtins.lib

And compiled exe is not detected as virus on VirusTotal at all... so I am going to do it for all and will post the result later

aleaksunder commented 3 years ago

This is completely weird and now I feel that I am not in the right direction... so when it come to ECHO and I added it to yorifull.c and yorifull.lst compiled with it and tested I get a false alarm! And than I thought that ECHO is already builtin in yori.exe and things become really strange since there is no false alarm on yori.exe... and things becomes weirder and weirder with below:

..\assoc\builtins.lib
..\cal\builtins.lib
..\charmap\builtins.lib
..\cls\builtins.lib

no false alarm

..\which\builtins.lib
..\wininfo\builtins.lib
..\winpos\builtins.lib

no false alarm

but together:

..\assoc\builtins.lib
..\cal\builtins.lib
..\charmap\builtins.lib
..\cls\builtins.lib
..\which\builtins.lib
..\wininfo\builtins.lib
..\winpos\builtins.lib

There IS false alarm!

I build Yori with Visual Studio 2019 Developer Command Prompt v16.8.2 and v14.16.27043.0 compiler tools

aleaksunder commented 3 years ago

So finally I gave up =) No logic at all...

..\assoc\builtins.lib

no false alarm

..\attrib\builtins.lib

no false alarm

..\assoc\builtins.lib
..\attrib\builtins.lib

FALSE ALARM =)

Now I've tested your builds which is downloaded with YPM and oneyori.exe for x86 goes through the VirusTotal without any alarms... So I'm just going to switch to your builds instead of building it on my own, the only question I have here is how frequent download-daily is updated? Current files dated 18 Jan 2021... so the question is: Is currently published source on github is download-daily builds?

malxau commented 3 years ago

I'm not surprised. The problem with heuristic detection based on previously observed "known bad" or "known good" binaries is that small, unrelated changes cause a binary to look less like a "known good" binary and more likely to be flagged.

As you say, download-daily aren't really daily. They're the builds that I'm testing, which often comprise things that are in github but also changes that aren't in github yet and I'm trying to validate they work well enough to publish. But note that I have the same AV problem you do - because these builds add things like attrib, there's no guarantee that a heuristic based virus scanner won't complain about them.

If you want to generate builds that are very similar to mine, I'm using Visual C++ 6.0 SP5 with the processor pack (which contains MASM) for 32 bit builds, and the Windows 2003 SP1 SDK for 64 bit builds. These older compilers ensure the binaries will run on anything and can't depend on newer technologies without being manually coded to check for them.