openalm / Extension-UtilitiesPack

Release Management utility tasks

Tokenization Poor Performance With Large Number of Token Instances #115

Open aolszowka opened 1 year ago

aolszowka commented 1 year ago

The Tokenizer appears to perform very poorly when you have a large number of replacement token instances.

For example, this line:

https://github.com/openalm/Extension-UtilitiesPack/blob/4747cae037612c5f3e41bdf6e6aa3b285cbc29eb/Utilites/Tokenizer/tokenize-ps3.ps1#L106

Returns 8,600 match instances on a file I am attempting to have it process.

Based on the logic of this loop:

https://github.com/openalm/Extension-UtilitiesPack/blob/4747cae037612c5f3e41bdf6e6aa3b285cbc29eb/Utilites/Tokenizer/tokenize-ps3.ps1#L107

This will attempt to perform the replacement operation 8,600 times. If you look at the code, it loops through the file row by row, attempting to replace every variable that is found.

This is inefficient; instead, the line above should gather only the distinct values, like so (the following tries to follow the existing PowerShell idioms and is not 100% efficient):

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } | Sort-Object | Get-Unique

Note that, per the documentation, Get-Unique requires the input list to be sorted, which is why Sort-Object is called beforehand.

Running the pipeline above on that same file returns a mere 38 distinct values to replace, roughly two orders of magnitude fewer than before.
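For anyone who wants to compare the two counts on their own file, a minimal sketch along these lines should work ($tempFile and $regex here are placeholders for the script's temp file path and token pattern, not the script's actual variable names):

# Sketch only: count total matches vs. distinct token values for a given file.
# $tempFile and $regex are placeholders for the script's temp file path and token pattern.
$allMatches = Select-String -Path $tempFile -Pattern $regex -AllMatches | ForEach-Object { $_.Matches }
$distinct = $allMatches | ForEach-Object { $_.Value } | Sort-Object -Unique
"{0} total matches, {1} distinct tokens" -f $allMatches.Count, $distinct.Count

As a side note, Sort-Object -Unique collapses the Sort-Object | Get-Unique pair into a single call and gives the same result here.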

There are still other performance issues, for example the row-by-row replacement for each variable, as seen here:

https://github.com/openalm/Extension-UtilitiesPack/blob/4747cae037612c5f3e41bdf6e6aa3b285cbc29eb/Utilites/Tokenizer/tokenize-ps3.ps1#L145-L149

This becomes painful as the number of lines in the file increases; however, the simple fix above would resolve the most obvious performance issue.
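For completeness, a rough sketch of what a whole-file replacement could look like, using the $distinct list from the sketch above and replacing each distinct token once across the entire content instead of once per line. Get-ValueForToken is a hypothetical stand-in for however the script actually resolves a token to its value:

# Sketch only: read the file once, replace each distinct token across the whole
# content, and write the result back. Get-ValueForToken is a hypothetical helper
# standing in for the script's actual variable lookup.
$content = [System.IO.File]::ReadAllText($tempFile)
foreach ($token in $distinct) {
    $value = Get-ValueForToken $token
    if ($null -ne $value) {
        $content = $content.Replace($token, $value)
    }
}
[System.IO.File]::WriteAllText($tempFile, $content)

This keeps the number of string replacements proportional to the number of distinct tokens rather than lines times occurrences (a real change would also want to preserve the file's original encoding when writing it back).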