lythx / trakman

Trackmania Forever server controller written in TypeScript
https://trakman.ptrk.eu
MIT License

update the regex for strip() #133

Closed wsrvn closed 2 years ago

wsrvn commented 2 years ago

while our solution works just fine for the majority of cases, there still appear to be a few outliers that manage to break the matching. note that the algorithm may be unoptimised too, so that's a thing to look out for as well.

for reference, here's the (fully commented) presumably working version of the function, courtesy of xymph (WARNING PHP):

// (inside the regex strings, comments start with # as required by the x flag; everywhere else they start with //)
function stripFormatting($input, $for_tm = true) {
    return
        // Replace all occurrences of a null character back with a pair of dollar signs for displaying in TM, or a single dollar for log messages etc.
        str_replace("\0", ($for_tm ? '$$' : '$'),
            // Replace links (introduced in TMU)
            preg_replace(
                '/
                # Strip TMF H, L & P links by stripping everything between each square bracket pair until another $H, $L or $P sequence (or EoS) is found, this allows a $H to close a $L and vice versa, as does the game
                \\$[hlp](.*?)(?:\\[.*?\\](.*?))*(?:\\$[hlp]|$)
                /ixu',
                // Keep both capturing groups (the text before and after the bracketed part) if present
                '$1$2',
                // Replace various patterns beginning with an unescaped dollar
                preg_replace(
                    '/
                    # Match a single dollar sign and any of the following:
                    \\$
                    (?:
                        # Strip color codes by matching any hexadecimal character and any other two characters following it (except $)
                        [0-9a-f][^$][^$]
                        # Strip any incomplete color codes by matching any hexadecimal character followed by another character (except $)
                        |[0-9a-f][^$]
                        # Strip any single style code (including an invisible UTF8 char) that is not an H, L or P link or a bracket ($[ and $])
                        |[^][hlp]
                        # Strip the dollar sign if it is followed by [ or ], but do not strip the brackets themselves
                        |(?=[][])
                        # Strip the dollar sign if it is at the end of the string
                        |$
                    )
                    # Ignore alphabet case, ignore whitespace in pattern & use UTF-8 mode
                    /ixu',
                    // Replace any matches with nothing (i.e. strip matches)
                    '',
                    // Replace all occurrences of dollar sign pairs with a null character
                    str_replace('$$', "\0", $input)
                )
            )
        )
    ;
}

ignoring the comments and """fancy""" format (aka i love wasting space in my editor):

function stripColours($input, $forTM = true) {
    return 
    str_replace("\0", ($forTM ? '$$' : '$'), 
    preg_replace('/\\$[hlp](.*?)(?:\\[.*?\\](.*?))*(?:\\$[hlp]|$)/ixu', '$1$2', 
    preg_replace('/\\$(?:[0-9a-f][^$][^$]|[0-9a-f][^$]|[^][hlp]|(?=[][])|$)/ixu', '', 
    str_replace('$$', "\0", $input))));
}

forTM can obviously be ignored. it is unused even in xaseco itself XD.
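
for our side, a rough TypeScript port of the PHP above could look like the sketch below. this is not what trakman currently does, just a direct translation: the names `stripFormatting`, `STYLE_CODES`, `LINKS` and `PLACEHOLDER` are made up for illustration, and it assumes player names never contain a real `\0`. two things differ from PHP: the brackets inside the character classes have to be escaped, and the `$$` swap uses split/join because `'$$'` in a JS `replace()` replacement string means a single literal `$`.

```typescript
// Hypothetical port of the PHP function quoted above; all names are illustrative.
const PLACEHOLDER = '\0'

// Colour codes, incomplete colour codes, single style codes, a dollar before a
// bracket, and a trailing dollar. $H/$L/$P are deliberately left for LINKS.
const STYLE_CODES = /\$(?:[0-9a-f][^$][^$]|[0-9a-f][^$]|[^\]\[hlp]|(?=[\]\[])|$)/giu

// $H/$L/$P links: drop the [bracketed] part, keep the text around it.
const LINKS = /\$[hlp](.*?)(?:\[.*?\](.*?))*(?:\$[hlp]|$)/giu

function stripFormatting(input: string, forTM: boolean = true): string {
  return input
    // 1. Hide escaped dollars so the regexes never see them
    .split('$$').join(PLACEHOLDER)
    // 2. Strip everything that starts with a single dollar (except links)
    .replace(STYLE_CODES, '')
    // 3. Strip links, keeping both capture groups
    .replace(LINKS, '$1$2')
    // 4. Put the escaped dollars back ("$$" for in-game text, "$" for logs)
    .split(PLACEHOLDER).join(forTM ? '$$' : '$')
}

console.log(stripFormatting('$B$FFFUber Bug'))        // Uber Bug
console.log(stripFormatting('dollar$$dollar'))        // dollar$$dollar
console.log(stripFormatting('dollar$$dollar', false)) // dollar$dollar
```

one known difference: PCRE's `$` also matches right before a trailing newline, while the JS `$` (without the `m` flag) only matches at the very end, so inputs ending in `\n` may behave slightly differently.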

NOTE that the function above does not remove the other formatting codes; for that, \$[SHWIPLONGTZ] should be used (usually as the last OR alternative)

here's a list of entries that should be tested against the regex to ensure it's working (use regexr or something similar, don't overcomplicate your life lol):

$0CFωα$0CFғ$0CFα$i$000$i | $3CFNexogg GF
$L[goo.gl/UJy69u]$fffмσтι$0C6$fffи|ғ
$H[manialinker]$Hheythere$P[manialinker]
$000[$5f6CMC$000]$b$f60GuessWho
$B$FFFUber Bug
TM$5$6/$0$ARU DES
|$fadǵ$faeƒ$faf.$i$fffsabø$f|
$i$adf$wP$fffazeh
$<thefunnyname$>
dollar$$dollar

expected output:

ωαғα | Nexogg GF
мσтιи|ғ
heythere
[CMC]GuessWho
Uber Bug
TM DES
|ǵƒ.sabø
Pazeh
thefunnyname
dollar$$dollar

current output:

ωαғα | Nexogg GF
мσтιи|ғ
heythere
[CMC]$bGuessWho
$BUber Bug
TM$5$0 DES
|ǵƒ.sabø
Pazeh
$<thefunnyname$>
dollar$lar
wsrvn commented 2 years ago

by adding |\$[^\$]{1} as the last OR, we change the output into

ωαғα | Nexogg GF
мσтιи|ғ
heythere
[CMC]GuessWho
Uber Bug
TM DES
|ǵƒ.sabø
Pazeh
thefunnyname
dollar$lar

which is a significant improvement. the only thing left now is.. matching the double dollar, and i have no clue how. in xaseco, this is done by replacing it with a null character (\0) before the regexes run, and putting it back in the last operation. guess we could do that as well.
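
a minimal sketch of that null-character trick, written as a wrapper so it can sit around whatever regex we end up with. `withEscapedDollars` and `naiveStrip` are hypothetical names, and `naiveStrip` is only a stand-in built from the `\$[^\$]` alternative mentioned above, not the real strip(). it also assumes the input never contains an actual `\0`.

```typescript
// Hypothetical helper: shield literal "$$" from any stripping function.
function withEscapedDollars(input: string, strip: (s: string) => string): string {
  const placeholder = '\0'
  // Swap every "$$" for the placeholder before the regex runs...
  const shielded = input.split('$$').join(placeholder)
  // ...and swap it back afterwards. split/join is used on purpose:
  // in String.prototype.replace, a '$$' replacement string means ONE dollar.
  return strip(shielded).split(placeholder).join('$$')
}

// Stand-in for the real strip(): removes a dollar plus one non-dollar character.
const naiveStrip = (s: string): string => s.replace(/\$[^$]/gu, '')

console.log(withEscapedDollars('dollar$$dollar', naiveStrip))   // dollar$$dollar
console.log(withEscapedDollars('$<thefunnyname$>', naiveStrip)) // thefunnyname
```

keeping the swap outside the regex means the pattern itself never has to know about escaped dollars, which is the same design xaseco uses.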