Hi, Windows is not really UTF-16, but UCS-2. Windows does not enforce well-formed surrogate pairs (as in: it accepts any sequence of WCHARs).
Does that answer your question?
Not really. What I'm saying is that while you can of course use e.g. isPrefixOf and isSuffixOf from Data.ByteString.Short on UTF-16 (and UCS-2 for that matter) without getting surprising results, this is not generally true for isInfixOf.
If you use Data.ByteString.Short.isInfixOf for substring matching on UTF-16 you can get false positives (the same is true for e.g. UTF-32, but notably not for UTF-8).
Example:
ghci> import System.OsString.Internal.Types
ghci> foo <- System.OsPath.Windows.encodeFS "λ" -- UTF-16: 0xbb 0x03
ghci> bar <- System.OsPath.Windows.encodeFS "믒괃" -- UTF-16: 0xd2 0xbb 0x03 0xad
ghci> System.OsPath.Data.ByteString.Short.Word16.isInfixOf (getWindowsString foo) (getWindowsString bar)
True
That's why I was surprised to see that System.OsPath.Data.ByteString.Short.Word16.isInfixOf is simply a re-export of Data.ByteString.Short.isInfixOf.
So my question still remains: is this an oversight, or is there some rationale behind this?
Yeah, I think you're right. We have to respect Word16 boundaries.
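To make "respecting Word16 boundaries" concrete, here is a minimal sketch, not the library's actual fix. It assumes strict ByteString instead of ShortByteString for brevity, and the name isInfixOfWord16 is hypothetical. It only accepts matches that start at an even byte offset:

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- | Hypothetical helper (not part of bytestring or os-string):
-- substring test that only accepts matches starting at an even
-- byte offset, i.e. on a Word16 boundary.
isInfixOfWord16 :: ByteString -> ByteString -> Bool
isInfixOfWord16 needle haystack =
  any (\off -> needle `B.isPrefixOf` B.drop off haystack)
      [0, 2 .. B.length haystack]

main :: IO ()
main = do
  let needle   = B.pack [0xbb, 0x03]              -- "λ" in UTF-16LE
      haystack = B.pack [0xd2, 0xbb, 0x03, 0xad]  -- "믒괃" in UTF-16LE
  print (needle `B.isInfixOf` haystack)      -- True: match at byte offset 1
  print (needle `isInfixOfWord16` haystack)  -- False: no aligned match
```

This is a naive scan that trades speed for clarity; a real implementation would presumably want something faster than testing every even offset.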
Ok, two more observations:
So yes, this needs to be fixed, but the impact is very likely low.
I think consequently, breakSubstring is also busted.
Yes, exactly.
From when I checked last time, I think only those two are problematic.
@Bodigrim I'm not too familiar with the breakSubstring algorithm and from a little tinkering I couldn't figure out how to make it work for Word16 boundaries.
Do you have advice?
I tried using a proper definition of breakByte, but it seems there's more to it.
You can run Data.ByteString.breakSubstring and check the length of the prefix. If it's even, all good, you are done. If it's odd, slice the input string past the prefix and run Data.ByteString.breakSubstring again.
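A minimal sketch of that retry loop, assuming strict ByteString; breakSubstringWord16 is a hypothetical name, not an existing function. One detail is an assumption on my part: after an odd-offset match the search resumes one byte past the match start, since resuming exactly at the match would find the same match again.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- | Hypothetical helper: like 'B.breakSubstring', but only accepts
-- matches that start at an even byte offset (a Word16 boundary).
breakSubstringWord16 :: ByteString -> ByteString -> (ByteString, ByteString)
breakSubstringWord16 needle haystack = go 0
  where
    go off
      | B.null post  = (haystack, B.empty)        -- no (further) match
      | even matchAt = B.splitAt matchAt haystack -- aligned match: done
      | otherwise    = go (matchAt + 1)           -- odd offset: retry past it
      where
        (pre, post) = B.breakSubstring needle (B.drop off haystack)
        matchAt     = off + B.length pre          -- offset in the full input

main :: IO ()
main = do
  let needle = B.pack [0xbb, 0x03]
  -- The only match is at byte offset 1, so it is rejected.
  print (breakSubstringWord16 needle (B.pack [0xd2, 0xbb, 0x03, 0xad]))
  -- The first match is misaligned (offset 1), the second is aligned (offset 4).
  print (breakSubstringWord16 needle (B.pack [0xd2, 0xbb, 0x03, 0xad, 0xbb, 0x03]))
```

Each retry starts at an even offset, and the offset strictly increases, so the loop terminates once the remaining input contains no match.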
I haven't tried anything, but from reading the code, it looks like this is just a re-export from bytestring. Consequently, I think it can yield false positives if you were to use it for substring matching of UTF-16. Is this intentional?