haskell / filepath

Haskell FilePath core library
BSD 3-Clause "New" or "Revised" License
66 stars 33 forks source link

`System.OsPath.Data.ByteString.Short.Word16.isInfixOf` looks dodgy #195

Closed sol closed 12 months ago

sol commented 1 year ago

I haven't tried anything, but from reading the code, it looks like this is just a re-export from bytestring. Consequently, I think it can yield false positives if you were to use it for substring matching of UTF-16.

Is this by intention?

hasufell commented 1 year ago

Hi, windows is not really UTF-16, but UCS-2. Windows does not enforce well formed surrogate pairs (as in: it accepts any sequence of WCHARs).

Does that answer your question?

sol commented 1 year ago

Does that answer your question?

Not really. What I'm saying is that while you can of course use e.g. isPrefixOf and isSuffixOf from Data.ByteString.Short on UTF-16 (and UCS-2 for that matter) without getting surprising results, this is not generally true for isInfixOf.

If you use Data.ByteString.Short.isInfixOf for substring matching on UTF-16 you can get false positives (the same is true for e.g. UTF-32, but notably not for UTF-8).

Example:

ghci> import System.OsString.Internal.Types
ghci> foo <- System.OsPath.Windows.encodeFS "λ"   -- UTF-16:      0xbb 0x03
ghci> bar <- System.OsPath.Windows.encodeFS "믒괃" -- UTF-16: 0xd2 0xbb 0x03 0xad
ghci> System.OsPath.Data.ByteString.Short.Word16.isInfixOf (getWindowsString foo) (getWindowsString bar)
True

That's why I was surprised to see that System.OsPath.Data.ByteString.Short.Word16.isInfixOf is simply a re-export of Data.ByteString.Short.isInfixOf.

So my question still remains: Is this an oversight, or is there some rational behind this.

hasufell commented 1 year ago

Yeah, I think you're right. We have to respect Word16 boundaries.

hasufell commented 1 year ago

Ok, two more observations:

  1. This function is not used anywhere by filepath
  2. It's also considered internal, although not marked as such

So yes, this needs to be fixed, but the impact is very likely low.

hasufell commented 1 year ago

I think consequently, breakSubstring is also busted.

sol commented 1 year ago

Yes, exactly.

From when I checked last time, I think only those two are problematic.

hasufell commented 1 year ago

@Bodigrim I'm not too familiar with the breakSubstring algorithm and from a little tinkering I couldn't figure out how to make it work for Word16 boundaries.

Do you have advice?

https://github.com/haskell/bytestring/blob/dac5675ea9dae639758e21314a49b15fa90e2bd7/Data/ByteString.hs#L1584-L1634

I tried using a proper definition of breakByte, but it seems there's more to it.

Bodigrim commented 1 year ago

You can run Data.ByteString.breakSubstring and check the length of the prefix. If it's even, all good, you are done. If it's odd, slice the input string past the prefix and run Data.ByteString.breakSubstring again.

hasufell commented 12 months ago

This is released in https://hackage.haskell.org/package/filepath-1.4.100.4 commit https://github.com/haskell/filepath/commit/367f6bffc158ef1a9055fb876e23447636853aa4