haskell / text

Haskell library for space- and time-efficient operations over Unicode text.
http://hackage.haskell.org/package/text
BSD 2-Clause "Simplified" License
407 stars 159 forks source link

Should C functions be called FFI-safely on big inputs? #327

Open phadej opened 3 years ago

phadej commented 3 years ago

Something like

diff --git a/src/Data/Text/Encoding.hs b/src/Data/Text/Encoding.hs
index 239e15e..9b6a01c 100644
--- a/src/Data/Text/Encoding.hs
+++ b/src/Data/Text/Encoding.hs
@@ -436,7 +436,10 @@ encodeUtf8 (Text arr off len)
   fp <- mallocByteString (len*3) -- see https://github.com/haskell/text/issues/194 for why len*3 is enough
   withForeignPtr fp $ \ptr ->
     with ptr $ \destPtr -> do
-      c_encode_utf8 destPtr (A.aBA arr) (fromIntegral off) (fromIntegral len)
+      -- When ByteArray# is large enough, it is pinned.
+      -- Then we can use safe FFI calls, which we indeed want for big inputs!
+      (if len > 4096 then c_encode_utf8_safe else c_encode_utf8_unsafe)
+        destPtr (A.aBA arr) (fromIntegral off) (fromIntegral len)
       newDest <- peek destPtr
       let utf8len = newDest `minusPtr` ptr
       if utf8len >= len `shiftR` 1
@@ -535,5 +538,8 @@ foreign import ccall unsafe "_hs_text_decode_utf8_state" c_decode_utf8_with_stat
 foreign import ccall unsafe "_hs_text_decode_latin1" c_decode_latin1
     :: MutableByteArray# s -> Ptr Word8 -> Ptr Word8 -> IO ()

-foreign import ccall unsafe "_hs_text_encode_utf8" c_encode_utf8
+foreign import ccall unsafe "_hs_text_encode_utf8" c_encode_utf8_unsafe
+    :: Ptr (Ptr Word8) -> ByteArray# -> CSize -> CSize -> IO ()
+
+foreign import ccall safe "_hs_text_encode_utf8" c_encode_utf8_safe
     :: Ptr (Ptr Word8) -> ByteArray# -> CSize -> CSize -> IO ()

The benchmark suite doesn't have exercises for multithreaded programs.

As far as I understand the code now, encoding or decoding big Text will prevent GC from running. Is that an issue, hard to tell.

E.g. cryptohash-sha256 branches on size: https://hackage.haskell.org/package/cryptohash-sha256-0.11.102.0/docs/src/Crypto.Hash.SHA256.html#c_sha256_update, (my diff above is probably written better in that way too)

Bodigrim commented 3 years ago

Cf. https://github.com/haskell/bytestring/issues/282