Closed MaxDaten closed 2 years ago
Can confirm (Linux, ICU 56.1). For me it always happens after ~6300 processed words, no matter what part of the text I take.
It seems to have something to do with garbage collection. Exhibit A, doesn't fail:
{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
import qualified Data.Text.IO as T
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU
import qualified Data.Text.ICU.Break as IO
import Unsafe.Coerce
import Data.Text.Foreign

main = do
  file <- T.readFile "test.txt"
  breaks' (ICU.breakWord "en-US") file
  return ()

breaks' :: forall a. ICU.Breaker a -> T.Text -> IO ()
breaks' b t = do
  bi :: IO.BreakIterator IO.Word <-
    IO.clone (unsafeCoerce (b :: ICU.Breaker a))
  IO.setText bi t
  let go p = do
        mix <- IO.next bi
        case mix of
          Nothing -> return ()
          Just n -> do
            s <- IO.getStatus bi
            let d = n-p
                u = dropWord16 p t
            print (n, p, takeWord16 d u)
            go n
  go =<< IO.first bi
Exhibit B, fails unless run with +RTS -A27M (if you use 26M, a few letters are chopped off the last word; 24M makes the whole last word break; 22M breaks the last several sentences, etc.):
breaks' :: forall a. ICU.Breaker a -> T.Text -> IO [I16]
breaks' b t = do
  bi :: IO.BreakIterator IO.Word <-
    IO.clone (unsafeCoerce (b :: ICU.Breaker a))
  IO.setText bi t
  let go p = do
        mix <- IO.next bi
        case mix of
          Nothing -> return []
          Just n -> do
            s <- IO.getStatus bi
            let d = n-p
                u = dropWord16 p t
            print (n, p, takeWord16 d u)
            (n:) `fmap` go n
  go =<< IO.first bi
Should be fixed in text-icu-0.8.0. BreakIterator did not keep a reference to ICU's BreakIterator text, so it became garbage after GC and ICU returned breaks for some random text.
https://github.com/haskell/text-icu/commit/d9b00c6b574682f08e8be7a8d7db764bd048b881#diff-de0c07896d55f089d8b59b81ebc256ac60f253019705a7c4e0d9ca3a042f479dL24
I suppose that #4 had the same cause.
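To illustrate the kind of fix described above, here is a minimal sketch of the keep-alive idea using only base. It is not the actual text-icu code: the Iter type and the newIter and nextUnit functions are hypothetical stand-ins. The point is that the Haskell wrapper stores the owner of the buffer, so the GC cannot reclaim the text while the iterator still reads from it.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word16)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrArray, withForeignPtr)
import Foreign.Marshal.Array (pokeArray)
import Foreign.Storable (peekElemOff)

-- Hypothetical stand-in for a break iterator (NOT text-icu's real type).
-- A C-side iterator would only remember a raw pointer into the text, so
-- the wrapper also holds the owning ForeignPtr to keep the buffer reachable.
data Iter = Iter
  { iterBuf :: IORef (ForeignPtr Word16, Int) -- owner of the text and its length
  , iterPos :: IORef Int                      -- index of the next code unit
  }

-- Copy the code units into a GC-managed buffer and remember its owner.
newIter :: [Word16] -> IO Iter
newIter ws = do
  fp <- mallocForeignPtrArray (length ws)
  withForeignPtr fp $ \p -> pokeArray p ws
  Iter <$> newIORef (fp, length ws) <*> newIORef 0

-- Read the next code unit, or Nothing at the end of the text.
-- withForeignPtr keeps the buffer alive for the duration of the read.
nextUnit :: Iter -> IO (Maybe Word16)
nextUnit it = do
  (fp, len) <- readIORef (iterBuf it)
  pos <- readIORef (iterPos it)
  if pos >= len
    then return Nothing
    else do
      w <- withForeignPtr fp $ \p -> peekElemOff p pos
      writeIORef (iterPos it) (pos + 1)
      return (Just w)
```

If the wrapper kept only a raw Ptr instead of the ForeignPtr, nothing would stop the buffer from being collected between calls, which would match the symptoms reported here (breaks computed over stale memory).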
Hi,
first of all: thank you for the library.
I bumped into a strange problem with word breaking on a large amount of text.
With test.txt (just copied and pasted from the Wikipedia article on Haskell) and this snippet:
Somewhere in the middle, ICU starts breaking on character boundaries. Here is the critical transition:
After this point, nearly every character is isolated. But not always: sometimes characters are bundled pairwise.
Note: I first experienced this bug with German text extracted from epub chapters. The behavior seems a bit chaotic: mainly characters are separated, but sometimes words or parts of words survive.
I'm using icu4c/56.1 on OS X, installed via brew install icu4c.