haskell / text-icu

This package provides the Haskell Data.Text.ICU library, for performing complex manipulation of Unicode text.
BSD 2-Clause "Simplified" License

Corrupted word breaking with fairly large text #19

MaxDaten closed 2 years ago

MaxDaten commented 8 years ago

Hi,

First of all: thank you for the library.

I ran into a strange problem with word breaking on a large amount of text.

With test.txt (just copied and pasted from the Wikipedia article on Haskell) and this snippet:

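{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.IO as T
import qualified Data.Text.ICU as ICU

-- Print the text of every word-level break found in test.txt.
main :: IO ()
main = print . fmap ICU.brkBreak . ICU.breaks (ICU.breakWord "en-US") =<< T.readFile "test.txt"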

Somewhere in the middle of the text, ICU starts breaking at character boundaries instead of word boundaries. Here is the critical transition:

[...,"properties"," ","of"," ","programs","\n","Cayenne",","," ","with"," ","dependent"," ","types","\n","\937mega",","," ","strict"," ","and"," ","more","\n","Elm",","," ","a"," ","functional"," ","language"," ","to"," ","create"," ","web"," ","front","-","end"," ","apps",","," ","no"," ","s","u","p","p","o","r","t"," ","f","o","r"," ","h","i","gh","e","r","-","k","i","n","d","e","d"," ","t","y","p","e","s","\n","J","V","M","-","b","a","s","e","d",":","\n","\n","F","r","eg","e",","," ","a"," ","H","a","s","k",...]

After this point, nearly every character is isolated. But not always: sometimes characters are bundled pairwise.

Note: I first hit this bug with German text extracted from epub chapters. The behavior seems a bit chaotic: mostly the characters are separated, but sometimes whole words or parts of words survive.
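For anyone trying to pin down where the output degrades, here is a minimal sketch of a helper that finds the first run of consecutive single-letter tokens in a token list. The names `isSingleLetter` and `firstLetterRun` are made up for illustration, and it works on plain `String` tokens so it needs only `base`; with text-icu you would first convert each `ICU.brkBreak` result with `T.unpack`.

```haskell
import Data.Char (isAlpha)
import Data.List (findIndex, tails)

-- True for a token that is exactly one letter, e.g. "s" or "u".
isSingleLetter :: String -> Bool
isSingleLetter [c] = isAlpha c
isSingleLetter _   = False

-- Index of the first position where k consecutive tokens are all
-- single letters, or Nothing if no such run exists.
-- (Hypothetical helper, not part of text-icu.)
firstLetterRun :: Int -> [String] -> Maybe Int
firstLetterRun k toks = findIndex startsRun (tails toks)
  where
    startsRun rest = length (takeWhile isSingleLetter (take k rest)) == k

main :: IO ()
main =
  -- Tokens taken from the transition shown above.
  print (firstLetterRun 3 ["no", " ", "s", "u", "p", "p", "o", "r", "t", " ", "f", "o", "r"])
  -- prints: Just 2
```

A run length of 3 is an arbitrary threshold: legitimate one-letter words such as "a" are separated by whitespace tokens, so they do not form a run.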

I'm using icu4c/56.1 on OS X installed via brew install icu4c.

neongreen commented 8 years ago

Can confirm (Linux, ICU 56.1). For me it always happens after ~6300 processed words, no matter what part of the text I take.

It seems to have something to do with garbage collection. Exhibit A, doesn't fail:

{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}

import qualified Data.Text.IO as T
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU
import qualified Data.Text.ICU.Break as IO
import Unsafe.Coerce
import Data.Text.Foreign

main = do
  file <- T.readFile "test.txt"
  breaks' (ICU.breakWord "en-US") file
  return ()

breaks' :: forall a. ICU.Breaker a -> T.Text -> IO ()
breaks' b t = do
  bi :: IO.BreakIterator IO.Word <-
    IO.clone (unsafeCoerce (b :: ICU.Breaker a))
  IO.setText bi t
  let go p = do
        mix <- IO.next bi
        case mix of
          Nothing -> return ()
          Just n -> do
            s <- IO.getStatus bi
            let d = n-p
                u = dropWord16 p t
            print (n, p, takeWord16 d u)
            go n
  go =<< IO.first bi

Exhibit B, fails unless run with +RTS -A27M (if you use 26M a few letters are chopped off the last word, 24M makes the whole last word break, 22M breaks the last several sentences, etc):

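-- Drop-in replacement for breaks' in Exhibit A (same pragmas, imports and main).
-- The only change: go now returns the list of break positions instead of ().
breaks' :: forall a. ICU.Breaker a -> T.Text -> IO [I16]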
breaks' b t = do
  bi :: IO.BreakIterator IO.Word <-
    IO.clone (unsafeCoerce (b :: ICU.Breaker a))
  IO.setText bi t
  let go p = do
        mix <- IO.next bi
        case mix of
          Nothing -> return []
          Just n -> do
            s <- IO.getStatus bi
            let d = n-p
                u = dropWord16 p t
            print (n, p, takeWord16 d u)
            (n:) `fmap` go n
  go =<< IO.first bi

vshabanov commented 2 years ago

Should be fixed in text-icu-0.8.0.

The Haskell-side BreakIterator did not keep a reference to the text it had passed to ICU's BreakIterator, so that text became garbage after a GC and ICU returned breaks for some random text. https://github.com/haskell/text-icu/commit/d9b00c6b574682f08e8be7a8d7db764bd048b881#diff-de0c07896d55f089d8b59b81ebc256ac60f253019705a7c4e0d9ca3a042f479dL24

I suppose that #4 had the same cause.
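To illustrate the kind of fix described above, here is a rough sketch of the general pattern: the iterator value itself holds on to the buffer it was given, so the buffer stays reachable for as long as the iterator does. This is not the actual text-icu code. `Iter`, `newIter`, `setText` and `readBack` are hypothetical names, there is no C library involved, and only `base` is used.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word16)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrArray, withForeignPtr)
import Foreign.Marshal.Array (peekArray, pokeArray)
import System.Mem (performGC)

-- Hypothetical stand-in for a break iterator. A C-side iterator would
-- only hold a raw pointer into the text, so the Haskell wrapper stores
-- the owning ForeignPtr to keep the buffer reachable.
data Iter = Iter
  { iterText :: IORef (ForeignPtr Word16)
  , iterLen  :: IORef Int
  }

newIter :: IO Iter
newIter = do
  noText <- mallocForeignPtrArray 0
  Iter <$> newIORef noText <*> newIORef 0

-- Copy the code units into a fresh buffer and store the buffer in the
-- iterator, so it lives at least as long as the iterator.
setText :: Iter -> [Word16] -> IO ()
setText it units = do
  buf <- mallocForeignPtrArray (length units)
  withForeignPtr buf $ \p -> pokeArray p units
  writeIORef (iterText it) buf
  writeIORef (iterLen it) (length units)

-- Read the stored code units back out of the buffer.
readBack :: Iter -> IO [Word16]
readBack it = do
  buf <- readIORef (iterText it)
  n   <- readIORef (iterLen it)
  withForeignPtr buf $ \p -> peekArray n p

main :: IO ()
main = do
  it <- newIter
  setText it (map (fromIntegral . fromEnum) "word breaking")
  performGC  -- the buffer survives because the iterator references it
  units <- readBack it
  putStrLn (map (toEnum . fromIntegral) units)
```

If `setText` only handed the raw pointer to foreign code and dropped `buf`, nothing on the Haskell side would keep the buffer alive after it returns, which matches the symptoms reported in this issue.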