ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Twitter tokenizing logic broken by upcoming ICU 72 breaking change ('@' no longer splits) #82

Closed MichaelChirico closed 1 year ago

MichaelChirico commented 2 years ago

See the release notes:

https://icu.unicode.org/download/72

In particular:

the committee decided that an at sign (@) should not cause word breaks, as in email addresses. (CLDR-15767) — (ICU-22112)

That will break a test assuming the opposite:

https://github.com/ropensci/tokenizers/blob/16cd6c7e04675409b381eac0bf0041cf261cfd48/tests/testthat/test-tokenize_tweets.R#L27-L30

Minimal example illustrating the change, run with stringi built against ICU >= 72 and ICU < 72:

# UNDER ICU >=72
stringi::stri_split_boundaries("@abc", type = "word")
[[1]]
[1] "@abc"

# UNDER ICU < 72
stringi::stri_split_boundaries("@abc", type = "word")
[[1]]
[1] "@"   "abc"

That logic is used in tokenizers here:

https://github.com/ropensci/tokenizers/blob/16cd6c7e04675409b381eac0bf0041cf261cfd48/R/tokenize_tweets.R#L70

This may not break immediately because, if I'm not mistaken, @gagolews bundles versioned copies of ICU with the package, so it may not be urgent; still, it would be better to move to a more robust approach sooner rather than later.
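
For illustration only, one ICU-version-independent direction (not the current tokenizers implementation, just a sketch) would be to split on whitespace first, so that @mentions and #hashtags stay intact regardless of ICU's word-break rules:

# rough sketch only: further handling of punctuation, URLs, etc. would still be needed
stringi::stri_split_regex("hello @abc #rstats", "\\s+", omit_empty = TRUE)
# [[1]]
# [1] "hello"   "@abc"    "#rstats"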

gagolews commented 2 years ago

Thanks for spotting this.

Note that stri_*_boundaries also supports custom sets of word-break rules: a whole rule definition file can be passed via opts_brkiter's type argument (see the sketch below).

The changed word.txt is:

icu4c/source/data/brkitr/rules/word.txt

New: https://github.com/unicode-org/icu/blob/49d192fefe09fcc38547203487cf4e63d2dad61f/icu4c/source/data/brkitr/rules/word.txt

Old: https://github.com/unicode-org/icu/blob/af9ef2650be5d91ba2ff7daa77e23f22209a509c/icu4c/source/data/brkitr/rules/word.txt
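
For example, a minimal sketch assuming the pre-ICU-72 word.txt rules have been saved locally as old_word_rules.txt (a hypothetical path):

library(stringi)

# read the whole rule definition file (hypothetical local copy of the old word.txt)
old_rules <- paste(readLines("old_word_rules.txt", warn = FALSE), collapse = "\n")

# passing the full rule set as the break-iterator type should reproduce the old
# splitting behaviour regardless of which ICU version stringi is built against
stri_split_boundaries("@abc", opts_brkiter = stri_opts_brkiter(type = old_rules))
# expected under the old rules: "@" "abc"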

HTH

MichaelChirico commented 2 years ago

Here's what I'm seeing as the current behavior on ASCII characters:

library(stringi)
ascii_chars = sapply(as.raw(1:127), rawToChar)
# ICU < 72
paste0("a", ascii_chars, "a ", ascii_chars, "a a", ascii_chars) |>
  stri_split_boundaries(type = "word") |>
  lengths() |>
  setNames(ascii_chars)
# \001 \002 \003 \004 \005 \006   \a   \b   \t   \n   \v   \f   \r \016 \017 \020 
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9 
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037      
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    8 
#    !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /    0 
#    9    9    9    9    9    9    7    9    9    9    9    9    9    7    9    5 
#    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?    @ 
#    5    5    5    5    5    5    5    5    5    7    9    9    9    9    9    9 
#    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    Q    R    S    T    U    V    W    X    Y    Z    [   \\    ]    ^    _    ` 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    5    9 
#    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    q    r    s    t    u    v    w    x    y    z    {    |    }    ~ \177 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    9 

# ICU >= 72
# \001 \002 \003 \004 \005 \006   \a   \b   \t   \n   \v   \f   \r \016 \017 \020 
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9 
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037      
#    9    9    9    9    9    9    9    9    9    9    9    9    9    9    9    8 
#    !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /    0 
#    9    9    9    9    9    9    7    9    9    9    9    9    9    7    9    5 
#    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?    @ 
#    5    5    5    5    5    5    5    5    5    9    9    9    9    9    9    5 
#    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    Q    R    S    T    U    V    W    X    Y    Z    [   \\    ]    ^    _    ` 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    5    9 
#    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p 
#    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
#    q    r    s    t    u    v    w    x    y    z    {    |    }    ~ \177 
#    5    5    5    5    5    5    5    5    5    5    9    9    9    9    9 

Highlighting the differences:

:  7 --> 9  (':' between letters now produces a break)
@  9 --> 5  ('@' no longer produces a break)

I'm not sure how well-tested (or even intentional) the current implementation is; I assume # and @ are the most important characters, and # is unaffected.

lmullen commented 2 years ago

Thank you all for bringing this to my attention. CRAN has set a deadline of 04 Dec 2022 to fix this.

@kbenoit: I believe the tokenizer for tweets was your contribution. Would you be willing, please, to submit a fix?

kbenoit commented 2 years ago

Sure will do.

lmullen commented 2 years ago

Thanks, Ken. Much appreciated.

lmullen commented 1 year ago

I have not received a patch from @kbenoit in time and CRAN will be pulling the package and packages that depend on it very shortly. I will be removing the tokenize_tweets() function since it is unmaintained and too specialized for this package anyway. I anticipate pushing up a fix to CRAN shortly.

I will be writing to the package maintainers affected to let them know to watch this issue.

kbenoit commented 1 year ago

Sorry Lincoln, the end of our semester has not left me enough time to address the issue. But I think your solution is best. There are better ways to address this now, including smarter, customised ICU rules via stringi.

lmullen commented 1 year ago

@kbenoit Thanks for confirming, Ken.

lmullen commented 1 year ago

After running reverse dependency checks, I note two potential problems.

@kbenoit: It appears that quanteda has a call to tokenize_tweets() but only in a test file. I assume that is a relatively minor fix for you.

@juliasilge It appears that tidytext wraps tokenize_tweets() and has some tests to check that functionality. I am sorry this will entail changes on your end, but I don't see any way around it.

kbenoit commented 1 year ago

The other way around it, which would avoid breaking changes, would be for me to fix tokenize_tweets() tomorrow; that would be the better solution, since some people may be using it. Removing a function without first deprecating it is not very good practice anyway. Give me a day and I'll get around to this.
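
(For reference, a minimal sketch of the usual soft-deprecation pattern in R; this is illustrative only, not the change that was actually made:)

tokenize_tweets <- function(x, ...) {
  .Deprecated(msg = "tokenize_tweets() is deprecated and will be removed in a future release.")
  # ... existing implementation would continue here ...
}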

lmullen commented 1 year ago

We already agreed that you would make those changes, @kbenoit. I agree that it is bad practice to remove a function without deprecating it, but it is worse to have a package archived, and we have run out of time because I did not receive the promised fix. I've already spent as much time waiting for this as I am going to spend, including doing all the checks of other packages today. I am sending this fix to CRAN.

kbenoit commented 1 year ago

Fair enough @lmullen. The CRAN deadlines don't always come at times I can manage, and I have put out about three CRAN-related fires lately (two caused by changes in the Matrix package). That's the cost of avoiding the relative chaos of PyPI, I guess.

juliasilge commented 1 year ago

The new version of tidytext with token = "tweets" deprecated (v0.4.0) is now on CRAN; hopefully the new version of tokenizers can get on CRAN without more trouble soon. 🤞

Hope you all have a joyful holiday season, without any more CRAN surprises!

lmullen commented 1 year ago

@juliasilge Thank you.

@kbenoit I completely agree about the unreasonableness of CRAN's timing. (And don't get me started about their inability to follow HTTP redirects.) This is a worse outcome, which I regret, but it's an unfortunate consequence of CRAN's policies. Thanks for your willingness to work on this.