Closed MichaelChirico closed 1 year ago
Thanks for spotting this.
Note that str_*_boundaries
also support custom sets of work-break rules (whole definition files passed via opts_brkiter
- type
)
The changed word.txt is:
icu4c/source/data/brkitr/rules/word.txt
HTH
Here's what I'm seeing as the current behavior on ASCII characters:
ascii_chars = sapply(as.raw(1:127), rawToChar)
# ICU < 72
paste0("a", ascii_chars, "a ", ascii_chars, "a a", ascii_chars) |>
stri_split_boundaries(type = "word") |>
lengths() |>
setNames(ascii_chars)
# \001 \002 \003 \004 \005 \006 \a \b \t \n \v \f \r \016 \017 \020
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 8
# ! " # $ % & ' ( ) * + , - . / 0
# 9 9 9 9 9 9 7 9 9 9 9 9 9 7 9 5
# 1 2 3 4 5 6 7 8 9 : ; < = > ? @
# 5 5 5 5 5 5 5 5 5 7 9 9 9 9 9 9
# A B C D E F G H I J K L M N O P
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# Q R S T U V W X Y Z [ \\ ] ^ _ `
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 5 9
# a b c d e f g h i j k l m n o p
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# q r s t u v w x y z { | } ~ \177
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 9
# ICU >= 72
# \001 \002 \003 \004 \005 \006 \a \b \t \n \v \f \r \016 \017 \020
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
# \021 \022 \023 \024 \025 \026 \027 \030 \031 \032 \033 \034 \035 \036 \037
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 8
# ! " # $ % & ' ( ) * + , - . / 0
# 9 9 9 9 9 9 7 9 9 9 9 9 9 7 9 5
# 1 2 3 4 5 6 7 8 9 : ; < = > ? @
# 5 5 5 5 5 5 5 5 5 9 9 9 9 9 9 5
# A B C D E F G H I J K L M N O P
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# Q R S T U V W X Y Z [ \\ ] ^ _ `
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 5 9
# a b c d e f g h i j k l m n o p
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# q r s t u v w x y z { | } ~ \177
# 5 5 5 5 5 5 5 5 5 5 9 9 9 9 9
Highlighting the differences:
: 7 --> 9
@ 9 --> 5
I'm not sure how well-tested (or even intentional) the current implementation is, I assume #
and @
are the most important characters; #
is unaffected.
Thank you all for bringing this to my attention. CRAN has set a deadline of 04 Dec 2022 to fix this.
@kbenoit: I believe the tokenizer for tweets was your contribution. Would you be willing, please, to submit a fix?
Sure will do.
Thanks, Ken. Much appreciated.
I have not received a patch from @kbenoit in time and CRAN will be pulling the package and packages that depend on it very shortly. I will be removing the tokenize_tweets()
function since it is unmaintained and too specialized for this package anyway. I anticipate pushing up a fix to CRAN shortly.
I will be writing to the package maintainers affected to let them know to watch this issue.
Sorry Lincoln, end of our semester has not provided me enough time to address the issue. But I think your solution is best. There are better ways we to address this now including smarter, customised ICU rules via stringi.
@kbenoit Thanks for confirming, Ken.
After running reverse dependency checks, I note two potential problems.
@kbenoit: It appears that quanteda has a call to tokenize_tweets()
but only in a test file. I assume that is a relatively minor fix for you.
@juliasilge It appears that tidytext wraps tokenize_tweets()
and has some tests to check that functionality. I am sorry this will entail changes on your end, but I don't see any way around it.
The other way around it, and to avoid breaking changes, would be for me to fix tokenize_tweets()
tomorrow, which would be the better solution, since some people may be using this. To remove a function without first deprecating it is not very good practice anyway. Give me a day and I'll get around to this.
We already agreed that you would make those changes, @kbenoit. I agree that is bad practice to remove a function without deprecating it, but it is worse to have a package archived, and we have run out of time because I did not receive the promised fix. I've already spent as much time waiting for this as I am going to spend, including doing all the checks of other packages today. I am sending this fix to CRAN.
Fair enough @lmullen. The CRAN deadlines don't always come at times I can manage, and I have put out about three CRAN-related fires lately (two caused by changes in the Matrix package). That's the cost of avoiding the relative chaos of PyPI, I guess.
The new version of tidytext with token = "tweets"
deprecated (v0.4.0) is now on CRAN; hopefully the new version of tokenizers can get on CRAN without more trouble soon. 🤞
Hope you all have a joyful holiday season, without any more CRAN surprises!
@juliasilge Thank you.
@kbenoit I completely agree about the unreasonableness of CRAN's timing. (And don't get me started about their inability to follow HTTP redirects.) This is a worse outcome, which I regret, but it's an unfortunate outcome of CRAN's policies. Thanks for your willingness to work on this.
See the release notes:
https://icu.unicode.org/download/72
In particular:
That will break a test assuming the opposite:
https://github.com/ropensci/tokenizers/blob/16cd6c7e04675409b381eac0bf0041cf261cfd48/tests/testthat/test-tokenize_tweets.R#L27-L30
Minimal example illustrating the change on
stringi
installed with ICU versions <72, >= 72:That logic is used in
tokenizers
here:https://github.com/ropensci/tokenizers/blob/16cd6c7e04675409b381eac0bf0041cf261cfd48/R/tokenize_tweets.R#L70
This may not break soon because IINM @gagolews is bundling versioned copies of ICU along with the package, so it may not be urgent, but it may be better to use a more robust approach earlier than later.