[Tokenization] LIWC splits on periods in abbreviations but liwcalike() does not

Dear Professor Benoit, thanks for this great package!

When I tested liwcalike() on Washington's inaugural speeches of 1789 and 1793, its output matched that of LIWC2015 (as shown in Pennebaker's LIWC tutorial 2 at https://youtu.be/fYLobCxHP5w?t=204), provided I set remove_punct = TRUE and remove_symbols = TRUE. I then ran a second test with the JonBenet Ramsey ransom note and found that the output of liwcalike() differs from that of LIWC2015 (as shown in tutorial 3 at https://youtu.be/YqgBViXWKoM?t=79). It seems that LIWC splits on periods (full stops) in abbreviations, whereas quanteda's tokenizers do not. I infer this because the personal pronoun analysis in the tutorial (https://youtu.be/YqgBViXWKoM?t=200) counts the "I" from "F.B.I." in the ransom note.

Since I can only refer to Pennebaker's LIWC tutorials (or try LIWC22), I cannot verify the comparison exactly. I have therefore created liwcalike_testing_data2.csv, which contains the Washington inaugural speech texts and the JonBenet Ramsey ransom note, so that anyone with the LIWC2015 software can compare the results.

To keep quanteda's default word tokenizer but still split on periods, the text can be preprocessed with gsub() to add whitespace after periods. I wrote the following code:

df <- readr::read_csv("liwcalike_testing_data2.csv")

# Add whitespace after periods when there is no whitespace after them and the next character is a letter or digit
x <- gsub("\\.(?=[A-Za-z0-9])", ". ", df$text, perl = TRUE)

# If LIWC2015 dictionary file is not available, follow liwcalike() to use tokens(x, split_hyphens = TRUE, ...) and count the words using ntoken()
quanteda::ntoken(quanteda::tokens(x, split_hyphens = TRUE, remove_punct = TRUE, remove_symbols = TRUE))

# If LIWC2015 dictionary file is available, read the dictionary file (e.g., liwc2015.dic) and then use liwcalike()
liwc_dict <- quanteda::dictionary(file = "liwc2015.dic")
quanteda.dictionaries::liwcalike(x, dictionary = liwc_dict, remove_punct = TRUE, remove_symbols = TRUE)
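
To illustrate what the gsub() preprocessing does to an abbreviation, here is a minimal base-R sketch. The sample sentence and the helper crude_tokens() are hypothetical and only for illustration; crude_tokens() is a rough stand-in that strips punctuation and splits on whitespace, not quanteda's tokenizer.

```r
# Hypothetical sample sentence containing a dotted abbreviation
raw <- "The F.B.I. cannot help you."

# Same pattern as above: insert a space after a period that is
# immediately followed by a letter or digit
spaced <- gsub("\\.(?=[A-Za-z0-9])", ". ", raw, perl = TRUE)
print(spaced)

# Crude stand-in tokenizer (an assumption, not quanteda's behaviour):
# drop punctuation, then split on whitespace
crude_tokens <- function(s) {
  strsplit(gsub("[[:punct:]]", "", s), "\\s+")[[1]]
}

print(crude_tokens(raw))     # the abbreviation stays together as one token
print(crude_tokens(spaced))  # "F", "B" and "I" become separate tokens
```

One caveat with this pattern is that it also matches periods inside decimals, so "3.66" would become "3. 66". If that is not wanted, restricting the lookahead to letters, as in "\\.(?=[A-Za-z])", is one possible variation.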

While investigating whether LIWC splits on periods, I also found that LIWC counts digits as numbers even though the dictionary file contains only words (e.g., billion* and five). liwcalike() found only 1 number (the word 'two') in the JonBenet Ramsey ransom note, whereas Pennebaker's LIWC2015 tutorial shows 3.66% of the ransom note (word count 382) under the Numbers category, which implies that 14 numbers were found (0.0366 × 382 ≈ 14). It therefore seems that LIWC2015 also splits on commas and checks whether each token is numeric when counting numbers.
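
The arithmetic behind the figure of 14, and the kind of counting I am guessing at, can be sketched in base R. This is only my hypothesis about LIWC2015's behaviour, not its documented algorithm; the helper count_numeric_tokens() and the sample sentence are made up for illustration.

```r
# Numbers implied by the tutorial: 3.66% of a 382-word text
word_count <- 382
numbers_pct <- 3.66
implied_numbers <- round(numbers_pct / 100 * word_count)
print(implied_numbers)

# Hypothetical helper: split on commas, drop remaining punctuation and
# symbols, then count tokens made up of digits only
count_numeric_tokens <- function(text) {
  text <- gsub(",", " ", text, fixed = TRUE)           # treat commas as separators
  text <- gsub("[^[:alnum:][:space:]]", "", text)     # strip "$", ".", etc.
  toks <- strsplit(trimws(text), "\\s+")[[1]]
  sum(grepl("^[0-9]+$", toks))
}

# Made-up sentence: "118" and "000" (split at the comma) and "10" are
# digit-only tokens; the word "two" is not counted by this helper
n_digits <- count_numeric_tokens("Withdraw $118,000 and bring two bags by 10 am.")
print(n_digits)
```

Under this assumption, an amount written with a thousands separator contributes two numeric tokens, which would help explain why the count in the tutorial is so much higher than the single dictionary match found by liwcalike().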

I hope what I found and shared is helpful. Thanks again for this great package and have a Happy New Year!

Many thanks, Erica
