kbenoit / LIWCalike

R package to extend quanteda to mimic LIWC

Errors in LIWCalike, esp Dic #6

Closed kbenoit closed 6 years ago

kbenoit commented 6 years ago

From m.bellmann@t-online.de:

LIWCalike seems to be a useful addition to Quanteda, and it also provides a useful interface to other R packages, especially for advanced statistics. With LIWCalike there are real advantages for establishing a smooth text-analysis workflow.

I myself have been using the German standalone version of LIWC2007 for many years, because I work mainly with texts in German. I am quite familiar with the dictionaries available for this purpose. LIWC2015 is an improvement but still lacks support for German, so I have stayed with LIWC2007.

When studying the website https://github.com/kbenoit/LIWCalike, I came across something strange that puzzles me. As can be seen in the results of the example there, Dic values of 166.67, 142.86 and 300.00 appear. Such results are not possible in the logic of LIWC: Dic and the other categories are percentages of the hits for the respective category relative to the total number of words, so they cannot exceed 100.
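A minimal sketch of that ratio, with made-up numbers (for illustration only, not LIWC's or LIWCalike's actual code):

    dic_hits    <- 83    # hypothetical number of words matched by any dictionary entry
    total_words <- 100   # hypothetical total number of words in the text
    100 * dic_hits / total_words   # Dic = 83; under this definition it cannot exceed 100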

This oddity prompted me to make two comparisons between LIWC and LIWCalike:

  1. I analyzed 28 German-language texts from my practice with the German LIWC2007 dictionary, once with LIWC and once with LIWCalike. These texts (in fact, text fragments) have been thoroughly analyzed for other purposes with other methods, among others with Quanteda and tidytext. The result of the comparison can be found in the attached file LIWC_LIWCalike_comparison_1.xlsx. The LIWC values seem to me to be in order, also in comparison with the results from the German-language LIWC2001 dictionary. The LIWCalike values deviate from these in most cases and exceed 100% in many categories. I initially guessed that this was due to the formatting of the dictionary, which LIWCalike cannot read correctly, but converting the dictionary into the cat format (WordStat) and recalculating with LIWCalike leads to the same unsatisfactory results.

  2. On the other hand, I analyzed the 58 texts from your data_corpus_inaugural with the English LIWC2001 dictionary, again with both LIWC and LIWCalike. The result is found in LIWC_LIWCalike_Comparison_2.xlsx. Some (small) differences between LIWC and LIWCalike may be explained by the fact that I had to extract the texts from data_corpus_inaugural for LIWC (because I do not have the individual text files of the inaugural speeches). However, the values for the Dic category are not plausible in any case.

After these two tests I am quite sure that the LIWCalike code must contain one or more errors. Of course it is also possible that the formatting of the dictionaries is hiding errors; however, since I have developed, tested and applied a series of customized LIWC dictionaries for my work, I rule out this source of error here.

Perhaps my little review is helpful for you and your team in diving into the LIWCalike code again. As I said, it would have many advantages if this alternative to LIWC could be used reliably; then, however, the results of both methods should be sufficiently consistent.

Attachments: LIWC_LIWCalike_Comparison_2.xlsx, LIWC_LIWCalike_comparison_1.xlsx

HaiyanLW commented 6 years ago

@kbenoit Could you run branch issue-7 on data_corpus_inaugural and paste the result? Dic should be related to the sum of the dictionary counts, and currently on master it is not summed up. (LIWC doesn't have a Linux version, so I can't run the comparison myself.)
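A minimal sketch of the per-document sum being described here, assuming the counts come from dfm_lookup() (the objects mycorpus and myDict are placeholders for illustration, not the actual LIWCalike internals):

    library(quanteda)
    toks_all <- tokens(mycorpus)                                  # `mycorpus` is a placeholder corpus
    dfm_keys <- dfm_lookup(dfm(toks_all), dictionary = myDict)    # counts per key of a placeholder dictionary
    dic_sum  <- rowSums(dfm_keys)                                 # summed dictionary hits per document
    dic_pct  <- 100 * dic_sum / ntoken(toks_all)                  # Dic-style percentage
    # note: a token that matches several keys is counted once per key here,
    # which is exactly the over-counting discussed in the next comment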

kbenoit commented 6 years ago

Done, see https://github.com/kbenoit/LIWCalike/blob/master/tests/data/LIWC2015_Results_Washington.csv.

HaiyanLW commented 6 years ago

As I mentioned, the error in Dic may be caused by counting a word multiple times when that word appears in multiple entries of the dictionary, for example:

> txt <- c("The red-shirted lawyer gave her ex-boyfriend $300 out of pity :(.")
> myDict <- quanteda::dictionary(list(people = c("lawyer", "boyfriend", "red"),
+                                     colorFixed = "red",
+                                     colorGlob = "r*d",
+                                     mwe = "out of"))

Here "red" will be counted 3 times for Dic. This could be solved if we generate the dfm against a dictionary with all unique entries: I can flatten the dictionary and remove the duplicated "red", but I am wondering how to deal with the glob pattern?
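A small sketch of that over-counting with a key-level lookup, using the myDict above and a hypothetical one-sentence text (txt2) so the matches are unambiguous:

    library(quanteda)
    txt2  <- "red is a red word"    # hypothetical text in which "red" occurs twice
    toks2 <- tokens(txt2)
    # key-level lookup: each occurrence of "red" matches people, colorFixed and colorGlob
    dfm(tokens_lookup(toks2, dictionary = myDict, valuetype = "glob"))
    # summing these key counts gives 6 hits for only 2 matching tokens,
    # which is how a Dic percentage can exceed 100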

kbenoit commented 6 years ago

Probably the easiest is to count the Dic matches this way:

toks <- tokens(txt)
ntype(toks)
text1 
   14 

tokens_select(toks, myDict)
tokens from 1 document.
text1 :
[1] "red-shirted" "lawyer"      "out"         "of"

ntype(tokens_select(toks, myDict))
text1 
    4 

The ntype() for the match is equal to the number of terms from the tokens that matched any dictionary value, but counts them only a single time (unique words, or “types”).
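If that unique-match count is then expressed against the total token count, a possible follow-on would be (a sketch, not necessarily the exact formula LIWCalike should use for Dic):

    # Dic-style percentage from the unique matches above;
    # each matching word type is counted once, so the result cannot exceed 100
    100 * ntype(tokens_select(toks, myDict)) / ntoken(toks)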

Ken
