Compare liwc_orig and liwc_alt on a random sample (n=50)

Yvonne-Han commented 4 years ago

@iangow I've created a new issue for the code and results related to comparing liwc_orig and liwc_alt on a randomly selected sample of n=50 utterances from speaker_data.

Yvonne-Han commented 4 years ago

@iangow Let's talk more about this tomorrow during our Zoom/Skype meeting. In the meantime, here's a brief summary of what I've done:

I've uploaded a notebook (see here) that does the following:

Generate a random sample (n=50)
Export each speaker_text as a separate text file for import into the LIWC software (the files are stored in sample_50)
Calculate liwc_alt results for this n=50 sample
Import liwc_orig results for the same sample
Compare liwc_alt with liwc_orig and store the results as .csv files (see here for the raw differences and here for the percentage differences)
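For anyone who wants the gist without opening the notebook, here's a rough sketch of the sampling, export, and comparison steps. This is not the notebook's actual code: the function names, the fixed seed, the `utterance_XX.txt` file naming, and the dict-of-counts layout are all assumptions made for illustration, and it uses only the standard library.

```python
import random
import tempfile
from pathlib import Path


def draw_sample(utterances, n=50, seed=2020):
    """Pick n utterances at random (hypothetical helper).

    The seed value is arbitrary; fixing it just makes the sample reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(utterances, n)


def export_texts(sample, out_dir="sample_50"):
    """Write each utterance to its own .txt file so LIWC can import it.

    The file naming scheme is an assumption, not the notebook's.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(sample):
        path = out / f"utterance_{i:02d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


def raw_difference(alt, orig):
    """Per-category difference (alt - orig) for one file.

    `alt` and `orig` are assumed to map category name -> count;
    a category missing from one side is treated as 0.
    """
    categories = sorted(set(alt) | set(orig))
    return {c: alt.get(c, 0) - orig.get(c, 0) for c in categories}


if __name__ == "__main__":
    # Toy data standing in for speaker_data.
    utterances = [f"utterance number {i}" for i in range(200)]
    sample = draw_sample(utterances)
    with tempfile.TemporaryDirectory() as tmp:
        files = export_texts(sample, Path(tmp) / "sample_50")
        print(len(files), "files written")
    print(raw_difference({"posemo": 3, "negemo": 1}, {"posemo": 2, "negemo": 1}))
```

In this sketch the liwc_orig side would come from whatever the LIWC software exports for the same files; reading that export is left out because its exact layout isn't described here.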

Yvonne-Han commented 4 years ago

Across all 50 files × 73 categories/file = 3650 file-category values, liwc_alt differs from liwc_orig in 52 of them (52/3650 ≈ 1.4%), so I guess it's not too bad? 🤦

Also, I took a look at the difference/total_word_count results (i.e., here), and it seems that the largest error we get for any category is ~2% of the file's word count, with most of them being <1%.
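To make the two numbers above concrete, here's a minimal sketch of how they can be computed. The helper names and the list-of-dicts layout (one dict of category -> count per file) are assumptions for illustration, not what the notebook necessarily does.

```python
def mismatch_rate(alt_rows, orig_rows):
    """Count (file, category) values where liwc_alt and liwc_orig disagree.

    Each row is assumed to be a dict mapping category -> count for one file,
    with the two lists in the same file order.
    Returns (mismatches, total, mismatches / total).
    """
    total = 0
    mismatches = 0
    for alt, orig in zip(alt_rows, orig_rows):
        for category, orig_count in orig.items():
            total += 1
            if alt.get(category) != orig_count:
                mismatches += 1
    return mismatches, total, mismatches / total


def relative_error(alt_count, orig_count, total_word_count):
    """Absolute difference for one category, scaled by the file's word count."""
    return abs(alt_count - orig_count) / total_word_count


# The headline figure from the comment above: 52 mismatches out of 50 * 73 values.
total_values = 50 * 73
print(total_values)                # 3650
print(f"{52 / total_values:.1%}")  # 1.4%
```

Scaling by total_word_count rather than by the category's own count keeps categories with very small counts from producing huge percentage differences.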

iangow / se_features, issue #37