iangow / se_features

Linguistic features derived from StreetEvents

Sort out LIWC apostrophes #34

Closed Yvonne-Han closed 4 years ago

Yvonne-Han commented 4 years ago

I have to deal with apostrophes again, as I believe they are causing some (or maybe all) of the differences between liwc_orig and liwc_alt.

Note that one needs to be careful with ', as there's a "curly" version of that character that might be treated differently (and my regular expression code may swap out the curly one for the "straight" one; there would be a separate line for that in the code). This may cause the apparent inconsistency.

_Originally posted by @iangow in https://github.com/iangow/se_features/issues/18#issuecomment-517072302_

BTW, I think this is the line that converts "curly" apostrophes:

https://github.com/iangow/se_features/blob/32e2db2660f506265c890fd7286ae06a9371b864/liwc_2015/liwc_functions.py#L39

_Originally posted by @iangow in https://github.com/iangow/se_features/issues/18#issuecomment-517086556_

iangow commented 4 years ago

OK. I think u'\u2019' may be just one of the curly versions. So you may need to inspect the data more closely. Do you have a sample "utterance" (file_name, speaker_number, etc.) that's causing problems?
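One way to inspect the data for other curly versions is to count every apostrophe-like character in an utterance. The following is a minimal sketch, not code from this repository: the function name `count_apostrophe_variants` and the candidate set `APOSTROPHE_LIKE` are assumptions made for illustration, and the set may well be incomplete.

```python
import unicodedata
from collections import Counter

# Hypothetical candidate set of characters that can look like an apostrophe.
# This is an assumption for illustration, not the list used in the repository.
APOSTROPHE_LIKE = "\u2018\u2019\u201b\u02bc\u2032\u0060\u00b4"


def count_apostrophe_variants(text):
    """Count apostrophe-like characters in text, keyed by code point and name."""
    counts = Counter(ch for ch in text if ch in APOSTROPHE_LIKE)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": n
        for ch, n in counts.items()
    }


if __name__ == "__main__":
    sample = "We don\u2019t think it\u2019s a \u2018one-off\u2019 item."
    for label, n in count_apostrophe_variants(sample).items():
        print(label, n)
```

Running something like this over a few utterances would show whether characters other than u'\u2019' actually occur in the data.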

Yvonne-Han commented 4 years ago

> OK. I think u'\u2019' may be just one of the curly versions. So you may need to inspect the data more closely.

I see. I will take a closer look at this.

> Do you have a sample "utterance" (file_name, speaker_number, etc.) that's causing problems?

I don't have one at the moment, as I used some other texts to detect the differences. I managed to find the Unicode documentation for the General Punctuation block, so let me quickly go through everything that looks similar to an apostrophe.

Yvonne-Han commented 4 years ago

Punctuation that looks like apostrophes:

Yvonne-Han commented 4 years ago

[Image not preserved]

Yvonne-Han commented 4 years ago

This should be everything related to apostrophes (hopefully) so I'm closing this now.
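For reference, the kind of normalization discussed in this thread might look like the sketch below. It is only an illustration under stated assumptions: the mapping `APOSTROPHE_VARIANTS` and the function `normalize_apostrophes` are hypothetical names, and the characters actually handled in liwc_functions.py may differ from this list.

```python
# Hypothetical mapping of apostrophe-like characters to the straight apostrophe.
# The characters handled in the repository may differ from these.
APOSTROPHE_VARIANTS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201b": "'",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\u2032": "'",  # PRIME
}

# Build the translation table once so repeated calls are cheap.
_APOSTROPHE_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(text):
    """Replace apostrophe-like characters with the straight ASCII apostrophe."""
    return text.translate(_APOSTROPHE_TABLE)


if __name__ == "__main__":
    print(normalize_apostrophes("We don\u2019t expect that, and I\u2019m confident."))
```

Normalizing before matching means contractions such as "don't" are looked up with the straight apostrophe regardless of which variant appears in the transcript.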

iangow commented 4 years ago

To close this properly, you should have some (previously) problematic text where the issue is (or appears to be) curly apostrophes and then confirm that now the LIWC software and our Python function give the same results. Do you have a sample of problematic text? Or is it just these .txt files?

Yvonne-Han commented 4 years ago

> To close this properly, you should have some (previously) problematic text where the issue is (or appears to be) curly apostrophes and then confirm that now the LIWC software and our Python function give the same results.

Yes, I did compare the results from the LIWC software and my code (in a notebook) and confirmed that they are now the same. (I’m sorry that I forgot to post the notebook here...)

> Do you have a sample of problematic text? Or is it just these .txt files?

No, it’s just these .txt files (with file names indicating the type of punctuation being tested).

But I did try a couple of randomly selected conference call texts and compared the results of the LIWC software and my code. The differences in the affected categories either became smaller or went away, so I think we are now one step closer.
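A comparison like this can be made repeatable with a small helper that lists the categories where the two sets of results disagree. This is a rough sketch, assuming both results are available as plain dictionaries mapping category name to value. The function name `category_differences` and the sample numbers are made up for illustration and are not output from the LIWC software or this repository.

```python
def category_differences(expected, actual, tol=1e-9):
    """Return {category: (expected, actual)} for categories whose values differ.

    A category missing from one side is treated as having value 0.
    """
    diffs = {}
    for cat in sorted(set(expected) | set(actual)):
        e = expected.get(cat, 0)
        a = actual.get(cat, 0)
        if abs(e - a) > tol:
            diffs[cat] = (e, a)
    return diffs


if __name__ == "__main__":
    # Made-up values for illustration only.
    liwc_software = {"negate": 2, "pronoun": 5}
    python_function = {"negate": 1, "pronoun": 5}
    print(category_differences(liwc_software, python_function))
```

An empty result for a previously problematic text would be the confirmation asked for earlier in the thread.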