iangow / ling_features

Functions for extracting commonly used linguistic features from text.
MIT License

Check word-count function against that in ABCL #4

Open iangow opened 3 years ago

iangow commented 3 years ago

See notebook here for ABCL code.

Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone (2020), "Using Python for Text Analysis in Accounting Research", Foundations and Trends® in Accounting: Vol. 14, No. 3–4, pp. 128–359.

iangow commented 3 years ago

@yiyangw2 says "The major difference in word_count is that ours tokenizes a string to split off punctuation other than periods but theirs does not. I vote for ours!"

iangow commented 3 years ago

@yiyangw2

This repository has a decent number of followers and seems to have been forked a few times. So perhaps it's pretty good.

Looking at the approach it takes, it uses sent_tokenize from nltk.tokenize to split text into sentences and then TweetTokenizer (also from nltk.tokenize) to break sentences into words. It then counts non-punctuation tokens as words:

def _is_punctuation(self, token):
    match = re.match('^[.,\/#!$%\'\^&\*;:{}=\-_`~()]$', token)
    return match is not None

# ... and, inside the loop over tokens t:
if not self._is_punctuation(t):
    word_count += 1
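Putting those pieces together, here is a minimal sketch of the full pipeline (the function name is mine, and it needs NLTK's punkt sentence models):

import re
from nltk.tokenize import TweetTokenizer, sent_tokenize

# same character class as above, with the redundant escapes dropped
_PUNCT_RE = re.compile(r"^[.,/#!$%'^&*;:{}=\-_`~()]$")

def abcl_style_word_count(text):
    # split into sentences, tokenize each one, and count every
    # token that is not a lone punctuation character
    tokenizer = TweetTokenizer()
    return sum(1
               for sentence in sent_tokenize(text)
               for t in tokenizer.tokenize(sentence)
               if not _PUNCT_RE.match(t))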

It might be worth comparing the word tokenizer we're using with TweetTokenizer on some sample sentences. I think the question is whether we remove punctuation from sentences and then tokenize and count words, or tokenize first and then discard punctuation tokens (as this package does).

Nothing we do will be perfect, so I think it's just a matter of picking something that works reasonably well.

yiyangw2 commented 3 years ago

I found an example here: https://stackoverflow.com/questions/61919670/how-nltk-tweettokenizer-different-from-nltk-word-tokenize

from nltk.tokenize import TweetTokenizer, word_tokenize

tt = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaaayyyy too much for you!!!!!!"
print(tt.tokenize(tweet))
print(word_tokenize(tweet))

# TweetTokenizer output:
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
# word_tokenize output:
# ['This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']

iangow commented 3 years ago

@yiyangw2 I edited your comment above to format the text better.

match = re.match('^[.,\/#!$%\'\^&\*;:{}=\-_`~()]$', token)

The above expression from is_punctuation checks whether the supplied token is a single punctuation character (one of the items between [ and ]), anchored from start (^) to finish ($). I notice that it doesn't match ?, which I think we'd want.
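A quick check (my own snippet):

import re

# the character class from _is_punctuation, redundant escapes removed
pattern = r"^[.,/#!$%'^&*;:{}=\-_`~()]$"
print(bool(re.match(pattern, ';')))  # True
print(bool(re.match(pattern, '?')))  # False: '?' slips through as a "word"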

Here is a notebook that I used to play around with some ideas.

yiyangw2 commented 3 years ago

I do like this "clean" word count! Should we replace the old one with this new function?

iangow commented 3 years ago

To be thorough, I would get some more sample sentences and perhaps compare across alternatives. If this function produces the most reasonable answer, then go with it.

Note that I think the data currently in fog_features is based (in part) on the word_count function. So if we change word_count, we should re-run the fog code (though you may not be using that data directly, so maybe there's no need to hurry on this).

iangow commented 3 years ago

Here is text from the monograph:

# excerpt from Microsoft Corporation's 2016 10-K.
text = """We acquire other companies and intangible assets and may not realize all the economic benefit from those acquisitions, which could cause an impairment of goodwill or intangibles. We review our amortizable intangible assets for impairment when events or changes in circumstances indicate the carrying value may not be recoverable. We test goodwill for impairment at least annually. Factors that may be a change in circumstances, indicating that the carrying value of our goodwill or amortizable intangible assets may not be recoverable, include a decline in our stock price and market capitalization, reduced future cash flow estimates, and slower growth rates in industry segments in which we participate. We may be required to record a significant charge on our consolidated financial statements during the period in which any impairment of our goodwill or amortizable intangible assets is determined, negatively affecting our results of operations."""
iangow commented 3 years ago

From p.159 of ABCL:

Conveniently, Python's string module includes a ready-made string of punctuation characters; looking at the code from ABCL, it seems we'd only need to add the "curly" apostrophe to it.

# Python includes a collection of all punctuation characters
from string import punctuation

# add apostrophe to the punctuation character list
punctuation_w_apostrophe = punctuation + "’"

# print all characters
print(punctuation_w_apostrophe)

# Out:
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~’

I think using something pretty standard is good.
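For example, one way to turn that list into a word count (a sketch of the strip-punctuation-then-split idea, with my own function name, not necessarily what we'd commit):

import string

# punctuation plus the curly apostrophe, as above
punctuation_w_apostrophe = string.punctuation + "’"

def clean_word_count(text):
    # replace each punctuation character with a space, then count
    # the whitespace-separated tokens that remain
    table = str.maketrans(punctuation_w_apostrophe,
                          " " * len(punctuation_w_apostrophe))
    return len(text.translate(table).split())

One side effect of this approach: hyphenated terms count as two words, since the hyphen becomes a space before splitting.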

yiyangw2 commented 3 years ago

I am also thinking about making changes to the number-count function. Currently, (1) it takes 20000 to be a year (because '20000' contains '2000'), and (2) it does not count a number that appears at the end of a text unless the number is followed by a blank. I made the following changes:

import re
import string

def number_count(doc):
    # drop commas and periods between digits ("1,000.5" -> "10005")
    doc = re.sub(r'(?<=[0-9])[.,](?=[0-9])', '', doc)
    # replace remaining punctuation with spaces
    doc = doc.translate(str.maketrans(string.punctuation,
                                      " " * len(string.punctuation)))
    # the trailing \b lets a number at the very end of the text count
    numbers = re.findall(r'\b[-+(]?[$€£]?[-+(]?\d+\)?\b', doc)
    # drop four-digit years 1990-2019; the \b keeps '20000' from matching
    numbers = [x for x in numbers if not re.match(r'(199|20[01])\d\b', x)]
    return len(numbers)
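A quick check of the two cases above (my own test strings):

print(number_count("Revenue was 20000 in total"))  # 1: '20000' no longer dropped as a year
print(number_count("The balance was 350"))         # 1: a trailing number is now counted
print(number_count("In 2016 we reported 1,000"))   # 1: '2016' is still excluded as a year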