Closed: iangow closed this issue 5 years ago
@danielacarrasco
I think we should set up #4 to run with the function as is. No need to run it just yet, but just set it up to run with the current fog function as is.
In finance and accounting research, people have used a function from a Perl module. However, in writing this paper, I discovered an issue with this Perl module (see Appendix B).
Recently, the author of that Perl module fixed this issue:
From: Kim Ryan [mailto:kimryan@bigpond.net.au] Sent: Thursday, November 01, 2018 6:19 AM To: Bushee, Brian J bushee@wharton.upenn.edu Subject: Re: Linguistic Complexity in Firm paper
Hi Brian,
Hope you are the best contact for this. I noticed your joint above paper which uses the perl module I developed, Lingua-EN-Fathom. I saw the problems you detected with accurate sentence counting. Adopted your suggestion of splitting sentence with the more reliable method found in Lingua-EN-Sentence. Changes are in the latest version: https://metacpan.org/release/Lingua-EN-Fathom. Calculations are now close to yours in appendix B but not identical. I think this may be due to some upgrades to other modules.
Thanks for using CPAN modules and helping to improve them. Hope Perl is still useful to you. Many people seemed to have moved on to other languages sadly.
Regards,
Kim Ryan
So I think the Perl module might provide a useful target for us. I don't think we want to worry about getting exactly the same answers as the Perl module, but if we're very different, then I guess we should have a good reason for this.
I installed the current version of the Perl module on the MCCGR server:
igow@igow-z640:~$ sudo perl -MCPAN -e shell
Terminal does not support AddHistory.
cpan shell -- CPAN exploration and modules installation (v2.18)
Enter 'h' for help.
cpan[1]> install Lingua::EN::Fathom
Reading '/home/igow/.local/share/.cpan/Metadata'
Database was generated on Tue, 14 Nov 2017 00:41:02 GMT
Fetching with LWP:
[... many messages skipped ...]
Running Build install
Building Lingua-EN-Fathom
Installing /usr/local/share/perl/5.26.1/Lingua/EN/Fathom.pm
Installing /usr/local/man/man3/Lingua::EN::Fathom.3pm
KIMRYAN/Lingua-EN-Fathom-1.22.tar.gz
./Build install -- OK
cpan[2]> exit
Terminal does not support GetHistory.
Lockfile removed.
I then ran the latest code from here. This creates a fog function in the database that uses the Perl module. This code can be called using regular SQL:
crsp=# SELECT (fog_data(speaker_text)).*
FROM streetevents.speaker_data
LIMIT 1;
fog | num_words | percent_complex | num_sentences | fog_original | num_sentences_original | fog
------+-----------+-----------------+---------------+--------------+------------------------+------
10.4 | 50 | 16 | 5 | 10.4 | 5 | 10.4
(1 row)
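As a quick sanity check, the numbers in that sample row are internally consistent with the Gunning fog formula, fog = 0.4 × (percent_complex + words/sentences):

```python
# Recompute fog from the other columns of the sample row above.
num_words = 50
num_sentences = 5
percent_complex = 16.0  # percent of words with >= 3 syllables

fog = 0.4 * (percent_complex + num_words / num_sentences)
print(fog)  # ~10.4, matching the fog columns in the query output
```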
I think it might make sense to grab a sample of rows from streetevents.speaker_data, calculate the fog statistics using the Perl module (using SQL like that above), and then do comparable calculations with the Python code.
It probably makes sense to start with pretty short strings:
crsp=# SELECT speaker_text FROM streetevents.speaker_data
WHERE length(speaker_text) < 100
LIMIT 10;
speaker_text
----------------------------------------------------------------------------------------------------
Okay, great, thanks, guys.
Thank you. Morris Ajzenman from Griffin Securities, your line is open.
Hi, guys.
Hey, Morris.
Right, that's right.
Okay. Now, 28% of revenues internationally fourth quarter, I missed it, what was it for full year?
I didn't provide the full year. I just said that they were down --
No, no, no, as a percent of revenues for the full year?
In the European marketplace?
Yes, Europe, 11 --
(10 rows)
For the fog measure, do we want to create our own function or just use some library? Reading sentences and words can again be done with NLTK, but counting syllables seems trickier. I have found a few resources for NLTK, but they depend on cmudict, which has quite a few words missing. I can modify it so it's as close as possible to the Perl function.
Otherwise, I've found code that's already written. For example, https://pypi.org/project/textstat/ not only has a syllable count but also a Fog Scale. It uses Pyphen (https://pyphen.org), which can be used to count syllables.
@iangow I have written a basic syllable counter using pyphen for the fog function. I wanted to test it now but can't run stuff on the server. I get the following message:
could not connect to server: Connection refused Is the server running on host "10.101.13.99" and accepting TCP/IP connections on port 5432?
I got the same error when trying to run ./word_count_run.py and ./liwc_run.py, which were working before.
I also tried to look at the databases with pgAdmin but had the same issue.
OK. This is due to an upgrade. I will fix it tomorrow.
This should be fixed now. Please don't worry about creating a particularly "good" fog function. Something using NLTK and cmudict would be fine.
Below is some old code I have. Some parts of this are specific to Python 2 and other parts relate to its being run inside a database:
CREATE OR REPLACE FUNCTION public.fog_python(the_text text)
RETURNS double precision
LANGUAGE plpythonu
AS $function$
if 'nsyl' in SD:
    nsyl = SD['nsyl']
    re = SD['re']
    nltk = SD['nltk']
    dic = SD['dic']
else:
    import re, nltk
    from nltk.corpus import cmudict
    dic = cmudict.dict()

    def nsyl(word):
        if word in dic:
            prons = dic[word]
            num_syls = [len([syl for syl in pron if re.findall('[0-9]', syl)])
                        for pron in prons]
            return max(num_syls)

    SD['nsyl'] = nsyl
    SD['re'] = re
    SD['nltk'] = nltk
    SD['dic'] = dic

sents = nltk.sent_tokenize(the_text.decode('utf8'))
words = [word.lower() for sent in sents
         for word in nltk.word_tokenize(sent) if re.findall('[a-zA-Z]', word)]
# Require words to be more than three characters.
# Otherwise, "edu" = "E-D-U" => 3 syllables.
complex_words = [word for word in words if nsyl(word) >= 3 and len(word) > 3]
if len(words) > 0 and len(sents) > 0:
    fog = 0.4 * (100.0 * len(complex_words) / len(words) + 1.0 * len(words) / len(sents))
    return fog
$function$
Below is a version that would be more appropriate for running as a standalone Python function. I think that the_text.decode('utf8') may not be necessary (and may actually cause problems) with Python 3, as everything is Unicode in Python 3 (as is everything in PostgreSQL).
import re, nltk
from nltk.corpus import cmudict

dic = cmudict.dict()

def nsyl(word):
    if word in dic:
        prons = dic[word]
        num_syls = [len([syl for syl in pron if re.findall('[0-9]', syl)])
                    for pron in prons]
        return max(num_syls)

def fog(the_text):
    sents = nltk.sent_tokenize(the_text.decode('utf8'))
    words = [word.lower() for sent in sents
             for word in nltk.word_tokenize(sent) if re.findall('[a-zA-Z]', word)]
    # Require words to be more than three characters.
    # Otherwise, "edu" = "E-D-U" => 3 syllables.
    complex_words = [word for word in words if nsyl(word) >= 3 and len(word) > 3]
    if len(words) > 0 and len(sents) > 0:
        fog = 0.4 * (100.0 * len(complex_words) / len(words)
                     + 1.0 * len(words) / len(sents))
        return fog
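On the decode('utf8') point: in Python 3, str has no .decode() method at all (only bytes does), so leaving that call in would raise an AttributeError rather than merely being redundant. A quick check:

```python
# str in Python 3 is already Unicode; only bytes has .decode().
assert not hasattr("some text", "decode")
assert b"some text".decode("utf8") == "some text"
```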
@danielacarrasco
Here is a quick Python notebook I made to test this code.
I removed the decode('utf8') as discussed. Also, I found I needed to avoid None as a return value from nsyl(word), but this may be OK. I think this code might be fine for current purposes; we can revisit this once some downstream steps are taken care of (it would be good to compile an initial set of features and do some preliminary machine learning work).
I committed some changes last night where I calculate the fog function using pyphen. I'm planning to test it today/tomorrow against the Perl code you mentioned before, but it seems to do a good job and has fewer problems with unrecognised words. I am happy to use what you posted now if you prefer that, though.
I updated the Python notebook to pull a bigger random sample.
Perhaps take a quick look at both the function I provided and the one using pyphen. The Perl function is a "target" of sorts, so if your function gets closer to that one (in terms of correlation or other measures), then perhaps use that. A sample of 100 or 1000 would be fine for evaluating this.
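Once both sets of fog values are computed for a sample, the comparison could be as simple as a correlation matrix and a mean absolute difference in pandas. A sketch with made-up values; the column names are assumptions, not from the actual tables:

```python
import pandas as pd

# Hypothetical fog values for a few speaker turns, one column per method.
df = pd.DataFrame({
    'perl_fog':   [10.4, 18.2, 4.45, 12.21],
    'nltk_fog':   [10.4, 18.2, 4.45, 12.68],
    'pyphen_fog': [10.4, 13.1, 4.39, 12.84],
})

print(df.corr())                                        # pairwise correlations
print((df['nltk_fog'] - df['perl_fog']).abs().mean())   # mean absolute difference
```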
Ideally the fog function I have here would sit in a separate .py module and be imported, but I figured including it inline was fine for initial testing.
I've been testing the two Python functions against the Perl function. The results with your code are almost always equal to the Perl function's. Mine and yours are very similar, although when they differ, mine is a bit lower (by no more than 10%). I looked at a couple of examples; sometimes the Perl function is way off, but normally only when there are just a few words. I think pyphen does a better job in those cases.
I think since you want to start with the machine learning analysis soon, I can commit your function and then generate the tables with that (tomorrow). Otherwise, I am happy to keep investigating the differences.
@danielacarrasco Sounds good.
Let's do whatever is easiest at this stage. If the pyphen function is better and it isn't a whole lot slower, that might be best. For the current exercise, precision is not too important and either option is probably fine; for later exercises, it may be, but we can deal with that then.
@iangow I'll use your function for now because it seems to be faster and easier to implement. I do want to look at the differences in the future though.
I have a couple of questions:
I realised we have the speaker number only in the liwc table, but not in the word_counts table. Do you want me to add it?
The program is giving me an error, which I suspect comes from an update:
TypeError: dtype '<class 'pandas._libs.tslibs.timestamps.Timestamp'>' not understood
It comes from
File "/home/dcarrasco/se_features/word_count/word_count_add.py", line 35, in add_word_counts
    speaker_data['last_update'] = speaker_data['last_update'].astype(pd.Timestamp)
which was previously working. I can look into it.
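On that astype error: newer pandas versions reject pd.Timestamp as an astype target; pd.to_datetime is the supported way to get the same datetime64 column. A sketch, not the exact line from word_count_add.py:

```python
import pandas as pd

# Toy stand-in for the real speaker_data table.
speaker_data = pd.DataFrame({'last_update': ['2019-01-02 03:04:05']})

# astype(pd.Timestamp) raises TypeError on recent pandas;
# to_datetime produces the intended datetime64[ns] column.
speaker_data['last_update'] = pd.to_datetime(speaker_data['last_update'])
print(speaker_data['last_update'].dtype)  # datetime64[ns]
```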
The header for the fog tables will be:
file_name | last_update | speaker_name | context | section | count | fog
Would you prefer that I add the number of words, sentences, and percent_complex? I'll hopefully set the tables running so they're created by tomorrow morning.
Better to use speaker_number rather than speaker_name. Yes, please add those other fields. The more features, the better.
@iangow I think I have finished this issue. I added the speaker_number to the word_counts table, fixed the timestamp issue I mentioned, and I am generating the fog_measure table now. It looks like this:
file_name | last_update | speaker_name | speaker_number | context | section | count | sum | sent_count | sum_6 | sum_num | fog
(basically the word_counts table + the fog measure)
I can start with the ML analysis if you give me more details (in a new issue probably). In the meantime I can update the README.
Please drop speaker_name from this table, both in the code and in the table itself.
It may be necessary to drop the primary key constraint to do this:
ALTER TABLE se_features.fog_measure DROP CONSTRAINT fog_measure_pkey;
ALTER TABLE se_features.fog_measure DROP COLUMN speaker_name;
ALTER TABLE se_features.fog_measure
ADD CONSTRAINT fog_measure_pkey PRIMARY KEY (file_name, last_update, speaker_number, context, section);
@danielacarrasco
Do you have any output or data from your analysis of how the various approaches to fog compared?
Sorry, I had missed this message. These are some numbers from different tables I got:
| Speaker_text | file_name | speaker_number | IG_fog | DCN_fog | Perl | Eye |
|---|---|---|---|---|---|---|
| 0 | 3661108_T | 106 | 5.70 | 5.70 | | |
| 1 | 2290236_T | 95 | 18.95 | 18.95 | 6 | |
| 2 | 2540039_T | 59 | 0.80 | 0.80 | | |
| 3 | 4907587_T | 95 | 4.45 | 4.39 | 4.45 | |
| 4 | 5174927_T | 7 | 5.33 | 3.73 | 5.3 | 4.7 |
| 5 | 932721_T | 17 | 12.21 | 12.84 | 12.21 | |
| 6 | 1112453_T | 58 | 12.68 | 9.15 | 12.68 | |
| 7 | 5221351_T | 23 | 0.40 | 0.40 | 0.4 | |
| 8 | 1818015_T | 6 | 11.48 | 10.28 | 11.48 | |
| 9 | 4185949_T | 12 | 13.90 | 12.86 | 13.90 | |
| 10 | 5272272_T | 18 | 13.73 | 12.55 | 13.73 | |
| 11 | 3800150_T | 313 | 1.20 | 1.20 | | |
| 12 | 11300202_T | 16 | 8.47 | 8.37 | 8.47 | |
| 13 | 5079313_T | 8 | 8.93 | 7.16 | 8.93 | |
| 15 | 1198473_T | 59 | 10.12 | 8.85 | 10.12 | |
| 16 | 1641534_T | 107 | 1.80 | 1.60 | 1.8 | |
| 19 | 5766075_T | 10 | 18.20 | 13.06 | 18.2 | |
| 23 | 3109648_T | 27 | 18.11 | 17.35 | 18.1 | 20.6 |
IG_fog = using NLTK; DCN_fog = using pyphen; Eye = what I calculated by looking at the call.
@danielacarrasco Do you have code to create the table above (except perhaps Eye, I guess)? An IPython notebook would be fine.
Related to #4.