iangow / se_features

Linguistic features derived from StreetEvents

Create and test fog function. #9

Closed. iangow closed this issue 5 years ago.

iangow commented 5 years ago

Related to #4.

iangow commented 5 years ago

@danielacarrasco

I think we should set up #4 to run with the fog function as is. No need to run it just yet; just set it up to run with the current version of the function.

In finance and accounting research, people have used a function from a Perl module. However, in writing this paper, I discovered an issue with this Perl module (see Appendix B).
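For reference, the measure in question is the Gunning fog index,

    fog = 0.4 * (100 * complex_words / words + words / sentences)

where a complex word is one with three or more syllables. This is the calculation the code below implements.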

Recently, the author of that Perl module fixed this issue:

From: Kim Ryan [mailto:kimryan@bigpond.net.au]
Sent: Thursday, November 01, 2018 6:19 AM
To: Bushee, Brian J <bushee@wharton.upenn.edu>
Subject: Re: Linguistic Complexity in Firm paper

Hi Brian,

Hope you are the best contact for this. I noticed your joint paper above, which uses the Perl module I developed, Lingua-EN-Fathom. I saw the problems you detected with accurate sentence counting. Adopted your suggestion of splitting sentences with the more reliable method found in Lingua-EN-Sentence. Changes are in the latest version: https://metacpan.org/release/Lingua-EN-Fathom. Calculations are now close to yours in Appendix B but not identical. I think this may be due to some upgrades to other modules.

Thanks for using CPAN modules and helping to improve them. Hope Perl is still useful to you. Many people seemed to have moved on to other languages sadly.

Regards,

Kim Ryan

So I think the Perl module might provide a useful target for us. I don't think we want to worry about getting exactly the same answers as the Perl module, but if we're very different, then I guess we should have a good reason for this.

I installed the current version of the Perl module on the MCCGR server:

igow@igow-z640:~$ sudo perl -MCPAN -e shell
Terminal does not support AddHistory.

cpan shell -- CPAN exploration and modules installation (v2.18)
Enter 'h' for help.

cpan[1]> install Lingua::EN::Fathom
Reading '/home/igow/.local/share/.cpan/Metadata'
  Database was generated on Tue, 14 Nov 2017 00:41:02 GMT
Fetching with LWP:

[... skipping a lot of messages ...]

Running Build install
Building Lingua-EN-Fathom
Installing /usr/local/share/perl/5.26.1/Lingua/EN/Fathom.pm
Installing /usr/local/man/man3/Lingua::EN::Fathom.3pm
  KIMRYAN/Lingua-EN-Fathom-1.22.tar.gz
  ./Build install  -- OK

cpan[2]> exit
Terminal does not support GetHistory.
Lockfile removed.

I then ran the latest code from here. This creates a fog function in the database that uses the Perl module. The function can be called using regular SQL:

crsp=# SELECT (fog_data(speaker_text)).*
FROM streetevents.speaker_data 
LIMIT 1;
 fog  | num_words | percent_complex | num_sentences | fog_original | num_sentences_original | fog  
------+-----------+-----------------+---------------+--------------+------------------------+------
 10.4 |        50 |              16 |             5 |         10.4 |                      5 | 10.4
(1 row)

I think it might make sense to grab a sample of rows from streetevents.speaker_data, calculate the fog statistics using the Perl module (using SQL like that above) and then do comparable calculations for the Python code.
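For example, SQL along the following lines would pull a random sample (TABLESAMPLE and the sampling percentage here are just a sketch, and the extra key columns are there to match rows up with the Python results):

SELECT file_name, last_update, speaker_number,
       (fog_data(speaker_text)).*
FROM streetevents.speaker_data TABLESAMPLE BERNOULLI (0.01)
LIMIT 100;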

It probably makes sense to start with pretty short strings:

crsp=# SELECT speaker_text FROM streetevents.speaker_data
WHERE length(speaker_text) < 100 
LIMIT 10;
                                            speaker_text                                            
----------------------------------------------------------------------------------------------------
 Okay, great, thanks, guys.
 Thank you. Morris Ajzenman from Griffin Securities, your line is open.
 Hi, guys.
 Hey, Morris.
 Right, that's right.
 Okay. Now, 28% of revenues internationally fourth quarter, I missed it, what was it for full year?
 I didn't provide the full year. I just said that they were down --
 No, no, no, as a percent of revenues for the full year?
 In the European marketplace?
 Yes, Europe, 11 --
(10 rows)
danielacarrasco commented 5 years ago

For the fog measure, do we want to create our own function or just use some library? Reading sentences and words can be done again with NLTK, but counting syllables seems to be trickier. I have found a few resources for NLTK, but they depend on cmudict, which has quite a few words missing. I can modify the approach so it's as close as possible to the Perl function.

Otherwise, I've found code that's already written. For example, https://pypi.org/project/textstat/ not only has a syllable count but also a fog scale. It uses Pyphen (https://pyphen.org), which can be used to count syllables.
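For example, a minimal sketch of syllable counting with pyphen (the language code and the hyphenation-based approximation are my assumptions, not taken from textstat):

import pyphen

dic = pyphen.Pyphen(lang='en_US')

def nsyl(word):
    # Approximate the syllable count as hyphenation points + 1.
    # This is a heuristic: hyphenation dictionaries do not mark every
    # syllable boundary, so counts can run a little low.
    return len(dic.inserted(word).split('-'))

print(nsyl('international'))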

danielacarrasco commented 5 years ago

@iangow I have written a basic syllable counter using pyphen for the fog function. I wanted to test it now but can't run anything on the server. I get the following message:

could not connect to server: Connection refused
    Is the server running on host "10.101.13.99" and accepting TCP/IP connections on port 5432?

I got the same error when trying to run ./word_count_run.py and ./liwc_run.py, which were working before. I also tried to view the databases with pgAdmin but had the same issue.

iangow commented 5 years ago

OK. This is due to an upgrade. I will fix it tomorrow.

iangow commented 5 years ago

This should be fixed now. Please don't worry about creating a particularly "good" fog function. Something using NLTK and cmudict would be fine.

iangow commented 5 years ago

Below is some old code I have. Some parts of this are specific to Python 2 and other parts relate to its being run inside a database:

CREATE OR REPLACE FUNCTION public.fog_python(the_text text)
  RETURNS double precision
  LANGUAGE plpythonu
AS $function$
    # SD is PL/Python's session dictionary; it is used here to cache the
    # imports and the cmudict lookup across calls to this function.
    if 'nsyl' in SD:
        nsyl = SD['nsyl']
        re = SD['re']
        nltk = SD['nltk']
        dic = SD['dic']
    else:
        import re, nltk
        from nltk.corpus import cmudict
        dic = cmudict.dict()

        def nsyl(word):
            # Maximum syllable count over all pronunciations; syllables
            # are identified by the stress digits in cmudict phonemes.
            if word in dic:
                prons = dic[word]
                num_syls = [len([syl for syl in pron if re.findall('[0-9]', syl)]) for pron in prons]
                return max(num_syls)

        SD['nsyl'] = nsyl
        SD['re'] = re
        SD['nltk'] = nltk
        SD['dic'] = dic

    sents = nltk.sent_tokenize(the_text.decode('utf8'))  # Python 2 only
    words = [word.lower() for sent in sents
             for word in nltk.word_tokenize(sent) if re.findall('[a-zA-Z]', word)]

    # Require words to be more than three characters.
    # Otherwise, "edu" = "E-D-U" => 3 syllables.
    complex_words = [word for word in words if nsyl(word) >= 3 and len(word) > 3]
    if len(words) > 0 and len(sents) > 0:
        fog = 0.4 * (100.0*len(complex_words)/len(words) + 1.0*len(words)/len(sents))
        return fog
$function$
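Once created, the function can be called just like fog_data above, e.g.:

SELECT file_name, fog_python(speaker_text) AS fog
FROM streetevents.speaker_data
LIMIT 10;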
iangow commented 5 years ago

Below is a version that would be more appropriate for running as a standalone Python function. I think that the_text.decode('utf8') may not be necessary (and may actually cause problems) with Python 3, as everything is Unicode in Python 3 (as is everything in PostgreSQL).

import re, nltk
from nltk.corpus import cmudict
dic = cmudict.dict()

def nsyl(word):
    if word in dic:
        prons = dic[word]
        num_syls = [len([syl for syl in pron if re.findall('[0-9]', syl)]) for pron in prons]
        return max(num_syls)

def fog(the_text):
    sents = nltk.sent_tokenize(the_text.decode('utf8'))
    words = [word.lower() for sent in sents
                for word in nltk.word_tokenize(sent) if re.findall('[a-zA-Z]', word)]

    # Require words to be more than three characters. Otherwise, "edu"="E-D-U" => 3 syllables
    complex_words = [word for word in words if nsyl(word)>=3 and len(word)>3]
    if len(words)>0 and len(sents)>0:
        fog = 0.4 * (100.0*len(complex_words)/len(words) + 1.0*len(words)/len(sents))
        return(fog)
iangow commented 5 years ago

@danielacarrasco

Here is a quick Python notebook I made to test this code.

I removed the decode('utf8') as discussed. Also, I found I needed to avoid None as a return value from nsyl(word), but this may be OK. I think this code might be fine for current purposes; we can revisit it once some downstream steps are taken care of (it would be good to compile an initial set of features and do some preliminary machine-learning work).
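For reference, a minimal sketch of the standalone function with those two changes applied (decode('utf8') dropped for Python 3, and words missing from cmudict treated as zero-syllable rather than returning None; that treatment of missing words is one choice, not the only one):

import re
import nltk
from nltk.corpus import cmudict

dic = cmudict.dict()

def nsyl(word):
    # Count the stress digits in each pronunciation; return 0 for words
    # missing from cmudict so the comparison below never sees None.
    if word not in dic:
        return 0
    return max(len([syl for syl in pron if re.findall('[0-9]', syl)])
               for pron in dic[word])

def fog(the_text):
    # Python 3 strings are already Unicode, so no decode('utf8') here.
    sents = nltk.sent_tokenize(the_text)
    words = [word.lower() for sent in sents
             for word in nltk.word_tokenize(sent) if re.findall('[a-zA-Z]', word)]

    # Require words to be more than three characters.
    # Otherwise, "edu" = "E-D-U" => 3 syllables.
    complex_words = [word for word in words if nsyl(word) >= 3 and len(word) > 3]
    if words and sents:
        return 0.4 * (100.0 * len(complex_words) / len(words)
                      + 1.0 * len(words) / len(sents))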

danielacarrasco commented 5 years ago

I committed some changes last night in which I calculate the fog function using pyphen. I'm planning to test it today or tomorrow against the Perl code you mentioned, but it seems to do a good job and has fewer problems with unrecognised words. I'm happy to use what you posted now if you prefer, though.

iangow commented 5 years ago

I updated the Python notebook to pull a bigger random sample.

Perhaps take a quick look at both the function I provided and the one using pyphen. The Perl function is a "target" of sorts, so if your function gets closer to it (in terms of correlation or other measures), then perhaps we should use yours. A sample of 100 or 1,000 rows would be fine for evaluating this.
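Something like the following would do for the comparison (the CSV file and column names here are hypothetical; however the sample is assembled, the point is one fog value per method per speaker turn):

import pandas as pd

# Hypothetical input: one row per sampled speaker turn, with the fog
# value computed by each of the three methods.
df = pd.read_csv('fog_comparison_sample.csv')

# Pairwise correlations between the three measures.
print(df[['perl_fog', 'nltk_fog', 'pyphen_fog']].corr())

# How far each Python implementation sits from the Perl "target".
print((df['nltk_fog'] - df['perl_fog']).abs().describe())
print((df['pyphen_fog'] - df['perl_fog']).abs().describe())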

Ideally the fog function I have here would sit in a separate .py module and be imported, but I figured including it inline was fine for initial testing.

danielacarrasco commented 5 years ago

I've been testing the two Python implementations against the Perl function. The results with your code are almost always equal to the Perl function's, and your results and mine are very similar, although when they differ, mine is a bit lower (by no more than 10%). I had a look at a couple of examples: sometimes the Perl function is way off, but normally that happens when there are only a few words. I think pyphen does a better job in those cases.

Since you want to start with the machine-learning analysis soon, I can commit your function and then generate the tables with it (tomorrow). Otherwise, I'm happy to keep investigating the differences.

iangow commented 5 years ago

@danielacarrasco Sounds good.

Let's do whatever is easiest at this stage. If the pyphen function is better and isn't a whole lot slower, that might be best. For the current exercise, precision is not too important and either option is probably fine; for later exercises it may matter, but we can deal with that then.
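A quick way to check the speed question (fog_nltk, fog_pyphen, and sample_texts are hypothetical names for the two implementations and a list of speaker_text strings):

import timeit

# Time each implementation over the same sample of texts.
for fn in (fog_nltk, fog_pyphen):
    t = timeit.timeit(lambda: [fn(s) for s in sample_texts], number=3)
    print(fn.__name__, t)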

danielacarrasco commented 5 years ago

@iangow I'll use your function for now because it seems to be faster and easier to implement. I do want to look at the differences in the future though.

I have a couple of questions:

- Should this table identify speakers by speaker_name or by speaker_number?
- Should I add the other fields from the word_counts table to this table as well?

Hopefully I'll leave the tables running tonight so they are created by tomorrow morning.

iangow commented 5 years ago

Better to use speaker_number rather than speaker_name. Yes, please add those other fields: the more features, the better.

danielacarrasco commented 5 years ago

@iangow I think I have finished this issue. I added the speaker_number to the word_counts table, fixed the timestamp issue I mentioned, and I am generating the fog_measure table now. It looks like this:

file_name | last_update | speaker_name | speaker_number | context | section | count | sum | sent_count | sum_6 | sum_num | fog

(basically the word_counts table + the fog measure)

I can start with the ML analysis if you give me more details (in a new issue probably). In the meantime I can update the README.

iangow commented 5 years ago

Please drop speaker_name from this table, both in the code and in the table itself.

It may be necessary to drop the primary key constraint to do this:

ALTER TABLE se_features.fog_measure DROP CONSTRAINT fog_measure_pkey;
ALTER TABLE se_features.fog_measure DROP COLUMN speaker_name;
ALTER TABLE se_features.fog_measure
    ADD CONSTRAINT fog_measure_pkey PRIMARY KEY (file_name, last_update, speaker_number, context, section);
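After running these, \d se_features.fog_measure in psql should confirm that speaker_name is gone and the new primary key is in place.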
iangow commented 5 years ago

@danielacarrasco

Do you have any output or data from your analysis of how the various approaches to fog compared?

danielacarrasco commented 5 years ago

Sorry - I had missed this message. These are some numbers from the different tables I got:

    Speaker_text  file_name   speaker_number  IG_fog  DCN_fog   Perl   Eye
0                 3661108_T              106    5.70     5.70
1                 2290236_T               95   18.95    18.95            6
2                 2540039_T               59    0.80     0.80
3                 4907587_T               95    4.45     4.39   4.45
4                 5174927_T                7    5.33     3.73   5.3    4.7
5                  932721_T               17   12.21    12.84  12.21
6                 1112453_T               58   12.68     9.15  12.68
7                 5221351_T               23    0.40     0.40          0.4
8                 1818015_T                6   11.48    10.28  11.48
9                 4185949_T               12   13.90    12.86  13.90
10                5272272_T               18   13.73    12.55  13.73
11                3800150_T              313    1.20     1.20
12               11300202_T               16    8.47     8.37   8.47
13                5079313_T                8    8.93     7.16   8.93
15                1198473_T               59   10.12     8.85  10.12
16                1641534_T              107    1.80     1.60   1.8
19                5766075_T               10   18.20    13.06  18.2
23                3109648_T               27   18.11    17.35  18.1   20.6

IG_fog = using NLTK; DCN_fog = using pyphen; Eye = what I calculated by looking at the call.

iangow commented 5 years ago

@danielacarrasco Do you have code to create the table above (except perhaps Eye, I guess)? An IPython notebook would be fine.