mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Update edgar.cusip_cik. #11

Closed iangow closed 4 years ago

iangow commented 6 years ago

I copied data over from the "old" EDGAR repository (filings) to the new repository (edgar). We need to move the code over to. This might be a good opportunity to clean up the code.

The current code is here. This uses Perl. Maybe we won't bother to change that part ... I think the first thing is just to document what the code is doing.

crsp=# CREATE TABLE edgar.cusip_cik AS SELECT * FROM filings.cusip_cik;
SELECT 1198100
crsp=# ALTER TABLE edgar.cusip_cik OWNER TO edgar;
ALTER TABLE
crsp=# GRANT SELECT ON edgar.cusip_cik TO edgar_access;
GRANT
bdcallen commented 6 years ago

@iangow Processing the first chunk of code from extract_cusips.R

#!/usr/bin/env Rscript
> 
> # Get a list of files that need to be processed ----
> 
> library("RPostgreSQL")
> pg <- dbConnect(PostgreSQL())
> 
> if (!dbExistsTable(pg, c("edgar", "cusip_cik"))) {
+     dbGetQuery(pg, "
+         CREATE TABLE edgar.cusip_cik
+             (
+               file_name text,
+               cusip text,
+               cik integer,
+               company_name text,
+               format text
+             )
+ 
+         GRANT SELECT ON TABLE edgar.cusip_cik TO crsp_basic;
+ 
+         CREATE INDEX ON edgar.cusip_cik (cusip);
+         CREATE INDEX ON edgar.cusip_cik (cik);")
+ }
> 
> # Note that this assumes that streetevents.calls is up to date.
> file_list <- dbGetQuery(pg, "
+     SET work_mem='2GB';
+ 
+     SELECT file_name
+     FROM edgar.filings
+     WHERE form_type IN ('SC 13G', 'SC 13G/A', 'SC 13D', 'SC 13D/A')
+     EXCEPT
+     SELECT file_name
+     FROM edgar.cusip_cik
+     ORDER BY file_name")
> 
> rs <- dbDisconnect(pg)
> 
> # Create function to parse a SC 13D or SC 13F filing ----
> parseFile <- function(file_name) {
+ 
+     # Parse the indicated file using a Perl script
+     system(paste("extract_cusips.pl", file_name),
+            intern = TRUE)
+ }
> head(file_list)
                                    file_name
1 edgar/data/1000045/0000315066-18-001444.txt
2 edgar/data/1000045/0000897069-15-000319.txt
3 edgar/data/1000045/0000930413-17-000708.txt
4 edgar/data/1000045/0001037389-17-000038.txt
5 edgar/data/1000045/0001037389-18-000160.txt
6 edgar/data/1000045/0001104659-17-009276.txt

yielded file_list just fine.

However, when I tried the latter chunk

library(parallel)
system.time({
    res <- unlist(mclapply(file_list$file_name, parseFile, mc.cores=12))
})

I got

sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
   user  system elapsed 
  0.090   0.757   0.217 
Warning message:
In mclapply(file_list$file_name, parseFile, mc.cores = 12) :
  all scheduled cores encountered errors in user code

I figured this had something to do with the Perl script extract_cusips.pl, so I then tried a few commands in terminal

bdcallen@igow-z640:~/edgar$ extract_cusips.pl edgar/data/1000045/0000315066-18-001444.txt
extract_cusips.pl: command not found
bdcallen@igow-z640:~/edgar$ perl extract_cusips.pl edgar/data/1000045/0000315066-18-001444.txt
Can't locate HTML/Strip.pm in @INC (you may need to install the HTML::Strip module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.0 /usr/local/share/perl/5.26.0 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at extract_cusips.pl line 3.
BEGIN failed--compilation aborted at extract_cusips.pl line 3.

Do Perl programs have to be compiled in some way? Do you know what is going on in the above?

bdcallen commented 6 years ago

@iangow Extended test.pl to the following

#!/usr/bin/env perl
use DBI;
use HTML::Strip;
use File::stat;
use Env qw($PGDATABASE $PGUSER $PGHOST $EDGAR_DIR);

$path_to_edgar = $EDGAR_DIR ? $EDGAR_DIR : "/Volumes/2TB/data/";
print $path_to_edgar;
print "\n";

# Connect to my database
$PGDATABASE = $PGDATABASE ? $PGDATABASE : "crsp";
$PGUSER = $PGUSER ? $PGUSER : "igow";
$PGHOST= $PGHOST ? $PGHOST : "localhost";

print $PGDATABASE;
print "\n";
print $PGUSER;
print "\n";
print $PGHOST;
print "\n";

$file_name = @ARGV[0];

print $file_name;
print "\n";

my $dbh = DBI->connect("dbi:Pg:dbname=$PGDATABASE;host=$PGHOST", "$PGUSER")
    or die "Cannot connect: " . $DBI::errstr;

print $dbh;
print "\n";

# Get the file name
$full_path = $path_to_edgar . '/' . $file_name ;

print $full_path;
print "\n";

if(-e $full_path) {
    print "file exists \n";
} else {
    print "file does not exist \n";
}

my $filesize = stat($full_path)->size;

print $filesize;
print"\n";

print "Hello World!\n";

and I got the output

/Volumes/2TB/data/
crsp
bdcallen
10.101.13.99
edgar/data/1072006/0001072006-01-500007.txt
DBI::db=HASH(0x55834575b290)
/Volumes/2TB/data//edgar/data/1072006/0001072006-01-500007.txt
file does not exist
Can't call method "size" on an undefined value at test.pl line 48.

Seems the files are not being found.

iangow commented 6 years ago

It's because the environment variable EDGAR_DIR is not set.

It should be /home/shared. At the command line it can be set thus:

igow@igow-z640:/home/shared$ export EDGAR_DIR=/home/shared
igow@igow-z640:/home/shared$ echo $EDGAR_DIR
/home/shared

In R use Sys.setenv(EDGAR_DIR = "/home/shared").

bdcallen commented 6 years ago

I successfully set EDGAR_DIR, but still getting that the file does not exist

/home/shared
/home/shared
crsp
bdcallen
10.101.13.99
edgar/data/1218315/0000932799-06-000103.txt
DBI::db=HASH(0x56522ee80870)
/home/shared/edgar/data/1218315/0000932799-06-000103.txt
file does not exist
Can't call method "size" on an undefined value at test.pl line 51.

Have had a look in /home/shared/edgar/data and 1218315 does indeed not exist. I'm guessing we need to run get_13D_filings.R first?

iangow commented 6 years ago

Yes. You will need to set EDGAR_DIR first.

bdcallen commented 5 years ago

@iangow Just committed some of the functions I've written in Python. Here's the cusip scraping function, atm

def get_cusip_cik(file_name):

    url = get_filing_txt_url(file_name)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = soup.getText()

    cusip_hdr = r'CUSIP\s+(?:No\.|#|Number):?'
    cusip_fmt = r'([0-9A-Z]{1,3}[\s-]?[0-9A-Z]{3,5}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1}|' \
                + '[0-9A-Z]{4,6}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1})'

    regex_dict = {'A': cusip_hdr + '[\t\r\n\s]+' + cusip_fmt,
                  'B': cusip_fmt + r'[\n]?[_-]?\s+(?:[_-]{9,})?[\s\r\t\n]*\(CUSIP Number\)',
                  'C': cusip_fmt + '[\s\t\r]*[\n]?' + '[\s\t\r]*\(CUSIP Number of Class of Securities\)'
                 }

    df_list = []

    for key, regex in regex_dict.items():

        matches = re.findall(regex, text)

        cusips = [re.sub('[^0-9A-Z]', '', re.search(cusip_fmt, match).group(0)) for match in matches]

        if(len(cusips)):
            df = pd.DataFrame({'cusip': cusips})
            df['format'] = key
            df['file_name'] = file_name
            df = df[["file_name", "cusip", "format"]]

        else:
            df = pd.DataFrame({"file_name": [], "cusip": [], "format": []})

        df_list.append(df)

    full_df = pd.concat(df_list)

    formats = full_df.groupby('cusip').apply(lambda x: ''.join(x['format'].unique().tolist()))

    full_df['formats'] = full_df['cusip'].apply(lambda x: formats[x])

    full_df = full_df[['file_name', 'cusip', 'formats']]

    full_df = full_df.drop_duplicates()

    return(full_df)

As you can see, it now uses a dictionary regex_dict, so a continuing issue is to add more regular expressions to it as we find them.. Note in regex_dict, I've redefined the pattern that was previously format = 'D' as 'B'. Also, the final dataframe has formats as a string rather than just format: here, formats contains the list of unique regular expressions matched for the cusip written as a string, so if for some entry formats = 'AB', it means matches were found for the given cusip for both regular expressions A and B.

bdcallen commented 5 years ago

@iangow I am currently giving the program extract_cusips.py a trial run. It has successfully processed about 16000 filings so far, and I'm going to let it run over night. I think it should hit around 100000 filings by around midday tomorrow.

iangow commented 5 years ago

I think the main thing to check would be how it classifies compared to the earlier Perl code (using existing data tables for the latter).

bdcallen commented 4 years ago

@iangow I think we can close this now, as I let the program go to completion. Perhaps we should make a follow on issue for adding a cronjob for this table.