Closed iangow closed 4 years ago
@iangow Processing the first chunk of code from extract_cusips.R
#!/usr/bin/env Rscript
>
> # Get a list of files that need to be processed ----
>
> library("RPostgreSQL")
> pg <- dbConnect(PostgreSQL())
>
> if (!dbExistsTable(pg, c("edgar", "cusip_cik"))) {
+ dbGetQuery(pg, "
+ CREATE TABLE edgar.cusip_cik
+ (
+ file_name text,
+ cusip text,
+ cik integer,
+ company_name text,
+ format text
+ )
+
+ GRANT SELECT ON TABLE edgar.cusip_cik TO crsp_basic;
+
+ CREATE INDEX ON edgar.cusip_cik (cusip);
+ CREATE INDEX ON edgar.cusip_cik (cik);")
+ }
>
> # Note that this assumes that streetevents.calls is up to date.
> file_list <- dbGetQuery(pg, "
+ SET work_mem='2GB';
+
+ SELECT file_name
+ FROM edgar.filings
+ WHERE form_type IN ('SC 13G', 'SC 13G/A', 'SC 13D', 'SC 13D/A')
+ EXCEPT
+ SELECT file_name
+ FROM edgar.cusip_cik
+ ORDER BY file_name")
>
> rs <- dbDisconnect(pg)
>
> # Create function to parse a SC 13D or SC 13F filing ----
> parseFile <- function(file_name) {
+
+ # Parse the indicated file using a Perl script
+ system(paste("extract_cusips.pl", file_name),
+ intern = TRUE)
+ }
> head(file_list)
file_name
1 edgar/data/1000045/0000315066-18-001444.txt
2 edgar/data/1000045/0000897069-15-000319.txt
3 edgar/data/1000045/0000930413-17-000708.txt
4 edgar/data/1000045/0001037389-17-000038.txt
5 edgar/data/1000045/0001037389-18-000160.txt
6 edgar/data/1000045/0001104659-17-009276.txt
yielded file_list
just fine.
However, when I tried the latter chunk
library(parallel)
system.time({
res <- unlist(mclapply(file_list$file_name, parseFile, mc.cores=12))
})
I got
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
sh: 1: extract_cusips.pl: not found
user system elapsed
0.090 0.757 0.217
Warning message:
In mclapply(file_list$file_name, parseFile, mc.cores = 12) :
all scheduled cores encountered errors in user code
I figured this had something to do with the Perl script extract_cusips.pl
, so I then tried a few commands in terminal
bdcallen@igow-z640:~/edgar$ extract_cusips.pl edgar/data/1000045/0000315066-18-001444.txt
extract_cusips.pl: command not found
bdcallen@igow-z640:~/edgar$ perl extract_cusips.pl edgar/data/1000045/0000315066-18-001444.txt
Can't locate HTML/Strip.pm in @INC (you may need to install the HTML::Strip module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.0 /usr/local/share/perl/5.26.0 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at extract_cusips.pl line 3.
BEGIN failed--compilation aborted at extract_cusips.pl line 3.
Do Perl programs have to be compiled in some way? Do you know what is going on in the above?
@iangow Extended test.pl to the following
#!/usr/bin/env perl
use DBI;
use HTML::Strip;
use File::stat;
use Env qw($PGDATABASE $PGUSER $PGHOST $EDGAR_DIR);
$path_to_edgar = $EDGAR_DIR ? $EDGAR_DIR : "/Volumes/2TB/data/";
print $path_to_edgar;
print "\n";
# Connect to my database
$PGDATABASE = $PGDATABASE ? $PGDATABASE : "crsp";
$PGUSER = $PGUSER ? $PGUSER : "igow";
$PGHOST= $PGHOST ? $PGHOST : "localhost";
print $PGDATABASE;
print "\n";
print $PGUSER;
print "\n";
print $PGHOST;
print "\n";
$file_name = @ARGV[0];
print $file_name;
print "\n";
my $dbh = DBI->connect("dbi:Pg:dbname=$PGDATABASE;host=$PGHOST", "$PGUSER")
or die "Cannot connect: " . $DBI::errstr;
print $dbh;
print "\n";
# Get the file name
$full_path = $path_to_edgar . '/' . $file_name ;
print $full_path;
print "\n";
if(-e $full_path) {
print "file exists \n";
} else {
print "file does not exist \n";
}
my $filesize = stat($full_path)->size;
print $filesize;
print"\n";
print "Hello World!\n";
and I got the output
/Volumes/2TB/data/
crsp
bdcallen
10.101.13.99
edgar/data/1072006/0001072006-01-500007.txt
DBI::db=HASH(0x55834575b290)
/Volumes/2TB/data//edgar/data/1072006/0001072006-01-500007.txt
file does not exist
Can't call method "size" on an undefined value at test.pl line 48.
Seems the files are not being found.
It's because the environment variable EDGAR_DIR
is not set.
It should be /home/shared
. At the command line it can be set thus:
igow@igow-z640:/home/shared$ export EDGAR_DIR=/home/shared
igow@igow-z640:/home/shared$ echo $EDGAR_DIR
/home/shared
In R use Sys.setenv(EDGAR_DIR = "/home/shared")
.
I successfully set EDGAR_DIR
, but still getting that the file does not exist
/home/shared
/home/shared
crsp
bdcallen
10.101.13.99
edgar/data/1218315/0000932799-06-000103.txt
DBI::db=HASH(0x56522ee80870)
/home/shared/edgar/data/1218315/0000932799-06-000103.txt
file does not exist
Can't call method "size" on an undefined value at test.pl line 51.
Have had a look in /home/shared/edgar/data and 1218315 does indeed not exist. I'm guessing we need to run get_13D_filings.R
first?
Yes. You will need to set EDGAR_DIR
first.
@iangow Just committed some of the functions I've written in Python. Here's the cusip scraping function, atm
def get_cusip_cik(file_name):
url = get_filing_txt_url(file_name)
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
text = soup.getText()
cusip_hdr = r'CUSIP\s+(?:No\.|#|Number):?'
cusip_fmt = r'([0-9A-Z]{1,3}[\s-]?[0-9A-Z]{3,5}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1}|' \
+ '[0-9A-Z]{4,6}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1})'
regex_dict = {'A': cusip_hdr + '[\t\r\n\s]+' + cusip_fmt,
'B': cusip_fmt + r'[\n]?[_-]?\s+(?:[_-]{9,})?[\s\r\t\n]*\(CUSIP Number\)',
'C': cusip_fmt + '[\s\t\r]*[\n]?' + '[\s\t\r]*\(CUSIP Number of Class of Securities\)'
}
df_list = []
for key, regex in regex_dict.items():
matches = re.findall(regex, text)
cusips = [re.sub('[^0-9A-Z]', '', re.search(cusip_fmt, match).group(0)) for match in matches]
if(len(cusips)):
df = pd.DataFrame({'cusip': cusips})
df['format'] = key
df['file_name'] = file_name
df = df[["file_name", "cusip", "format"]]
else:
df = pd.DataFrame({"file_name": [], "cusip": [], "format": []})
df_list.append(df)
full_df = pd.concat(df_list)
formats = full_df.groupby('cusip').apply(lambda x: ''.join(x['format'].unique().tolist()))
full_df['formats'] = full_df['cusip'].apply(lambda x: formats[x])
full_df = full_df[['file_name', 'cusip', 'formats']]
full_df = full_df.drop_duplicates()
return(full_df)
As you can see, it now uses a dictionary regex_dict
, so a continuing issue is to add more regular expressions to it as we find them.. Note in regex_dict
, I've redefined the pattern that was previously format = 'D'
as 'B'. Also, the final dataframe has formats
as a string rather than just format
: here, formats
contains the list of unique regular expressions matched for the cusip written as a string, so if for some entry formats = 'AB'
, it means matches were found for the given cusip for both regular expressions A
and B
.
@iangow I am currently giving the program extract_cusips.py
a trial run. It has successfully processed about 16000 filings so far, and I'm going to let it run over night. I think it should hit around 100000 filings by around midday tomorrow.
I think the main thing to check would be how it classifies compared to the earlier Perl code (using existing data tables for the latter).
@iangow I think we can close this now, as I let the program go to completion. Perhaps we should make a follow on issue for adding a cronjob for this table.
I copied data over from the "old" EDGAR repository (
filings
) to the new repository (edgar
). We need to move the code over to. This might be a good opportunity to clean up the code.The current code is here. This uses Perl. Maybe we won't bother to change that part ... I think the first thing is just to document what the code is doing.