mccgr / edgar

Code to manage data related to SEC EDGAR

Look into issue with forms345/process_345_xml_documents.R #99

Closed iangow closed 3 years ago

iangow commented 3 years ago

@bdcallen Seems not to be working.

bdcallen commented 3 years ago

@iangow OK, I spent last night looking at this, found the error, and fixed it. The issue seemed to be an error in bind_rows when combining data frames scraped from the data with an empty data frame, typically defined by code like this:

column_names <- c("col1", "col2", "col3")  # ... and so on, up to "coln"
df <- data.frame(matrix(nrow = 0, ncol = length(column_names)), stringsAsFactors = FALSE)
colnames(df) <- column_names
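To see which column types this pattern produces, a small check along the following lines can be run. It is only a sketch: the column names and the `scraped` row are hypothetical and are not taken from the repository.

```r
# Build an empty data frame with the matrix-based pattern (hypothetical column names).
column_names <- c("file_name", "form_type", "owner_cik")
df <- data.frame(matrix(nrow = 0, ncol = length(column_names)),
                 stringsAsFactors = FALSE)
colnames(df) <- column_names

# matrix() fills with NA by default, and NA is logical, so every column is logical.
empty_types <- sapply(df, class)

# Hypothetical scraped row: values read from text arrive as character.
scraped <- data.frame(file_name = "a.xml", form_type = "4", owner_cik = "0001",
                      stringsAsFactors = FALSE)
scraped_types <- sapply(scraped, class)

print(empty_types)
print(scraped_types)
```

The two printed vectors make the mismatch between the empty data frame and the scraped one visible column by column.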

The problem was that the columns of df were now being treated by R as type logical (matrix() fills with NA, which is logical), whereas the corresponding columns of the data frames holding scraped data were of type character. So I defined a new function, make_empty_dataframe_w_colnames:

make_empty_dataframe_w_colnames <- function(column_names) {
    num_cols <- length(column_names)
    empty_df <- data.frame(matrix(nrow = 0, ncol = num_cols), stringsAsFactors = FALSE)
    colnames(empty_df) <- column_names

    for (column in column_names) {
        # Initialize the columns to be character, so that raw data can be written in
        empty_df[, column] <- as.character(empty_df[, column])
    }

    return(empty_df)
}

This defines an empty data frame with the given column_names, with every column cast to character. I then rewrote all the functions that build the data frames written to the database tables to use this function, removing the snippets that defined the logical empty data frames shown above. After that, I ran the program and it worked well.
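As a rough illustration of how the helper behaves, the sketch below repeats its definition so that the block runs on its own, then combines the empty result with one row of made-up data. The column names and the `scraped` row are hypothetical. bind_rows is used when dplyr is installed, with base rbind as a fallback.

```r
# Copy of the helper described above, repeated so this block is self-contained.
make_empty_dataframe_w_colnames <- function(column_names) {
    num_cols <- length(column_names)
    empty_df <- data.frame(matrix(nrow = 0, ncol = num_cols), stringsAsFactors = FALSE)
    colnames(empty_df) <- column_names
    for (column in column_names) {
        empty_df[, column] <- as.character(empty_df[, column])
    }
    return(empty_df)
}

# Hypothetical column names and scraped row, for illustration only.
cols <- c("file_name", "form_type")
empty_df <- make_empty_dataframe_w_colnames(cols)
scraped <- data.frame(file_name = "a.xml", form_type = "4", stringsAsFactors = FALSE)

# Both inputs now have character columns, so combining them involves no type conflict.
if (requireNamespace("dplyr", quietly = TRUE)) {
    combined <- dplyr::bind_rows(empty_df, scraped)
} else {
    combined <- rbind(empty_df, scraped)
}

print(sapply(empty_df, class))
print(sapply(combined, class))
```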

(base) bdcallen@igow-z640:~/edgar$ forms345/update_forms_345_tables.sh

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

Loading required package: xml2

Attaching package: ‘tidyr’

The following object is masked from ‘package:RCurl’:

    complete

[1] 0
[1] "Total time taken: \n"
    user   system  elapsed 
3845.981  561.281  457.426 
[1] "Number of full successes: \n"
[1] 10000
[1] "Number of filings processed: \n"
[1] 10000
Error in UseMethod("xpathApply") : 
  no applicable method for 'xpathApply' applied to an object of class "logical"
Error in UseMethod("xpathApply") : 
  no applicable method for 'xpathApply' appl ....
.
.
.
[1] "Number of full successes: \n"
[1] 159984
[1] "Number of filings processed: \n"
[1] 160000
[1] "Total time taken: \n"
     user    system   elapsed 
63902.187  9132.209  8408.463 
[1] "Number of full successes: \n"
[1] 165297
[1] "Number of filings processed: \n"
[1] 165313
(base) bdcallen@igow-z640:~/edgar$ 

Don't worry too much about the error messages here; I'm pretty sure they come from bad cases.
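The "object of class "logical"" part of the message means that xpathApply was handed a logical value rather than a parsed XML document. That would be consistent with a handful of filings failing to parse, though the thread does not confirm the cause. A minimal sketch of catching such failures per filing and counting successes is below. `process_filing` and `process_all` are hypothetical names, not functions from the repository, and NA stands in for a document that could not be parsed.

```r
# Hypothetical stand-in for a per-filing parser: it fails when given a logical
# value (such as NA) instead of usable document content.
process_filing <- function(doc) {
    if (is.logical(doc)) {
        stop("document could not be parsed")
    }
    nchar(doc)
}

# Run the parser on each input, recording TRUE for a success and FALSE for a failure,
# so that one bad filing does not stop the whole run.
process_all <- function(docs) {
    vapply(docs, function(doc) {
        tryCatch({
            process_filing(doc)
            TRUE
        }, error = function(e) {
            message("Skipping filing: ", conditionMessage(e))
            FALSE
        })
    }, logical(1))
}

docs <- list("<xml>one</xml>", NA, "<xml>two</xml>")
successes <- process_all(docs)
cat("Number of full successes:", sum(successes), "\n")
cat("Number of filings processed:", length(successes), "\n")
```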

I'll commit the new code and close shortly.