mccgr / edgar

Code to manage data related to SEC EDGAR

Handle alternative entries for Boolean variables in the xml documents #44

Closed. bdcallen closed this issue 5 years ago

bdcallen commented 5 years ago

@iangow There are also some filings, like this one, for which the entries for Boolean variables are written as strings such as "true" and "false", rather than in the usual 0/1 format:

<reportingOwnerRelationship>
<isDirector>false</isDirector>
<isOfficer>false</isOfficer>
<isTenPercentOwner>true</isTenPercentOwner>
<isOther>false</isOther>

I will amend the code to handle these cases.

iangow commented 5 years ago

Depending on how these are parsed (as strings, or as integers for 0/1), a simple as.logical() may work to handle both cases.

> as.logical("true")
[1] TRUE
> as.logical("false")
[1] FALSE
> as.logical("1")
[1] NA
> as.logical(1)
[1] TRUE
> as.logical(0)
[1] FALSE
> as.logical("0")
[1] NA

bdcallen commented 5 years ago

@iangow I also realised the above when using as.logical. The function I wrote, string_to_boolean, handles both of these cases:

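string_to_boolean <- function(string) {

    # Strip all whitespace, so that padded values such as " 1 " still parse
    reduced_string <- gsub("[ \t\n\r]", "", string)

    # "0" and "1" must be mapped by hand, because as.logical("1") is NA
    is_binary <- grepl("^[01]$", reduced_string)

    # ifelse() works element-wise, so a whole column can be passed in;
    # anything that is not "0"/"1" ("true", "false", ...) goes to as.logical()
    return(ifelse(is_binary, reduced_string == "1", as.logical(reduced_string)))

}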

Furthermore, I apply string_to_boolean to the logical columns in the functions that scrape the data frames for the tables (get_header, get_derivative_df, get_nonDerivative_df, and so on), so that the appropriate conversions are made.
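
As a rough illustration of the column-wise conversion described above, the sketch below runs string_to_boolean over a small made-up data frame. The data frame `owners`, its values, and the helper `convert_logical_cols` are hypothetical and are not taken from the repository. A scalar form of the function is restated so that the snippet runs on its own, and vapply() calls it one value at a time, so the sketch does not depend on the function being vectorised.

```r
# Scalar form of string_to_boolean, restated only so this snippet is
# self-contained.
string_to_boolean <- function(string) {
    reduced_string <- gsub("[ \t\n\r]", "", string)
    if (grepl("^[01]$", reduced_string)) {
        as.logical(as.integer(reduced_string))
    } else {
        as.logical(reduced_string)
    }
}

# Hypothetical scraped table mixing the 0/1 and true/false conventions.
owners <- data.frame(
    isDirector        = c("0", "false", " 1 "),
    isTenPercentOwner = c("1", "true", "false"),
    stringsAsFactors  = FALSE
)

# Hypothetical helper: convert the named columns one value at a time.
convert_logical_cols <- function(df, cols) {
    for (col in cols) {
        df[[col]] <- vapply(df[[col]], string_to_boolean, logical(1),
                            USE.NAMES = FALSE)
    }
    df
}

owners <- convert_logical_cols(owners, c("isDirector", "isTenPercentOwner"))
print(owners)
```

Using vapply() with logical(1) means that a value converting to anything other than a single logical raises an error instead of silently changing the column type. Unrecognised strings still become NA, as with as.logical().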

Thus, we can close this one as well now.