openva / crump

A parser for the Virginia State Corporation Commission's business registration records.
https://vabusinesses.org/
MIT License

Convert industry codes #80

Closed waldoj closed 9 years ago

waldoj commented 9 years ago

2_corporates doesn't have industry codes, but instead has actual textual descriptions, while 3_lp and 9_llc use codes. That's not how it looks in cisbemon.txt, though: there, industry codes are used. So somewhere we're converting codes to industry names, but I don't actually know where. :-/ Figure out where that's happening, and then do the same for 3_lp and 9_llc.

waldoj commented 9 years ago

We're converting industry codes to names in the transformation stanza.

waldoj commented 9 years ago

Yup, it's in the generic conversion step:

for index, conversion in enumerate(lookup_table):
    if int(conversion["table-identifier"]) == table_id:
        if conversion["table-code"] == line[name]:
            line[name] = conversion["table-desc"]
            break

I'll need to go through the table mapping to figure out why this isn't working on 3 and 9.
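
For anyone following along, the loop above can be exercised in isolation. This is only a minimal sketch, not crump's actual code: the function name convert_code and the sample rows are made up for illustration, and the "ACTIVE" description is invented. Only the three key names come from the snippet above.

```python
def convert_code(value, table_id, lookup_table):
    """Return the description for `value` in table `table_id`.

    Falls back to returning `value` unchanged when no row matches.
    """
    for conversion in lookup_table:
        # Only rows belonging to the requested table are considered.
        if int(conversion["table-identifier"]) == table_id:
            if conversion["table-code"] == value:
                return conversion["table-desc"]
    return value


# Hypothetical rows in the shape of 1_tables. The same code ("00") appears
# in two different tables, which is why the table identifier matters.
lookup_table = [
    {"table-identifier": "01", "table-code": "00", "table-desc": "ACTIVE"},
    {"table-identifier": "03", "table-code": "00", "table-desc": "GENERAL"},
]

print(convert_code("00", 3, lookup_table))
print(convert_code("00", 1, lookup_table))
print(convert_code("99", 3, lookup_table))
```

A code with no matching row is passed through untouched, so unconverted codes in the output point to either a missing lookup row or a table_id that never matched.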

waldoj commented 9 years ago

OK, I think I've identified the problem. It is in the above code, specifically in if int(conversion["table-identifier"]) == table_id:. That is, the conversions only occur if the ID of the file being processed matches the table identifier within the conversion file (1_tables). That's a totally wrong assumption: there's no correlation between the table-identifier column and the file numbers. We don't use industry codes (table-identifier: 03) only on file 3 (LP).

That said, I don't understand how this is working at all right now, and yet it is. That is, we're getting the proper corporation statuses (table 01), but that should never be showing up anywhere, given my understanding of how this works.

Anyway, I think the solution to this is to modify the YAML to specify the table identifier for each field for which there is an identifier to cross reference against, using the table types listing to make sure that everything is being used properly. That'll eliminate any need for logic to address this within crump, and make it all rather more clear.
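
A rough sketch of what a per-field table identifier could look like once a YAML table map is loaded into Python. Everything here is an assumption for illustration: the field names, the TABLE_TYPES listing, the helper names, and the "ACTIVE" description are hypothetical, and the real table maps may be shaped differently.

```python
# Hypothetical table types listing: table identifier -> what it holds.
TABLE_TYPES = {1: "corporation status", 3: "industry code"}

# Hypothetical field map, as it might look after loading a YAML table map.
# Fields that need a cross reference carry their own table_id.
field_map = [
    {"name": "name"},
    {"name": "status", "table_id": 1},
    {"name": "industry", "table_id": 3},
]


def check_field_map(fields, table_types):
    """Return the names of fields whose table_id is not a known table type."""
    return [
        field["name"]
        for field in fields
        if "table_id" in field and field["table_id"] not in table_types
    ]


def build_lookup(rows):
    """Index lookup rows by (table identifier, code) for direct access."""
    return {
        (int(row["table-identifier"]), row["table-code"]): row["table-desc"]
        for row in rows
    }


def convert_record(record, fields, lookup):
    """Replace codes with descriptions for every field that has a table_id."""
    converted = dict(record)
    for field in fields:
        table_id = field.get("table_id")
        if table_id is None:
            continue  # plain field, nothing to cross reference
        code = record.get(field["name"])
        converted[field["name"]] = lookup.get((table_id, code), code)
    return converted


rows = [
    {"table-identifier": "01", "table-code": "00", "table-desc": "ACTIVE"},
    {"table-identifier": "03", "table-code": "00", "table-desc": "GENERAL"},
]
record = {"name": "EXAMPLE LP", "status": "00", "industry": "00"}
print(check_field_map(field_map, TABLE_TYPES))
print(convert_record(record, field_map, build_lookup(rows)))
```

With the identifier attached to the field rather than inferred from the file number, the same code value can resolve differently per field, and a field without a table_id is simply left alone.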

waldoj commented 9 years ago

Hey, looks like Past Waldo already took care of this. Thanks, Past Waldo!

So, table_id is the name of a field in the YAML table maps, and it turns out that I just hadn't specified it in 9_llc. That's why the industry codes weren't being converted there. The minimal documentation for 3_lp claims that industry codes aren't used there. I want to research that further, though.

waldoj commented 9 years ago

Despite the SCC's documentation (such as it is), it's not true that industry codes aren't used in 3_lp. There are 80 records that contain industry codes, or 5,787 if you count industry code 00 ("GENERAL"). And I think that it should be counted, because there are also 4,959 records that have no code at all, indicating that 00 isn't a placeholder.
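
A tally like the one above could be produced along these lines. This is a sketch under assumptions: the field name "industry" and the sample records are invented, so the output does not reproduce the real counts.

```python
from collections import Counter


def tally_industry_codes(records, field="industry"):
    """Count records per industry code, grouping blank or missing codes."""
    counts = Counter()
    for record in records:
        code = (record.get(field) or "").strip()
        counts[code if code else "(none)"] += 1
    return counts


# Invented sample records; real data would come from the parsed 3_lp file.
records = [
    {"industry": "00"},
    {"industry": "00"},
    {"industry": ""},
    {"industry": "70"},
    {"industry": None},
]

counts = tally_industry_codes(records)
# Records with a code other than the general "00" code.
specific = sum(n for code, n in counts.items() if code not in ("00", "(none)"))
print(counts)
print(specific)
```

Keeping "00" and "(none)" as separate buckets is what makes it possible to tell a general code apart from a missing one.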

jalbertbowden commented 9 years ago

past waldo is pretty kewl