openva / crump

A parser for the Virginia State Corporation Commission's business registration records.
https://vabusinesses.org/
MIT License

Close gaps in fields #53

Closed by waldoj 9 years ago

waldoj commented 10 years ago

We're getting big gaps within text fields, like this:

{
  "res-exp-date": "2014-07-26",
  "res-zip": "L3520-9000",
  "res-street1": "2 METROPLEX DR STE 500",
  "res-street2": "",
  "res-name": "BEHAVIORAL HEALTH SYSTEMS, INC.",
  "res-city": "BIRMINGHAM",
  "res-type": "C",
  "res-number": "R155520",
  "res-state": "A",
  "res-requestor": "PATRICIA L FRIEDLEY                               BEHAVIORAL HEALTH SYSTEMS INC",
  "res-status": "60"
}

Figure out what the source of this problem is, and eliminate it.

waldoj commented 10 years ago

I suspect that these are actually two separate fields that aren't being broken up properly.

waldoj commented 10 years ago

Nope, doesn't look like it. From the table map:

2 RES-REQUESTOR (A100) Name of Requestor Holding Name

Here are a few other RES-REQUESTOR fields that contain spaces:

TRENT MCKENNA          (DE)                       COMFORT SYSTEMS USA INC
J M OLSHEFSKI                                     AMERICAN E BUILDER INC
LYNN HAYES                                        LECLAIRRYAN A PROFESSIONAL CORPORATION
BARBARA J CUMMINGS                                CHESAPEAKE CORPORATION

I think that these are optional, undocumented subfields within this field. The format seems to be a person's name, optionally a state abbreviation in parentheses, and then a business name. Check with the SCC on this.

waldoj commented 10 years ago

We're getting this in 6_name. Almost all of them follow this format:

USA NATIONAL KARATE-DO FEDERATION OF VIRGINIA     (VA BEACH CI)
NAMEFORYOU.COM                                    (PETERSBURG CI)
JAMES RIVER MARINA                                (NEWPORT NEWS CI)
A ACCIDENT LAW FIRM                               (FAIRFAX CO)
A CRIMINAL TRAFFIC LAW FIRM                       (FAIRFAX CO)
C-MORE COMPETITION                                (PRINCE WILLIAM CO)
VAN KNIEST MILLWORK                               (YORK CO)
CRATE & BARREL                                    (ARLINGTON CO)
LA CUISINE THE COOKS RESOURCE                     (ALEXANDRIA CI)
WINDSOR MANAGEMENT COMPANY                        (FAIRFAX CO)
ARLINGTON CHECKS CASHED                           (ARLINGTON CO)
FISCHER ELECTRICAL                                (ROANOKE CI)
CREATIVEDGE PRODUCTIONS                           (VA BEACH CI)
SUPERIOR WIRELESS                                 (NORFOLK CI)
BON SECOURS RICHMOND HEALTH CARE FOUNDATION       (RICHMOND CI)
BRUNT PROPERTIES                                  (VA BEACH CI)
NATIONAL WINDOW & DOOR                            (MONTGOMERY CO)
CARILION CONSOLIDATED LABORATORY "CCL"            (BRISTOL CI)
CARILION CONSOLIDATED LABORATORY "CCL"            (FRANKLIN CO)
CARILION CONSOLIDATED LABORATORY "CCL"            (ROCKINGHAM CO)
CARILION CONSOLIDATED LABORATORY "CCL"            (HOPEWELL CI)
CARILION CONSOLIDATED LABORATORY "CCL"            (MONTGOMERY CO)
CARILION CONSOLIDATED LABORATORY "CCL"            (BEDFORD CO)
CARILION CONSOLIDATED LABORATORY "CCL"            (ROCKBRIDGE CO)

But not all of them:

VIRGINIA WATERFRONT INTERNATIONAL ARTS FESTIVAL,   INC., THE
NATIONAL SOCIETY OF FUND RAISING EXECUTIVES,      VIRGINIA PIEDMONT CHAPTER, INCORPORATED
LUCENT TECHNOLOGIES TECHNICAL SERVICES COMPANY,    INC.
"CONABOY O T R L OCCUPATIONAL THERAPIST, INC.,     KRISTIN S."
SUD ASSOCIATES, P.A. (FOR USE IN VA: SUD          ASSOCIATES, P.A., P.C.)

It looks like, if character 51 is an open paren, we can assume that we're looking at a geographic identifier. We'd strip out the parens, and the last two characters (CI or CO) indicate whether it's a city or a county.
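Here is a minimal sketch of that check in Python. It assumes the raw value still has its original fixed-width padding, so character 51 is index 50. The function name `split_name_and_place` and the returned tuple layout are hypothetical, not anything that exists in crump today.

```python
import re

# Assumption: the raw value is the fixed-width field exactly as it appears in
# the SCC file, so a geographic note (when present) begins at character 51,
# which is index 50 in Python's zero-based indexing.
SPLIT_AT = 50

# Hypothetical mapping from the two-letter suffix to a readable place type.
PLACE_TYPES = {"CI": "city", "CO": "county"}


def split_name_and_place(raw):
    """Return (name, place_name, place_type); the last two may be None."""
    tail = raw[SPLIT_AT:].strip()
    match = re.fullmatch(r"\((.+) (CI|CO)\)", tail)
    if match is None:
        # No geographic identifier: keep the whole value as the name and
        # collapse the padding into single spaces.
        return re.sub(r"\s{2,}", " ", raw).strip(), None, None
    return raw[:SPLIT_AT].strip(), match.group(1), PLACE_TYPES[match.group(2)]
```

Values that don't have a parenthesized place starting at character 51 (like the "But not all of them" examples) fall through to the whitespace-collapsing branch and come back as a plain name.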

I guess we're going to need some kind of optional field type definable in the YAML table maps. That is, we don't know that these two fields (place name and place type) are going to be present, but if they are present, we want to use them as per the mapping.

waldoj commented 10 years ago

In 7_merger we're getting this, too, but for different reasons:

3 FORIEGN CORPS NOT QUALIFIED IN VIRGINIA         (2 MASSACHUSETTS & 1 GEORGIA)
2 FORIEGN CORPS NOT QUALIFIED IN VIRGINIA         (1 DELEWARE & 1 CALIFORNIA)
BABCOCK & BROWN PARALLEL MEMBER LLC (A DELAWARE   LIMITED LIABILITY COMPANY NOT QUALIFIED IN VA)
NAE FEDERAL CREDIT UNION (A FEDERALLY CHARTERED   CREDIT UNION)
ANSWER ACQUISITION CORPORATION (A DELAWARE        CORPORATION NOT QUALIFIED IN VA)
BANA PRESERVATION CORPORATION (A DELAWARE         CORPORATION NOT QUALIFIED IN VA)
LIBERTY DEVELOPMENT GROUP LLC OF FLORIDA (A       FLORDIA LLC NOT QUALIFIED IN VA)
ASSURED GUARANTY MORTGAGE INSURANCE COMPANY       (A NEW YORK CORPORATION NOT QUALIFIED IN VA)

Note the typos: "FLORDIA," "DELEWARE," and "FORIEGN." My guess is that an SCC admin is adding these notes, and their software is buggy. As with the prior file, the second field starts at character 51.

It looks like we should simply remove any whitespace in excess of one character prior to character 51, and then break apart the field based on the parens. Everything prior to the opening paren is the unqualified_name, and everything inside the parens is the explanation of why the corporation is not qualified in Virginia.
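Roughly, in Python, that could look like the sketch below. It simplifies things by collapsing runs of whitespace anywhere in the value rather than only before character 51, and it splits at the first open paren. `split_unqualified_entity` is a hypothetical name, and the `rationale` label is only for illustration.

```python
import re


def split_unqualified_entity(raw):
    """Return (unqualified_name, rationale); rationale is None without parens."""
    # Simplification: collapse every run of whitespace, not just the padding
    # that precedes character 51.
    collapsed = re.sub(r"\s{2,}", " ", raw).strip()
    # Everything before the first "(" is the name; everything between that
    # paren and the final ")" is the note.
    match = re.fullmatch(r"(.+?)\s*\((.+)\)", collapsed)
    if match is None:
        return collapsed, None
    return match.group(1), match.group(2)
```

A value with no trailing parenthetical comes back unchanged apart from the collapsed whitespace, with `None` as the second element.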

waldoj commented 10 years ago

In 8_registered_names, there are only 173 instances of this. They look like this:

ROBERT A GOULDIN                                  CHRISTIAN & BARTON LLP
VAUGHN M KLOPFENSTEIN          (IA)               COLLINS RADIO COMPANY
JAMES MCCORMICK          (DE)                     SILVERPOP SYSTEMS INC
REXFORD R FISHER SR          (NY)
TRACY A BLEVINS          (DE)
SHAEWN SCHAEFFER          (MD)                    SCHAEFFER APPRAISAL MANAGEMENT COMPANY INC
R NEAL KEESEE JR                                  WOOD ROGERS PLC
CHARLOTTE RAWLS AS PARALEGAL TO T BRAXTON MCKEE   KAUFMAN & CANOLES PC
NORFOLK SOUTHERN CORPORATION                      C/O CORPORATE SECRETARY

So we've got up to three fields here. We've got the requestor, we've got the state of the requestor in parentheses, and we've got the requestor's employer. Well, sometimes. See the Norfolk Southern example, where the requestor is an organization, not an individual, and instead of an employer, we have contact information. It's really the inverse of the other examples. Again, the dividing line is character 51.
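A sketch of splitting that into its parts, again assuming the raw fixed-width value with the break at character 51 (index 50). The function name and the dictionary keys are hypothetical. The third part is called `affiliation` rather than `employer` because of cases like Norfolk Southern, where it isn't an employer at all.

```python
import re

SPLIT_AT = 50  # character 51, counted from zero


def split_requestor(raw):
    """Return a dict with requestor, state, and affiliation; any may be None."""
    head = raw[:SPLIT_AT]
    # Whatever starts at character 51 is the second subfield, if present.
    affiliation = raw[SPLIT_AT:].strip() or None
    state = None
    # Optional two-letter state abbreviation in parens at the end of the
    # first half, e.g. "(DE)".
    match = re.search(r"\(([A-Z]{2})\)\s*$", head)
    if match:
        state = match.group(1)
        head = head[:match.start()]
    return {
        "requestor": head.strip() or None,
        "state": state,
        "affiliation": affiliation,
    }
```

This doesn't try to tell a person from an organization in the first half. That would need to be decided separately, or confirmed with the SCC.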

waldoj commented 10 years ago

Finally, there's 9_llc. Here the gap falls within the name of the organization itself, which is pretty harmful:

Virginia Construction and Developement Company    L.L.C.
RJH AIR CONDITIONING AND REFRIGERATION SERVICE    L.L.C.
U.S. Insurance Group Agency, L.L.C. (USED IN VA   BY: U.S. Insurance Group, L.L.C.)
Northern Virginia Residential Cleaning Services   L.L.C.
Center for Neuromuscular and Massage              Rehabilitation, LLC
WP Company (Delaware) LLC (USED IN VA BY: WP       Company LLC
MEEKS REAL ESTATE, AUCTION & APPRAISAL, LLC,      JESSE
Messer Properties, LLC.  A California Limited      Liability Company (USED IN VA BY:  Messer Propert
Salon J   LLC

This affects 2,455 records, or 0.5% of all LLCs. I think here, the most sensible thing to do is to replace multiple spaces with a single space.
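That replacement is a one-liner with Python's `re` module. The sketch below uses a hypothetical helper name, `collapse_spaces`, and also trims leading and trailing padding, which is an extra assumption on top of what's described above.

```python
import re


def collapse_spaces(name):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", name).strip()


print(collapse_spaces("Salon J   LLC"))
```

Note that this only closes the gap. It doesn't repair names that were truncated or are missing a closing paren in the source data.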