openva / crump

A parser for the Virginia State Corporation Commission's business registration records.
https://vabusinesses.org/
MIT License

Blank dates aren't being handled properly by Elasticsearch #70

Closed waldoj closed 10 years ago

waldoj commented 10 years ago

Elasticsearch throws an error in response to blank date fields (e.g., "expiration-date": ""), like so:

{"create": {"_index":"finance","_type":"2","_id":"mYx5To8sTEaD_AyuIT6Dqg","error":"MapperParsingException[failed to parse [expiration-date]]; nested: MapperParsingException[failed to parse date field [], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: \"\"]; "}}

My guess is that blank dates should be set to null, rather than an empty string.
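
One way to act on that guess, sketched under assumptions: before a record is serialized for Elasticsearch, swap empty strings in date fields for None, which json.dumps emits as null. The helper name blank_dates_to_none and the DATE_FIELDS tuple are hypothetical and not part of Crump; the sample record is made up.

```python
import json

# Hypothetical list of date fields; the real set depends on the record layout.
DATE_FIELDS = ("expiration-date",)


def blank_dates_to_none(record, date_fields=DATE_FIELDS):
    """Return a copy of record with empty date strings replaced by None."""
    cleaned = dict(record)
    for name in date_fields:
        if cleaned.get(name) == "":
            cleaned[name] = None
    return cleaned


# Made-up record for illustration only.
record = {"name": "EXAMPLE CORP", "expiration-date": ""}
print(json.dumps(blank_dates_to_none(record)))
```

Non-empty dates pass through unchanged, so the helper can be applied to every record.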

waldoj commented 10 years ago

So I tried changing:

if line[name] == "0000-00-00":
    line[name] = ""

to:

if line[name] == "0000-00-00":
    line[name] = 'null'

but that only produced JSON values of "null" (with quotes) rather than null, and likewise literal "null" values in the CSV. So I've got to figure out how to make sure that we wind up with actual blank (zero-length) fields in the CSV, but null values in the JSON.

I'm going to try setting a value of None (which is null within Python) and see if that helps.
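
A minimal sketch of why None might satisfy both outputs, assuming Python 3 and the standard csv and json modules (the thread does not show how Crump actually writes its files): csv writers emit None as an empty field, and json.dumps emits it as null. The record and field order below are made up for illustration.

```python
import csv
import io
import json

# Made-up record; "0000-00-00" is the placeholder date from the source data.
line = {"name": "EXAMPLE CORP", "expiration-date": "0000-00-00"}

if line["expiration-date"] == "0000-00-00":
    line["expiration-date"] = None  # a real None, not the string 'null'

# JSON output: None is serialized as an unquoted null.
json_out = json.dumps(line)

# CSV output: None is written as a zero-length field.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "expiration-date"])
writer.writerow(line)
csv_out = buffer.getvalue()

print(json_out)
print(repr(csv_out))
```

This only holds if the value is still None when it reaches the writers; any step that converts it to a string first changes the result.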

waldoj commented 10 years ago

None is making trouble with the remove_non_ascii() function.
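
If a string-cleaning function does choke on None, one common guard is to pass None through untouched. The body below is a hypothetical stand-in written for Python 3, not Crump's actual remove_non_ascii(), whose implementation is not shown in this thread.

```python
def remove_non_ascii(text):
    """Hypothetical stand-in: drop non-ASCII characters, passing None through."""
    if text is None:
        # Nothing to clean; keep the null marker intact for later serialization.
        return None
    return "".join(ch for ch in text if ord(ch) < 128)


print(remove_non_ascii("café"))
print(remove_non_ascii(None))
```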

waldoj commented 10 years ago

I think closing #23 will fix this.

waldoj commented 10 years ago

OK, I have no idea why remove_non_ascii() was ever an issue (I'm not sure that it was). The real problem here is that the value of None is being quoted ("None"), and thus not being changed into null by json.dumps. I haven't yet figured out where that quoting is happening.
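
For illustration, here is how a stringified None behaves in json.dumps, assuming Python 3. Where the conversion happens in Crump is not established in this thread; a str() call somewhere in the pipeline is only one possible cause, used here as an assumption.

```python
import json

# If None has been turned into a string (for example via str()),
# json.dumps treats it as ordinary text and quotes it.
quoted = json.dumps({"expiration-date": str(None)})

# A real None is serialized as an unquoted null.
unquoted = json.dumps({"expiration-date": None})

print(quoted)
print(unquoted)
```

Checking type(value) just before the json.dumps call is a quick way to tell the two cases apart.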