palewire / django-calaccess-raw-data

A Django app to download, extract and load campaign finance and lobbying activity data from the California Secretary of State's CAL-ACCESS database
http://django-calaccess.californiacivicdata.org/
MIT License

csv or csvkit not protecting commas in clean #1506

Closed rkiddy closed 6 years ago

rkiddy commented 6 years ago

I can see what is happening here but not why.

I can see in calaccess_raw/management/commands/cleancalaccessrawfile.py that the csv writer function is used, and it should protect commas when it converts from tab-separated to comma-separated text.

For example:

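import csv
# Python 3: open in text mode; newline='' lets the csv module control line endings
with open('eggs.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile)
    spamwriter.writerow(['Spam'] * 5 + ['Baked Beans'])
    # The field containing a comma is quoted automatically (QUOTE_MINIMAL)
    spamwriter.writerow(['Spam', 'Lovely, Spam', 'Wonderful Spam'])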

gives:

Spam,Spam,Spam,Spam,Spam,Baked Beans
Spam,"Lovely, Spam",Wonderful Spam

as it should.
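The same behaviour can be checked for a tab-separated to comma-separated conversion. This is only a minimal sketch using the standard library csv module, not the actual code in cleancalaccessrawfile.py; the helper name tsv_to_csv and the shortened sample row are made up for illustration.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text (hypothetical helper)."""
    # QUOTE_NONE: treat quote characters in the TSV input as ordinary data.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    # The default QUOTE_MINIMAL quotes only fields that contain the delimiter,
    # the quote character, or a line break.
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Shortened, illustrative row: one field contains a comma, the last is empty.
sample = "1080694\tCOM\tHorton for Assembly, Shirley\t\n"
converted = tsv_to_csv(sample)
print(converted)
```

If the writer is doing its job, the field with the embedded comma comes out wrapped in double quotes and the other fields are left bare.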

But see the attached file. In line 1244683, I see this in the tsv file:

LINE 1244683:
1: 1080694
2: 0
3: 5
4: EXPN
5: F461P5
6: E11
7: COM
8: Horton for Assembly, Shirley
9: 
10: 

and this in the csv file:

LINE 1244683:
1: 1080694
2: 0
3: 6
4: EXPN
5: F461P5
6: E8
7: COM
8: "Hunter for Assembly
9:  Tricia"
10: 

Note that this is not actually marked as an error. The error log points to line 1244682 and I am not seeing, yet, what is wrong with that line.

Things I will check:

1) I will pull these 3 or 4 lines into a test file and check how the stand-alone call to writer handles them. Perhaps there is something weird in the line that I cannot see.

2) Why is the import of writer from csvkit there? It is from csv. Or at least, this is how it appears. Is this bringing in a copy of writer that I am not expecting?

3) Is there a number 3?
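For check 1, something like the following could pull a few lines out of the big file and show any bytes that are not visible in an editor. It is a rough sketch: extract_lines and reprs are hypothetical helpers, and the demo runs on a throwaway file rather than the real TSV.

```python
import itertools
import os
import tempfile


def extract_lines(src_path, dst_path, first, last):
    """Copy lines first..last (1-indexed, inclusive) from src_path to dst_path."""
    # Binary mode keeps every byte as-is, including odd control characters.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in itertools.islice(src, first - 1, last):
            dst.write(line)


def reprs(path):
    """Return repr() of each raw line so invisible bytes show up."""
    with open(path, "rb") as f:
        return [repr(line) for line in f]


# Demo on a made-up file; a real run would point at the downloaded TSV
# and use line numbers around the one named in the error log.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "input.tsv")
dst = os.path.join(tmpdir, "sample.tsv")
with open(src, "wb") as f:
    f.write(b"a\t1\nb\t2\x00\nc\t3\r\nd\t4\n")

extract_lines(src, dst, 2, 3)
sample_reprs = reprs(dst)
for r in sample_reprs:
    print(r)
```

A stray NUL byte or carriage return shows up as \x00 or \r in the repr output, which is the kind of thing that is easy to miss when just looking at the line.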

errs.html.gz

rkiddy commented 6 years ago

Crap. Pilot error. It is my splitter that is not seeing that the comma is protected. Well. Something is wrong around that line. We will see.
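To illustrate the splitter problem: splitting on every comma ignores the quotes, while csv.reader honours them. The line below is reconstructed from the field dump of the csv file above, so treat it as an approximation of the real row, not a copy of it.

```python
import csv

# Approximate CSV line rebuilt from the field dump; the real row may have more fields.
line = '1080694,0,6,EXPN,F461P5,E8,COM,"Hunter for Assembly, Tricia",'

# Naive split: breaks the quoted name into two fields.
naive_fields = line.split(",")

# csv.reader: keeps the quoted name as one field and drops the quotes.
parsed_fields = next(csv.reader([line]))

print(len(naive_fields), naive_fields[7:9])
print(len(parsed_fields), parsed_fields[7])
```

The naive split yields ten fields, with the name broken across fields 8 and 9 just as in the dump, while csv.reader yields nine with the name intact.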