Closed simonw closed 2 years ago
It looks like this is being caused by the quoting on this line:
"4797","04/20/1993","05/06/1993","National Petroleum Company, ""Sudan"", Ltd.","","525 South Lancaster Street","","Arlington","VA","22204"
I still have no idea how the csv module could ever decide that None: ['22204'] is an OK thing to return, but switching on the dialect doublequote option fixes the problem:
>>> dialect.doublequote = True
>>> reader = csv.DictReader(io.StringIO(decoded), dialect=dialect)
>>> items = list(reader)
>>> [it for it in items if it["Registration_Number"] == '4797']
[{'Registration_Number': '4797',
  'Registration_Date': '04/20/1993',
  'Termination_Date': '05/06/1993',
  'Name': 'National Petroleum Company, "Sudan", Ltd.',
  'Business_Name': '',
  'Address_1': '525 South Lancaster Street',
  'Address_2': '',
  'City': 'Arlington',
  'State': 'VA',
  'Zip': '22204'}]
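A minimal reproduction of the above, as a sketch rather than the exact code path git-history takes. It assumes the header row matches the keys in the output above (the real file may quote it differently) and overrides only doublequote on the default excel dialect. With doublequote off, the first `"` of `""Sudan""` closes the quoted field, so the comma after it counts as a delimiter and the row gains an eleventh field:

```python
import csv
import io

# Header reconstructed from the dict keys above (an assumption about the real file).
HEADER = (
    "Registration_Number,Registration_Date,Termination_Date,Name,"
    "Business_Name,Address_1,Address_2,City,State,Zip"
)
# The problem row, copied from the file.
ROW = (
    '"4797","04/20/1993","05/06/1993",'
    '"National Petroleum Company, ""Sudan"", Ltd.",'
    '"","525 South Lancaster Street","","Arlington","VA","22204"'
)


def parse_row(doublequote):
    """Parse the sample with the excel dialect, overriding only doublequote."""
    sample = io.StringIO(HEADER + "\n" + ROW + "\n")
    reader = csv.DictReader(sample, doublequote=doublequote)
    return next(reader)


broken = parse_row(doublequote=False)
fixed = parse_row(doublequote=True)

print(broken[None])   # the surplus field, stored under the None key
print(broken["Zip"])  # every column after Name is shifted by one
print(fixed["Name"])  # the doubled quotes collapse to single quotes
```

If this matches what the sniffer did, the likely explanation is that the sampled bytes contained no doubled quotes, so the guessed dialect had doublequote turned off.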
I tried fixing this by bumping up the dialect sniffer to use the entire content, not just the first 512 bytes - but doing so gave me a weird result where some pages were decoded as having Registration Number rather than Registration_Number as the key, breaking things.
For this particular file I'm going to pass --convert with an explicit encoding, but maybe the --csv mode needs to grow some extra options?
I tried fixing this by bumping up the dialect sniffer to use the entire content, not just the first 512 bytes - but doing so gave me a weird result where some pages were decoded as having Registration Number rather than Registration_Number as the key, breaking things.
Actually that wasn't a CSV parsing problem - it turns out there are versions of that file which DO use alternative column headings: https://github.com/simonw/fara-history/blob/6961011b11f9b5c58d5dd5703f80390218fb7adf/FARA_All_Registrants.csv
Trying this:
git-history file fara.db \
../fara-history/FARA_All_Registrants.csv \
--repo ../fara-history --id "Registration_Number" \
--changed --branch master --convert '
decoded = content.decode("utf-8")
reader = csv.DictReader(io.StringIO(decoded), dialect="excel")
for row in reader:
    yield dict((key.replace(" ", "_"), value) for key, value in row.items())
' --import io --import csv
It worked!
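For trying the same logic outside of git-history, here is a standalone sketch of that --convert body wrapped in a function. The function name `convert` and the two sample byte strings are made up for illustration; only the decode, DictReader and key-renaming steps come from the command above.

```python
import csv
import io


def convert(content):
    """Yield one dict per CSV row, replacing spaces in column names with underscores."""
    decoded = content.decode("utf-8")
    reader = csv.DictReader(io.StringIO(decoded), dialect="excel")
    for row in reader:
        yield {key.replace(" ", "_"): value for key, value in row.items()}


# Hypothetical samples: the same row under the two header styles.
old_style = b'"Registration Number","Name"\r\n"4797","Example"\r\n'
new_style = b'"Registration_Number","Name"\r\n"4797","Example"\r\n'

old_rows = list(convert(old_style))
new_rows = list(convert(new_style))
print(old_rows == new_rows)
```

One caveat: if a row has more fields than the header, DictReader adds a None key and `key.replace` would raise, so this only works once the quoting is being parsed correctly.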
Running against https://github.com/simonw/fara-history
After much debugging, it turns out the problem is running the CSV parser against this specific revision of the file: https://github.com/simonw/fara-history/blob/ab27087f642680697db6c914d094bf3d06b363f3/FARA_All_Registrants.csv
Here's what's happening:
What is going on with that last item of None: ['22204']?
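For reference, the None key is how csv.DictReader reports a row with more fields than the header: the surplus values are collected into a list under restkey, which defaults to None. A small sketch with made-up data (the restkey name "extra" is arbitrary):

```python
import csv
import io

# Two header columns, but the data row has three fields.
row = next(csv.DictReader(io.StringIO("a,b\n1,2,3\n")))
print(row)

# Same data with a named restkey, which makes surplus fields easier to spot.
named = next(csv.DictReader(io.StringIO("a,b\n1,2,3\n"), restkey="extra"))
print(named)
```

So a None key usually means the row was split into more fields than expected, not that the file contains a column with no name.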