simonw / git-history

Tools for analyzing Git history using SQLite
Apache License 2.0
191 stars 18 forks source link

Add a `--dialect` option for forcing a CSV dialect #27

Closed simonw closed 2 years ago

simonw commented 2 years ago

Running against https://github.com/simonw/fara-history

(git-history) git-history % git-history file fara.db ../fara-history/FARA_All_Registrants.csv --repo ../fara-history --id Registration_Number --changed --branch master --csv
  [------------------------------------]  1/376    0%Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/bin/git-history", line 33, in <module>
    sys.exit(load_entry_point('git-history', 'console_scripts', 'git-history')())
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/git-history/git_history/cli.py", line 246, in file
    item = fix_reserved_columns(item)
  File "/Users/simon/Dropbox/Development/git-history/git_history/utils.py", line 8, in fix_reserved_columns
    if not any(reserved_with_suffix_re.match(key) for key in item):
  File "/Users/simon/Dropbox/Development/git-history/git_history/utils.py", line 8, in <genexpr>
    if not any(reserved_with_suffix_re.match(key) for key in item):
TypeError: expected string or bytes-like object

After much debugging, it turns out the problem is running the CSV parser against this specific revision of the file: https://github.com/simonw/fara-history/blob/ab27087f642680697db6c914d094bf3d06b363f3/FARA_All_Registrants.csv

Here's what's happening:

>>> import csv, httpx, io
>>> content = httpx.get("https://raw.githubusercontent.com/simonw/fara-history/ab27087f642680697db6c914d094bf3d06b363f3/FARA_All_Registrants.csv").content
>>> decoded = content.decode("utf-8")
>>> dialect = csv.Sniffer().sniff(decoded[:512])
>>> (dialect.delimiter, dialect.doublequote, dialect.escapechar, dialect.lineterminator, dialect.quotechar, dialect.quoting, dialect.skipinitialspace)
(',', False, None, '\r\n', '"', 0, False)
>>> reader = csv.DictReader(io.StringIO(decoded), dialect=dialect)
>>> items = list(reader)
>>> [it for it in items if it["Registration_Number"] == '4797']
[{'Registration_Number': '4797',
  'Registration_Date': '04/20/1993',
  'Termination_Date': '05/06/1993',
  'Name': 'National Petroleum Company, "Sudan""',
  'Business_Name': ' Ltd."',
  'Address_1': '',
  'Address_2': '525 South Lancaster Street',
  'City': '',
  'State': 'Arlington',
  'Zip': 'VA',
  None: ['22204']}]

What is going on with that last item of None: ['22204']?

simonw commented 2 years ago

It looks like this is being caused by the quoting on this line:

"4797","04/20/1993","05/06/1993","National Petroleum Company, ""Sudan"", Ltd.","","525 South Lancaster Street","","Arlington","VA","22204"

simonw commented 2 years ago

I still have no idea how the csv module could ever decide that None: ['2204'] is an OK thing to return, but switching on the dialect doublequote option fixes the problem:

>>> dialect.doublequote = True
>>> reader = csv.DictReader(io.StringIO(decoded), dialect=dialect)
>>> items = list(reader)
>>> [it for it in items if it["Registration_Number"] == '4797']
[{'Registration_Number': '4797',
  'Registration_Date': '04/20/1993',
  'Termination_Date': '05/06/1993',
  'Name': 'National Petroleum Company, "Sudan", Ltd.',
  'Business_Name': '',
  'Address_1': '525 South Lancaster Street',
  'Address_2': '',
  'City': 'Arlington',
  'State': 'VA',
  'Zip': '22204'}]
simonw commented 2 years ago

I tried fixing this by bumping up the dialect sniffer to use the entire content, not just the first 512 bytes - but doing so gave me a weird result where some pages were decoded as having Registration Number rather than Registration_Number as the key, breaking things.

For this particular file I'm going to pass --convert with an explicit encoding, but maybe the --csv mode needs to grow some extra options?

simonw commented 2 years ago

I tried fixing this by bumping up the dialect sniffer to use the entire content, not just the first 512 bytes - but doing so gave me a weird result where some pages were decoded as having Registration Number rather than Registration_Number as the key, breaking things.

Actually that wasn't a CSV parsing problem - it turns out there are versions of that file which DO use alternative column headings: https://github.com/simonw/fara-history/blob/6961011b11f9b5c58d5dd5703f80390218fb7adf/FARA_All_Registrants.csv

simonw commented 2 years ago

Trying this:

which git-history file fara.db \
  ../fara-history/FARA_All_Registrants.csv \
  --repo ../fara-history --id "Registration_Number" \
  --changed --branch master --convert '
decoded = content.decode("utf-8")
reader = csv.DictReader(io.StringIO(decoded), dialect="excel")
for row in reader:
  yield dict((key.replace(" ", "_"), value) for key, value in row.items())
' --import io --import csv

It worked!

simonw commented 2 years ago

Documented here: https://github.com/simonw/git-history/tree/3458d4f339c95608862818b0b375b62fcfbd7896#additional-options