simonw / csvs-to-sqlite

Convert CSV files into a SQLite database
Apache License 2.0
878 stars 69 forks source link

Column `filing_id` not found despite its manifest existence #90

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

I am trying to load 18 CSV files from CorpWatch into a SQLite database using csvs-to-sqlite. I will spare you the deprecation warnings for error_bad_lines.

michael$ time /Users/michael/Library/Python/3.8/bin/csvs-to-sqlite ~/Downloads/corpwatch_api_tables_csv ~/Temp/corpwatch.db -s $'\t'

  return pd.read_csv(
/Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (8,10) have mixed types. Specify dtype option on import or set low_memory=False.

  return pd.read_csv(
/Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (17) have mixed types. Specify dtype option on import or set low_memory=False.

  return pd.read_csv(
/Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (5,6,7,8,9) have mixed types. Specify dtype option on import or set low_memory=False.

  return pd.read_csv(
/Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (3,6,7) have mixed types. Specify dtype option on import or set low_memory=False.
  return pd.read_csv(
Loaded 18 dataframes
Traceback (most recent call last):
  File "/Users/michael/Library/Python/3.8/bin/csvs-to-sqlite", line 8, in <module>
    sys.exit(cli())
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/cli.py", line 251, in cli
    df.to_sql(df.table_name, conn, if_exists="append", index=False)
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py", line 2951, in to_sql
    return sql.to_sql(
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 697, in to_sql
    return pandas_sql.to_sql(
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 2190, in to_sql
    return table.insert(chunksize, method)
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 950, in insert
    num_inserted = exec_insert(conn, keys, chunk_iter)
  File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 1896, in _execute_insert
    conn.executemany(self.insert_statement(num_rows=1), data_list)
sqlite3.OperationalError: table ./company_filings has no column named filing_id

real    1m30.052s
user    1m12.763s
sys 0m16.640s

So the complaint is about company_filings.csv not having a filing_id. Yet:

michael$ head ~/Downloads/corpwatch_api_tables_csv/company_filings.csv filing_id cik year quarter period_of_report filing_date form_10k_url sec_21_url 47 1000180 2008 1 20071230 2008-02-25 http://www.sec.gov/Archives/edgar/data/1000180/0000950134-08-003259.txt http://www.sec.gov/Archives/edgar/data/1000180/000095013408003259/f38194exv21w1.htm 104 1000209 2008 1 20071231 2008-03-13 http://www.sec.gov/Archives/edgar/data/1000209/0001193125-08-055644.txt http://www.sec.gov/Archives/edgar/data/1000209/000119312508055644/dex211.htm 121 1000228 2008 1 20071229 2008-02-26 http://www.sec.gov/Archives/edgar/data/1000228/0000950123-08-002119.txt http://www.sec.gov/Archives/edgar/data/1000228/000095012308002119/y50229exv21w1.htm 152 1000229 2008 1 0 2008-02-22 http://www.sec.gov/Archives/edgar/data/1000229/0001000229-08-000005.txt NULL 174 1000230 2008 1 0 2008-01-29 http://www.sec.gov/Archives/edgar/data/1000230/0001193125-08-014289.txt NULL 186 1000232 2008 1 20071231 2008-03-31 http://www.sec.gov/Archives/edgar/data/1000232/0001000232-08-000005.txt http://www.sec.gov/Archives/edgar/data/1000232/000100023208000005/exhibit21subsidiaries.txt 213 1000234 2008 1 20070924 2008-03-17 http://www.sec.gov/Archives/edgar/data/1000234/0001000234-08-000003.txt http://www.sec.gov/Archives/edgar/data/1000234/000100023408000003/ex21.htm 214 1000234 2008 1 20071231 2008-03-17 http://www.sec.gov/Archives/edgar/data/1000234/0001000234-08-000004.txt http://www.sec.gov/Archives/edgar/data/1000234/000100023408000004/ex21.htm 462 1000623 2008 1 20071231 2008-03-07 http://www.sec.gov/Archives/edgar/data/1000623/0001047469-08-002365.txt http://www.sec.gov/Archives/edgar/data/1000623/000104746908002365/a2183428zex-21.htm



Why is this column not being found?
MichaelTiemannOSC commented 2 years ago

Manually setting low_memory to False in utils.py solves the problem.