csvs-to-sqlite 2.0: dropping Pandas in favour of sqlite-utils

simonw / csvs-to-sqlite

Convert CSV files into a SQLite database

Apache License 2.0

878 stars 69 forks source link

csvs-to-sqlite 2.0: dropping Pandas in favour of sqlite-utils #69

Open simonw opened 4 years ago

simonw commented 4 years ago

My sqlite-utils library has evolved to the point where I think it would make a good foundation for the next version of csvs-to-sqlite.

The main feature I'm excited about here is being able to handle giant CSV files - right now they have to be loaded into memory by Pandas, but sqlite-utils has similar functionality which handles them as streams, reducing the amount of memory needed to consume a huge file.

I intend to keep as much of the CLI API the same for the new version, but this is a big change so it's likely some cases will break. As such, I intend to keep the 1.x branch around (and maintained with bug fixes) for users who find that 2.0 doesn't work for them.

I'll pin this issue for a few weeks so people can comment on this plan before I start executing.

simonw commented 4 years ago

I need to be able to detect likely column types for this - see https://github.com/simonw/sqlite-utils/issues/179

victornoel commented 3 years ago

@simonw it will also be needed to be able to specify the separator: I couldn't find a way to specify it with sqllite-utils insert.

plpxsk commented 3 years ago

Loading large files quicker will be a welcome boost!

I have use cases where I need to load large RNA sequencing data, of ~20-30 GBs, into sqlite tables, and using pandas is a bottleneck. I'm currently exploring sqlite-utils insert but would welcome the approachability of csvs-to-sqlite (eg, super simple to dump a whole directory of files into a sqlite db)

EDIT: clarify what I mean

deverman commented 3 years ago

Should I just use sqlite-utils and not this library? Especially when just beginning?

plpxsk commented 3 years ago

Should I just use sqlite-utils and not this library? Especially when just beginning?

csvs-to-sqlite still seems great for dumping a large number of (not too huge) CSVs into a single db (sqlite) file.

For example, if you have a directory or subdirectories of CSVs that you want to bundle together:

csvs-to-sqlite ~/Downloads/*.csv my-downloads.db
csvs-to-sqlite ~/path/to/directory all-my-csvs.db
etc

AFAIK, sqlite-utils would need you to write additional code to handle more than 1 CSV (or TSV/JSON) file at a time.

simonw commented 3 years ago

Should I just use sqlite-utils and not this library? Especially when just beginning?

If your needs are simple - just loading a single CSV file - then yes, I'd recommend sqlite-utils instead - it has better performance as it works by streaming files rather than loading them all into memory.

csvs-to-sqlite is still better if you want to transform a folder full of files.