Open simonw opened 4 years ago
I need to be able to detect likely column types for this - see https://github.com/simonw/sqlite-utils/issues/179
@simonw it will also be needed to be able to specify the separator: I couldn't find a way to specify it with sqllite-utils insert
.
Loading large files quicker will be a welcome boost!
I have use cases where I need to load large RNA sequencing data, of ~20-30 GBs, into sqlite tables, and using pandas is a bottleneck. I'm currently exploring sqlite-utils insert
but would welcome the approachability of csvs-to-sqlite
(eg, super simple to dump a whole directory of files into a sqlite db)
EDIT: clarify what I mean
Should I just use sqlite-utils and not this library? Especially when just beginning?
Should I just use sqlite-utils and not this library? Especially when just beginning?
csvs-to-sqlite
still seems great for dumping a large number of (not too huge) CSVs into a single db (sqlite) file.
For example, if you have a directory or subdirectories of CSVs that you want to bundle together:
csvs-to-sqlite ~/Downloads/*.csv my-downloads.db
csvs-to-sqlite ~/path/to/directory all-my-csvs.db
etc
AFAIK, sqlite-utils
would need you to write additional code to handle more than 1 CSV (or TSV/JSON) file at a time.
Should I just use sqlite-utils and not this library? Especially when just beginning?
If your needs are simple - just loading a single CSV file - then yes, I'd recommend sqlite-utils
instead - it has better performance as it works by streaming files rather than loading them all into memory.
csvs-to-sqlite
is still better if you want to transform a folder full of files.
My sqlite-utils library has evolved to the point where I think it would make a good foundation for the next version of
csvs-to-sqlite
.The main feature I'm excited about here is being able to handle giant CSV files - right now they have to be loaded into memory by Pandas, but sqlite-utils has similar functionality which handles them as streams, reducing the amount of memory needed to consume a huge file.
I intend to keep as much of the CLI API the same for the new version, but this is a big change so it's likely some cases will break. As such, I intend to keep the
1.x
branch around (and maintained with bug fixes) for users who find that 2.0 doesn't work for them.I'll pin this issue for a few weeks so people can comment on this plan before I start executing.