mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.49k stars 65 forks source link

Can it replace the H2 database engine? #104

Open hw2499 opened 1 year ago

hw2499 commented 1 year ago

The functionality is very powerful. I have a question. If a csv file has a size of 1TB, what would the query efficiency be? Does csvq want to read all 1TB of content into memory before subsequent processing? Can it replace the H2 database engine?

mithrandie commented 1 year ago

No, csvq reads all the data into memory at runtime, so trying to handle a 1TB file is nearly impossible.

hw2499 commented 1 year ago
  1. Is there any suggestion on how large a file size csvq is suitable for processing?
  2. How can I solve this 1TB file scenario?
mithrandie commented 1 year ago

The file size you can handle depends on the query you want to execute and your system.

Csvq is not a DBMS, but an SQL interpreter that executes queries against text files such as csv. I don't know what you want to do with the file and how you want to handle it, so I can't say what method would be appropriate. In general, however, it is appropriate to use a DBMS that is designed to handle large amounts of data.

hw2499 commented 1 year ago

I want to use this SQL interpreter to process large text files, but I am not sure how efficient it is to process large files. If it is very efficient, it can do many things in memory databases

derekmahar commented 1 year ago

Have you tried DuckDB?

hw2499 commented 1 year ago

I am not looking for a database, but for a solution that can process big data in memory. I feel that CVSQ is very powerful.

derekmahar commented 1 year ago

I agree that csvq is a very capable tool for querying CSV files using SQL and I still use it where it makes sense. DuckDB can do many of the same things and can process as much data as will fit in memory without ever creating a single table. Like csvq, it can read this data directly from CSV files, but also from Parquet files, and write results to such files. You need not create a DuckDB database to query very large CSV files in memory!

hw2499 commented 1 year ago

What types of databases does DuckDB belong to? I read the document and feel that it is similar to a sqlite database. What are the main scenarios in which DuckDB is used?

derekmahar commented 1 year ago

DuckDB is similar to SQLite in that it is an embedded database, but unlike SQLite, DuckDB is a column-oriented database such that it stores data column-wise, rather than row-wise. Column stores like DuckDB are designed for analysis of large data sets while SQLite is designed more for transaction processing. Neither has a server component, so they both are designed for local data processing on a personal computer. Column-oriented databases use various column data compression methods to store data more efficiently, scan and retrieve only the columns that the query selects (good for querying wide tables), and generally execute aggregate queries on columns more quickly.

hw2499 commented 1 year ago
  1. Can duckDB process (query) big data like 1TB in memory, or are there any other solutions?
  2. Does duckDB Client API support golang?
derekmahar commented 1 year ago
  1. Can duckDB process (query) big data like 1TB in memory, or are there any other solutions?

DuckDB streams the input and results of most (or many) query operations, so most (or many) DuckDB commands can query very large data sets, even 1 TB, on computers that have much lower memory capacity.

  1. Does duckDB Client API support golang?

https://pkg.go.dev/github.com/benjajaja/go-duckdb

hw2499 commented 1 year ago

1) ok, I will look into the use of duckdb and then adapt it to etl-engine products. https://github.com/hw2499/etl-engine 2) What olap functions does duckbd support?

kpym commented 1 year ago

@hw2499 , @derekmahar Maybe the end of this discussion should be moved to the DuckDB Discussion Forum ? IMHO, the csvq issue tracker is not the place for this.

derekmahar commented 1 year ago

@kpym, I agree. I actually didn't realise that this discussion was about a csvq issue. I thought it was a csvq discussion topic. In any case, in the DuckDB discussion forums on GitHub or the DuckDB Discord server, the DuckDB developers and other users could better answer @hw2499's questions.

hw2499 commented 1 year ago

Okay, thank you. @kpym @derekmahar