mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.49k stars, 65 forks

Join too slow #67

Closed: happydentist closed this 2 years ago

happydentist commented 2 years ago

This is good software, but the performance of joins is poor. I have two CSV files, each about 1 gigabyte in size, with 40 and 20 columns respectively. I tried to join the two files on a column condition, waited a long time, and then gave up. Can this software use indexes like an RDBMS, or some other efficient method? Thanks!

ondohotola commented 2 years ago

I don’t think this is what CSVQ is for.

I would load such large data sets in SQLITE3.

el

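As a rough illustration of the SQLite suggestion above, the following sketch loads two CSV files into SQLite with Python's standard `sqlite3` and `csv` modules, indexes the join column, and runs the join. The function names `load_csv` and `indexed_join` are hypothetical helpers, not part of csvq or SQLite. The sketch assumes each file has a header row, that table and column names contain no double quotes, and that every value can be stored as text.

```python
import csv
import sqlite3


def load_csv(conn, table, path):
    """Load a CSV file with a header row into a new SQLite table.

    Hypothetical helper: column names come from the header row and
    every value is stored as text.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{name}"' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        # executemany consumes the reader row by row, so the whole
        # file is never held in memory at once.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


def indexed_join(conn, left, right, key):
    """Index the join column of `right`, then join `left` to it on `key`."""
    conn.execute(f'CREATE INDEX "idx_{right}_{key}" ON "{right}" ("{key}")')
    return conn.execute(
        f'SELECT * FROM "{left}" JOIN "{right}" USING ("{key}")'
    )
```

For files of around 1 gigabyte, passing a file path to `sqlite3.connect` rather than `":memory:"` keeps the tables on disk, so the load and the index are paid for once and reused by later queries.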

mithrandie commented 2 years ago

Csvq does not support indexes, and there is no plan to implement them in the future. In such cases, you should use an RDBMS.

Csvq assumes one-time query execution: it reads and processes the csv files on every run. Creating an index would itself require a full scan of every line, so indexing is almost useless for a one-time query.
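To make the point about full scans concrete, here is a minimal sketch of a hash join in Python. It is not csvq's implementation; `hash_join` is a hypothetical function for illustration. Building the in-memory lookup table, which plays the role of an index, already reads every row of one input, so for a query that runs only once the index cannot save that scan.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Join two iterables of dict rows on `key`, reading each input once.

    Hypothetical illustration, not csvq's actual join algorithm.
    """
    index = defaultdict(list)
    # Building the lookup table requires a full scan of the right input.
    for row in right_rows:
        index[row[key]].append(row)
    # One pass over the left input probes the table for matches.
    for row in left_rows:
        for match in index.get(row[key], ()):
            yield {**match, **row}
```

The rows can come from `csv.DictReader`, for example. The lookup table holds the whole right input in memory, which is one reason a database with on-disk indexes is the better fit for gigabyte-sized files.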

derekmahar commented 2 years ago

Package data.table in R can read large CSV files and join large tables with very good performance. This data.table cheatsheet shows various data.table operations.

ondohotola commented 2 years ago

R can do much better than just data.table.

There is a specific package to read CSV, and then you can save it in RDS files, which load extremely quickly.

And there is tidyverse, which allows you to 'pipe' packages and makes the scripts quite readable.

CSVQ was never intended to do huge tables/joins/indexes...

el
