Closed Ben8t closed 9 months ago
simple flow works
id: duckdb_query
namespace: dev
tasks:
- id: download_csv
type: io.kestra.plugin.fs.http.Download
uri: https://raw.githubusercontent.com/kestra-io/datasets/main/csv/salaries.csv
- id: avg_salary_by_job_title
type: io.kestra.plugin.jdbc.duckdb.Query
inputFiles:
data.csv: "{{ outputs.download_csv.uri }}"
sql: |
SELECT
job_title,
ROUND(AVG(salary),2) AS avg_salary
FROM read_csv_auto('{{workingDir}}/data.csv', header=True)
GROUP BY job_title
HAVING COUNT(job_title) > 10
ORDER BY avg_salary DESC;
store: true
- id: ion_to_excel
type: io.kestra.plugin.serdes.excel.IonToExcel
from: "{{ outputs.avg_salary_by_job_title.uri }}"
Even large file that exceeds the allowed nr of rows in Excel worked without hitting any memory limits
might be challenging to reproduce
id: parquet_to_excel
namespace: dev
tasks:
- id: parquet
type: io.kestra.plugin.jdbc.duckdb.Query
sql: |
INSTALL parquet;
LOAD parquet;
INSTALL httpfs;
LOAD httpfs;
SELECT * from read_parquet('https://huggingface.co/datasets/kestra/datasets/resolve/main/jaffle-large/raw_items.parquet?download=true');
store: true
- id: ion_to_excel
type: io.kestra.plugin.serdes.excel.IonToExcel
from: "{{ outputs.parquet.uri }}"
limiting query results to 1 mio rows does reproduce the error:
id: parquet_to_excel
namespace: dev
tasks:
- id: parquet
type: io.kestra.plugin.jdbc.duckdb.Query
sql: |
INSTALL parquet;
LOAD parquet;
INSTALL httpfs;
LOAD httpfs;
SELECT *
FROM read_parquet('https://huggingface.co/datasets/kestra/datasets/resolve/main/jaffle-large/raw_items.parquet?download=true')
LIMIT 1000000;
store: true
- id: ion_to_excel
type: io.kestra.plugin.serdes.excel.IonToExcel
from: "{{ outputs.parquet.uri }}"
Expected Behavior
Using the
IonToExcel
shows some memory issues.The current flow is running in parallel with ~10/40mb Excel files
Actual Behaviour
No response
Steps To Reproduce
You can find some csv files up to 200mb here if needed https://www.stats.govt.nz/large-datasets/csv-files-for-download/ And Excel file sample (small one) https://go.microsoft.com/fwlink/?LinkID=521962
Environment Information
Example flow
No response