kestra-io / plugin-serdes

https://kestra.io/plugins/plugin-serdes/
Apache License 2.0
2 stars 5 forks source link

IonToExcel shows memory issue #68

Closed Ben8t closed 9 months ago

Ben8t commented 9 months ago

Expected Behavior

Using the IonToExcel shows some memory issues.

The current flow is running in parallel with ~10/40mb Excel files

id: ExtraccionInformesBancos
namespace: prod.data
labels:
  Origen: FTP Norma43
  Destino: Selene/Compartida/Bancos
inputs:
  - type: STRING
    name: sql
tasks:
  - id: ExtraccionDatos
    type: io.kestra.plugin.jdbc.oracle.Query
    password: "{{namespace.user_bbdd.qonciliador_adm}}"
    sql: "{{inputs.sql}}"
    store: true
    url: "{{namespace.url.kratos}}"
    username: "{{namespace.user_bbdd.qonciliador_adm}}"

  - id: GeneramosInforme
    type: io.kestra.plugin.serdes.excel.IonToExcel
    dateFormat: dd-MM-yyyy
    dateTimeFormat: dd-MM-yyyy'T'HH:mm:ss.SSS[XXX]
    from: "{{outputs.ExtraccionDatos.uri}}"
    header: true

image

Actual Behaviour

No response

Steps To Reproduce

You can find some csv files up to 200mb here if needed https://www.stats.govt.nz/large-datasets/csv-files-for-download/ And Excel file sample (small one) https://go.microsoft.com/fwlink/?LinkID=521962

Environment Information

Example flow

No response

anna-geller commented 9 months ago

simple flow works

id: duckdb_query
namespace: dev

tasks:
  - id: download_csv
    type: io.kestra.plugin.fs.http.Download
    uri: https://raw.githubusercontent.com/kestra-io/datasets/main/csv/salaries.csv

  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT 
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{workingDir}}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true

  - id: ion_to_excel
    type: io.kestra.plugin.serdes.excel.IonToExcel
    from: "{{ outputs.avg_salary_by_job_title.uri }}"

image

anna-geller commented 9 months ago

Even large file that exceeds the allowed nr of rows in Excel worked without hitting any memory limits

image

image

might be challenging to reproduce

id: parquet_to_excel
namespace: dev

tasks:
  - id: parquet
    type: io.kestra.plugin.jdbc.duckdb.Query
    sql: |
      INSTALL parquet;
      LOAD parquet;
      INSTALL httpfs;
      LOAD httpfs;
      SELECT * from read_parquet('https://huggingface.co/datasets/kestra/datasets/resolve/main/jaffle-large/raw_items.parquet?download=true');
    store: true

  - id: ion_to_excel
    type: io.kestra.plugin.serdes.excel.IonToExcel
    from: "{{ outputs.parquet.uri }}"
anna-geller commented 9 months ago

limiting query results to 1 mio rows does reproduce the error:

id: parquet_to_excel
namespace: dev

tasks:
  - id: parquet
    type: io.kestra.plugin.jdbc.duckdb.Query
    sql: |
      INSTALL parquet;
      LOAD parquet;
      INSTALL httpfs;
      LOAD httpfs;
      SELECT * 
      FROM read_parquet('https://huggingface.co/datasets/kestra/datasets/resolve/main/jaffle-large/raw_items.parquet?download=true')
      LIMIT 1000000;
    store: true

  - id: ion_to_excel
    type: io.kestra.plugin.serdes.excel.IonToExcel
    from: "{{ outputs.parquet.uri }}"

image