kestra-io / plugin-serdes

https://kestra.io/plugins/plugin-serdes/
Apache License 2.0

XML reader is not working as expected #86

Open shrutimantri opened 7 months ago

Expected Behavior

When an XML file containing items is read, each record should be emitted on its own in ion format, without the wrapping items or item element appearing in the ion file. Example: the following XML file:

<?xml version='1.0' encoding='UTF-8'?>
<items>
  <item>
    <job_title>BI Data Analyst</job_title>
    <avg_salary>836644.8</avg_salary>
  </item>
  <item>
    <job_title>ML Engineer</job_title>
    <avg_salary>679247.63</avg_salary>
  </item>
  <item>
    <job_title>Data Science Manager</job_title>
    <avg_salary>391371.17</avg_salary>
  </item>
  <item>
    <job_title>Business Data Analyst</job_title>
    <avg_salary>286000.0</avg_salary>
  </item>
  <item>
    <job_title>Data Scientist</job_title>
    <avg_salary>257422.32</avg_salary>
  </item>
  <item>
    <job_title>Computer Vision Engineer</job_title>
    <avg_salary>220583.33</avg_salary>
  </item>
  <item>
    <job_title>AI Scientist</job_title>
    <avg_salary>193666.67</avg_salary>
  </item>
  <item>
    <job_title>Applied Scientist</job_title>
    <avg_salary>190614.29</avg_salary>
  </item>
  <item>
    <job_title>Machine Learning Engineer</job_title>
    <avg_salary>175270.55</avg_salary>
  </item>
  <item>
    <job_title>Research Scientist</job_title>
    <avg_salary>161292.29</avg_salary>
  </item>
  <item>
    <job_title>Data Architect</job_title>
    <avg_salary>160283.26</avg_salary>
  </item>
  <item>
    <job_title>Data Engineer</job_title>
    <avg_salary>157510.03</avg_salary>
  </item>
  <item>
    <job_title>Machine Learning Scientist</job_title>
    <avg_salary>154638.64</avg_salary>
  </item>
  <item>
    <job_title>Research Engineer</job_title>
    <avg_salary>146618.11</avg_salary>
  </item>
  <item>
    <job_title>Analytics Engineer</job_title>
    <avg_salary>142703.15</avg_salary>
  </item>
  <item>
    <job_title>Data Science Consultant</job_title>
    <avg_salary>141937.5</avg_salary>
  </item>
  <item>
    <job_title>Data Analytics Manager</job_title>
    <avg_salary>141463.33</avg_salary>
  </item>
  <item>
    <job_title>Machine Learning Infrastructure Engineer</job_title>
    <avg_salary>141076.36</avg_salary>
  </item>
  <item>
    <job_title>BI Developer</job_title>
    <avg_salary>129846.15</avg_salary>
  </item>
  <item>
    <job_title>Data Specialist</job_title>
    <avg_salary>122083.33</avg_salary>
  </item>
  <item>
    <job_title>Data Manager</job_title>
    <avg_salary>120203.05</avg_salary>
  </item>
  <item>
    <job_title>Data Analyst</job_title>
    <avg_salary>116348.29</avg_salary>
  </item>
</items>

should be read by the XML reader as:

Screenshot 2024-02-05 at 1 41 38 PM
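To make the expected shape concrete, here is a minimal Python sketch of "one record per item". It is an illustration only, not the plugin's implementation: the helper name iter_records and the shortened SAMPLE document are assumptions, and the standard-library parser keeps values as strings rather than converting them to numbers.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML above (first two items only).
SAMPLE = """<items>
  <item>
    <job_title>BI Data Analyst</job_title>
    <avg_salary>836644.8</avg_salary>
  </item>
  <item>
    <job_title>ML Engineer</job_title>
    <avg_salary>679247.63</avg_salary>
  </item>
</items>"""


def iter_records(xml_text):
    """Yield one flat dict per <item>, dropping the <items>/<item> wrappers."""
    root = ET.fromstring(xml_text)
    for item in root.findall("item"):
        # Each child element of <item> becomes one field of the record.
        yield {child.tag: child.text for child in item}


records = list(iter_records(SAMPLE))
for record in records:
    print(record)
```

Each printed dict corresponds to one row that the reader is expected to emit, instead of a single record holding all rows.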

Actual Behaviour

The same XML file is read by the XML reader as a single record, with every row nested under item:

{"item":[{"avg_salary":836644.8,"job_title":"BI Data Analyst"},{"avg_salary":679247.63,"job_title":"ML Engineer"},{"avg_salary":391371.17,"job_title":"Data Science Manager"},{"avg_salary":286000,"job_title":"Business Data Analyst"},{"avg_salary":257422.32,"job_title":"Data Scientist"},{"avg_salary":220583.33,"job_title":"Computer Vision Engineer"},{"avg_salary":193666.67,"job_title":"AI Scientist"},{"avg_salary":190614.29,"job_title":"Applied Scientist"},{"avg_salary":175270.55,"job_title":"Machine Learning Engineer"},{"avg_salary":161292.29,"job_title":"Research Scientist"},{"avg_salary":160283.26,"job_title":"Data Architect"},{"avg_salary":157510.03,"job_title":"Data Engineer"},{"avg_salary":154638.64,"job_title":"Machine Learning Scientist"},{"avg_salary":146618.11,"job_title":"Research Engineer"},{"avg_salary":142703.15,"job_title":"Analytics Engineer"},{"avg_salary":141937.5,"job_title":"Data Science Consultant"},{"avg_salary":141463.33,"job_title":"Data Analytics Manager"},{"avg_salary":141076.36,"job_title":"Machine Learning Infrastructure Engineer"},{"avg_salary":129846.15,"job_title":"BI Developer"},{"avg_salary":122083.33,"job_title":"Data Specialist"},{"avg_salary":120203.05,"job_title":"Data Manager"},{"avg_salary":116348.29,"job_title":"Data Analyst"}]}
Screenshot 2024-02-05 at 1 43 14 PM
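Until the reader behaves as expected, the wrapped output can be flattened downstream. The following is a hedged sketch of that idea in Python. The unwrap helper and its key parameter are hypothetical names, and the single-item branch is an assumption about how a file with only one item might be serialized. Only the two-row input is taken from the output above.

```python
import json

# First two rows of the actual output shown above.
actual = (
    '{"item":[{"avg_salary":836644.8,"job_title":"BI Data Analyst"},'
    '{"avg_salary":679247.63,"job_title":"ML Engineer"}]}'
)


def unwrap(record, key="item"):
    """Return a list of row dicts from a record that wraps rows under `key`."""
    if key not in record:
        # Nothing to unwrap: treat the record itself as a single row.
        return [record]
    value = record[key]
    if isinstance(value, list):
        return value
    # Assumed case: a file with a single item yields a dict, not a list.
    return [value]


rows = unwrap(json.loads(actual))
for row in rows:
    # One JSON object per line, which is the shape the issue asks for.
    print(json.dumps(row))
```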

Steps To Reproduce

  1. Run the following flow:

    id: xml-writer
    namespace: company.team
    description: Analyse data salaries.
    tasks:
      - id: download_csv
        type: io.kestra.plugin.fs.http.Download
        description: Data Job salaries from 2020 to 2023 (source ai-jobs.net)
        uri: https://gist.githubusercontent.com/Ben8t/f182c57f4f71f350a54c65501d30687e/raw/940654a8ef6010560a44ad4ff1d7b24c708ebad4/salary-data.csv

      - id: average_salary_by_position
        type: io.kestra.plugin.jdbc.duckdb.Query
        inputFiles:
          data.csv: "{{ outputs.download_csv.uri }}"
        sql: |
          SELECT
            job_title,
            ROUND(AVG(salary),2) AS avg_salary
          FROM read_csv_auto('{{workingDir}}/data.csv', header=True)
          GROUP BY job_title
          HAVING COUNT(job_title) > 10
          ORDER BY avg_salary DESC;
        store: true
      - id: export_result
        type: "io.kestra.plugin.serdes.xml.XmlWriter"
        from: "{{ outputs.average_salary_by_position.uri }}"
      - id: xml_reader
        type: io.kestra.plugin.serdes.xml.XmlReader
        from: "{{ outputs.export_result.uri }}"
  2. Check the output of the xml_reader task.

Environment Information

Example flow

id: xml-writer
namespace: company.team
description: Analyse data salaries.
tasks:
  - id: download_csv
    type: io.kestra.plugin.fs.http.Download
    description: Data Job salaries from 2020 to 2023 (source ai-jobs.net)
    uri: https://gist.githubusercontent.com/Ben8t/f182c57f4f71f350a54c65501d30687e/raw/940654a8ef6010560a44ad4ff1d7b24c708ebad4/salary-data.csv

  - id: average_salary_by_position
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{workingDir}}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true
  - id: export_result
    type: "io.kestra.plugin.serdes.xml.XmlWriter"
    from: "{{ outputs.average_salary_by_position.uri }}"
  - id: xml_reader
    type: io.kestra.plugin.serdes.xml.XmlReader
    from: "{{ outputs.export_result.uri }}"