mammoth-eu / mammoth-commons

Contains modules with the project's research results

data_custom_csv not working in toolkit #21

Open georgiosn opened 1 month ago

georgiosn commented 1 month ago

The latest data_custom_csv is not working in the toolkit.

Input: {"path":"https://archive.ics.uci.edu/static/public/222/bank+marketing.zip/bank/bank.csv","delimiter":";","numeric":["age","duration","campaign","pdays","previous"],"categorical":["job","marital","education","default","housing","loan","contact","poutcome"],"label":"y"}

Log from KFP:

Version: [2.2.0](https://www.github.com/kubeflow/pipelines/commit/dd59f48cdd0f6cd7fac40306277ef5f3dad6e263)
data-custom-csv
time="2024-10-21T12:19:56.742Z" level=info msg="capturing logs" argo=true
time="2024-10-21T12:19:56.832Z" level=info msg="capturing logs" argo=true
I1021 12:19:56.869499      32 launcher_v2.go:90] input ComponentSpec:{
  "inputDefinitions": {
    "parameters": {
      "data_custom_csv__params": {
        "parameterType": "STRUCT",
        "defaultValue": {
          "categorical": "None",
          "delimiter": ",",
          "label": "None",
          "numeric": "None",
          "path": "",
          "skip_invalid_lines": true
        },
        "isOptional": true
      }
    }
  },
  "outputDefinitions": {
    "artifacts": {
      "output": {
        "artifactType": {
          "schemaTitle": "system.Dataset",
          "schemaVersion": "0.0.1"
        }
      }
    },
    "parameters": {
      "Output": {
        "parameterType": "STRING"
      }
    }
  },
  "executorLabel": "exec-data-custom-csv"
}
I1021 12:19:56.871293      32 cache.go:116] Connecting to cache endpoint 10.43.199.76:8887
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[KFP Executor 2024-10-21 12:20:03,765 INFO]: --component_module_path is not specified. Looking for component `data_custom_csv` in config file `kfp_config.ini` instead
[KFP Executor 2024-10-21 12:20:03,768 INFO]: Loading KFP component "data_custom_csv" from catalogue/dataset_loaders/custom_csv.py (directory "catalogue/dataset_loaders" and module name "custom_csv")
[KFP Executor 2024-10-21 12:20:08,164 INFO]: generated new fontManager
[KFP Executor 2024-10-21 12:20:08,893 INFO]: Got executor_input:
{
    "inputs": {
        "parameterValues": {
            "data_custom_csv__params": {
                "categorical": [
                    "job",
                    "marital",
                    "education",
                    "default",
                    "housing",
                    "loan",
                    "contact",
                    "poutcome"
                ],
                "delimiter": ";",
                "label": "y",
                "numeric": [
                    "age",
                    "duration",
                    "campaign",
                    "pdays",
                    "previous"
                ],
                "path": "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip/bank/bank.csv"
            }
        }
    },
    "outputs": {
        "parameters": {
            "Output": {
                "outputFile": "/tmp/kfp/outputs/Output"
            }
        },
        "artifacts": {
            "output": {
                "artifacts": [
                    {
                        "type": {
                            "schemaTitle": "system.Dataset",
                            "schemaVersion": "0.0.1"
                        },
                        "uri": "minio://mlpipeline/v2/artifacts/tabular/bfcdb8e6-972f-496d-be8c-d53469967536/data-custom-csv/output"
                    }
                ]
            }
        },
        "outputFile": "/tmp/kfp_outputs/output_metadata.json"
    }
}
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/kfp/dsl/executor_main.py", line 109, in <module>
    executor_main()
  File "/usr/local/lib/python3.11/site-packages/kfp/dsl/executor_main.py", line 101, in executor_main
    output_file = executor.execute()
                  ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kfp/dsl/executor.py", line 361, in execute
    result = self.func(**func_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 17, in kfp_method
  File "/usr/local/src/kfp/components/catalogue/dataset_loaders/custom_csv.py", line 45, in data_custom_csv
    raw_data = fb.bench.loader.read_csv(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fairbench/bench/loader.py", line 71, in read_csv
    _extract_nested_zip(temp, extract_to)
  File "/usr/local/lib/python3.11/site-packages/fairbench/bench/loader.py", line 46, in _extract_nested_zip
    with zipfile.ZipFile(file, "r") as zfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1295, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'data/bank+marketing.zip'
Error downloading file: HTTP Error 502: Bad Gateway
data/bank+marketing.zip data/
I1021 12:20:10.328800      32 launcher_v2.go:151] publish success.
F1021 12:20:10.328902      32 main.go:49] failed to execute component: exit status 1
time="2024-10-21T12:20:10.842Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
time="2024-10-21T12:20:11.753Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
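
A note on reading the log above: the download failed with HTTP 502, yet execution continued and tried to open data/bank+marketing.zip, so the FileNotFoundError hides the real cause. Below is a minimal sketch of a fail-fast alternative, assuming the path is meant to be split at the `.zip` segment into an archive URL and a member path. `split_zip_url` and `fetch_archive` are hypothetical helpers written for illustration; they are not fairbench functions and may not match how its loader works.

```python
import os
import urllib.error
import urllib.request


def split_zip_url(url: str) -> tuple[str, str]:
    """Split 'https://host/a.zip/inner/file.csv' into (archive URL, member path)."""
    marker = ".zip/"
    index = url.find(marker)
    if index == -1:
        raise ValueError(f"no '.zip/' segment in {url!r}")
    cut = index + len(".zip")
    # Everything up to and including '.zip' is the archive; the rest is the member.
    return url[:cut], url[cut + 1:]


def fetch_archive(archive_url: str, target_dir: str = "data") -> str:
    """Download the archive and raise on failure instead of continuing (assumed behaviour)."""
    os.makedirs(target_dir, exist_ok=True)
    local_path = os.path.join(target_dir, os.path.basename(archive_url))
    try:
        urllib.request.urlretrieve(archive_url, local_path)
    except urllib.error.URLError as error:  # HTTPError (e.g. 502) is a subclass
        raise RuntimeError(f"could not download {archive_url}: {error}") from error
    return local_path
```

With this shape, a 502 would surface as a single error naming the URL, rather than as a missing local file further down the stack.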

With input: {"path":"http://host.k3d.internal:5000/bank/bank.csv","delimiter":";","numeric":["age","duration","campaign","pdays","previous"],"categorical":["job","marital","education","default","housing","loan","contact","poutcome"],"label":"y"}

KFP produces a different log; see the attached output_kfp.txt.

maniospas commented 1 month ago

Working on this. For the time being, is autocsv alright with path https://archive.ics.uci.edu/static/public/222/bank+marketing.zip/bank/bank.csv ? (It doesn't need the rest of the arguments.)

You may also encounter #20 next, so let's prioritize that.

georgiosn commented 1 month ago

An update here as well.

After fixing the pickle issue, autocsv works.

With input: {"path":"http://host.k3d.internal:5000/bank/bank.csv","delimiter":";","numeric":["age","duration","campaign","pdays","previous"],"categorical":["job","marital","education","default","housing","loan","contact","poutcome"],"label":"y"}

custom csv fails with the following error:

time="2024-10-22T12:02:04.957Z" level=info msg="capturing logs" argo=true
time="2024-10-22T12:02:05.063Z" level=info msg="capturing logs" argo=true
I1022 12:02:05.119151      32 launcher_v2.go:90] input ComponentSpec:{
  "inputDefinitions": {
    "parameters": {
      "data_custom_csv__params": {
        "parameterType": "STRUCT",
        "defaultValue": {
          "categorical": "None",
          "delimiter": ",",
          "label": "None",
          "numeric": "None",
          "path": "",
          "skip_invalid_lines": true
        },
        "isOptional": true
      }
    }
  },
  "outputDefinitions": {
    "artifacts": {
      "output": {
        "artifactType": {
          "schemaTitle": "system.Dataset",
          "schemaVersion": "0.0.1"
        }
      }
    },
    "parameters": {
      "Output": {
        "parameterType": "STRING"
      }
    }
  },
  "executorLabel": "exec-data-custom-csv"
}
I1022 12:02:05.121093      32 cache.go:116] Connecting to cache endpoint 10.43.199.76:8887
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[KFP Executor 2024-10-22 12:02:10,393 INFO]: --component_module_path is not specified. Looking for component `data_custom_csv` in config file `kfp_config.ini` instead
[KFP Executor 2024-10-22 12:02:10,395 INFO]: Loading KFP component "data_custom_csv" from catalogue/dataset_loaders/custom_csv.py (directory "catalogue/dataset_loaders" and module name "custom_csv")
[KFP Executor 2024-10-22 12:02:13,942 INFO]: Got executor_input:
{
    "inputs": {
        "parameterValues": {
            "data_custom_csv__params": {
                "categorical": [
                    "job",
                    "marital",
                    "education",
                    "default",
                    "housing",
                    "loan",
                    "contact",
                    "poutcome"
                ],
                "delimiter": ";",
                "label": "y",
                "numeric": [
                    "age",
                    "duration",
                    "campaign",
                    "pdays",
                    "previous"
                ],
                "path": "http://host.k3d.internal:5000/bank/bank.csv"
            }
        }
    },
    "outputs": {
        "parameters": {
            "Output": {
                "outputFile": "/tmp/kfp/outputs/Output"
            }
        },
        "artifacts": {
            "output": {
                "artifacts": [
                    {
                        "type": {
                            "schemaTitle": "system.Dataset",
                            "schemaVersion": "0.0.1"
                        },
                        "uri": "minio://mlpipeline/v2/artifacts/tabular4/420c40fc-05e7-45b0-a1e3-afdede73b29e/data-custom-csv/output"
                    }
                ]
            }
        },
        "outputFile": "/tmp/kfp_outputs/output_metadata.json"
    }
}
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/kfp/dsl/executor_main.py", line 109, in <module>
    executor_main()
  File "/usr/local/lib/python3.11/site-packages/kfp/dsl/executor_main.py", line 101, in executor_main
    output_file = executor.execute()
                  ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kfp/dsl/executor.py", line 361, in execute
    result = self.func(**func_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 17, in kfp_method
  File "/usr/local/src/kfp/components/catalogue/dataset_loaders/custom_csv.py", line 43, in data_custom_csv
    raw_data = fb.bench.loader.read_csv(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fairbench/bench/loader.py", line 78, in read_csv
    return pd.read_csv(path, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
             ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'data//host.k3d.internal:5000/bank/bank.csv'
Error downloading file: <urlopen error [Errno -2] Name or service not known>
I1022 12:02:14.465002      32 launcher_v2.go:151] publish success.
F1022 12:02:14.465164      32 main.go:49] failed to execute component: exit status 1
time="2024-10-22T12:02:15.070Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
time="2024-10-22T12:02:15.963Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
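
The `Name or service not known` line suggests that host.k3d.internal does not resolve from inside the pod, in which case the later FileNotFoundError would be a follow-on failure rather than the root cause. A quick check that could be run in the same container is sketched below, using only the standard library. `describe_url` and `resolves` are hypothetical names chosen for this example.

```python
import socket
from urllib.parse import urlparse


def describe_url(url: str) -> tuple[str, int]:
    """Return the (host, port) that a client would connect to for this URL."""
    parsed = urlparse(url)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname or "", parsed.port or default_port


def resolves(host: str, port: int) -> bool:
    """Report whether the name resolves from the current environment."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    host, port = describe_url("http://host.k3d.internal:5000/bank/bank.csv")
    print(host, port, "resolves" if resolves(host, port) else "does not resolve")
```

If the name does not resolve in the pod but does on the host machine, the problem is more likely in the cluster's DNS setup than in the component itself.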