ml6team / fondant

Production-ready data processing made easy and shareable
https://fondant.ai/en/stable/
Apache License 2.0

Document expected schema for generic components #795

Open picousse opened 8 months ago

picousse commented 8 months ago

Hi, here are some minor things I ran into while running a pipeline locally.

Current code:

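import pyarrow as pa
from fondant.pipeline import Pipeline
from fondant.pipeline.runner import DockerRunner

# Pipeline artifacts are written under base_path.
pipeline = Pipeline(
    name="protein_pipeline",
    base_path="./data",
)

# Read the source dataset with the generic load_from_parquet component.
dataset = pipeline.read(
    "load_from_parquet",
    arguments={
        "dataset_uri": "/data/proteins.parquet",
    },
)

# Run the pipeline locally with Docker.
runner = DockerRunner()
runner.run(input=pipeline)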

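What was unclear to me:

  • Data path: this is the path inside Docker (/data/...). This is unclear based on the documentation (or I might have missed it).
  • For load_from_parquet, the produces values are crucial: there is no type inference.

I read https://fondant.ai/en/latest/pipeline/ and neither point was clear to me from it.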

picousse commented 8 months ago

Also, the data types in consumes have to be pyarrow data types. This was not clear to me from https://github.com/ml6team/fondant/tree/main/components/load_from_parquet

RobbeSneyders commented 8 months ago

  • Data path: this is the path inside Docker (/data/...). This is unclear based on the documentation (or I might have missed it).

This is the path on your local (or remote) file system, which will be mounted into the Docker container. Is that how you understood it, or did you understand it differently?
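
To make the host-versus-container distinction concrete, here is a minimal sketch of how a Docker bind mount maps a host path to the path seen inside the container. It illustrates the general "host_dir:container_dir" convention only, not Fondant's own code; the helper name container_path and the example directories are hypothetical.

```python
from pathlib import PurePosixPath


def container_path(host_path: str, volume: str) -> str:
    """Translate a host path into the path visible inside the container.

    `volume` uses the Docker bind-mount form "host_dir:container_dir",
    optionally followed by ":options" (for example ":ro").
    """
    parts = volume.split(":")
    host_dir, container_dir = parts[0], parts[1]
    # Raises ValueError if host_path does not live under the mounted directory.
    relative = PurePosixPath(host_path).relative_to(host_dir)
    return str(PurePosixPath(container_dir) / relative)


# Hypothetical mount: a project data directory on the host appears as /data.
print(
    container_path(
        "/home/me/project/data/proteins.parquet",
        "/home/me/project/data:/data",
    )
)
```

A file that is not under the mounted host directory has no container path at all, which is why the helper raises instead of guessing.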

  • For load_from_parquet, the produces values are crucial: there is no type inference.

Indeed, I think this is documented both in our general documentation and the component documentation.

  • Also datatypes have to be pyarrow datatypes in the consume. This was not clear to me based on

This is indeed not clearly documented in the component documentation. Would be good to add.