
Using ChatGPT to generate a Kedro pipeline for a Kaggle dataset #90

Open · yetudada opened this issue 1 year ago

yetudada commented 1 year ago

Description

This task documents a blog post idea we could explore: using ChatGPT to generate a Kedro pipeline for a Kaggle dataset. Here is a prompt I've used to do this:

Can you please create a step-by-step tutorial that uses Kedro to create a data preprocessing pipeline? 

Additional requirements: 
- You must use the `kedro pipeline create` command to create the files for the pipeline instead of recommending that users create their own `nodes.py` and `pipeline.py` files.
- I want to see example code for the Python functions
- You must also recommend use of a `requirements.txt` file
- You do not need to create a virtual environment
- Use Kedro 0.18.6

The dataset to be used is on Kaggle: https://www.kaggle.com/datasets/datazng/telecom-company-churn-rate-call-center-data
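
For context, on Kedro 0.18.x the `kedro pipeline create` command scaffolds the `nodes.py` and `pipeline.py` files under `src/<package_name>/pipelines/<pipeline_name>/`. The generated `pipeline.py` looks roughly like this (a sketch of the 0.18.6 template from memory, not a verbatim copy):

```python
"""
This is a boilerplate pipeline 'data_preprocessing'
generated using Kedro 0.18.6
"""
from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    # The scaffold starts empty; functions go in nodes.py and get wired up here.
    return pipeline([])
```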
astrojuanlu commented 3 months ago

Output from ChatGPT 4:

### Step-by-Step Tutorial: Creating a Data Preprocessing Pipeline with Kedro

In this tutorial, we will walk through the process of setting up a data preprocessing pipeline using Kedro, a development workflow framework that helps structure reproducible, scalable, and maintainable data science code. We will use a telecom company churn rate dataset from Kaggle.

#### Prerequisites:

- Python installed on your system
- Kedro version 0.18.6
- Access to the dataset from Kaggle

#### Step 1: Install Kedro

First, ensure you have Kedro installed. You can install Kedro using pip:

```bash
pip install kedro==0.18.6
```

#### Step 2: Create a New Kedro Project

Create a new directory for your project and navigate into it:

```bash
mkdir kedro_telecom
cd kedro_telecom
```

Now, create a new Kedro project:

```bash
kedro new --starter=pandas-iris
```

Follow the prompts to set up the project. For this tutorial, you can name the project "Telecom Churn Analysis".

#### Step 3: Set Up Your `requirements.txt`

Create a `requirements.txt` file in the root of your project directory and add the following dependencies:

```
kedro==0.18.6
pandas
scikit-learn
```

Install the requirements:

```bash
pip install -r requirements.txt
```

#### Step 4: Download and Configure the Dataset

Download the dataset from Kaggle and place it in the `data/01_raw` directory of your Kedro project.

#### Step 5: Create a Pipeline

Use Kedro CLI to create a new pipeline for data preprocessing:

```bash
kedro pipeline create data_preprocessing
```

This command creates a new pipeline in `src/telecom_churn_analysis/pipelines/data_preprocessing`.

#### Step 6: Define Nodes in the Pipeline

Navigate to `src/telecom_churn_analysis/pipelines/data_preprocessing/nodes.py` and define the functions for your data preprocessing steps. For example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_data(data_path: str) -> pd.DataFrame:
    """Loads data from the raw data directory."""
    return pd.read_csv(data_path)

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Cleans the DataFrame by handling missing values, etc."""
    df.dropna(inplace=True)
    return df

def scale_features(df: pd.DataFrame) -> pd.DataFrame:
    """Scales numerical features using StandardScaler."""
    scaler = StandardScaler()
    numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
    df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
    return df
```

#### Step 7: Build the Pipeline

In `src/telecom_churn_analysis/pipelines/data_preprocessing/pipeline.py`, import the nodes and build the pipeline:

```python
from kedro.pipeline import Pipeline, node

from .nodes import load_data, clean_data, scale_features

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=load_data,
                inputs="params:raw_data_path",
                outputs="preprocessed_data",
                name="load_data_node"
            ),
            node(
                func=clean_data,
                inputs="preprocessed_data",
                outputs="cleaned_data",
                name="clean_data_node"
            ),
            node(
                func=scale_features,
                inputs="cleaned_data",
                outputs="scaled_data",
                name="scale_features_node"
            ),
        ]
    )
```

#### Step 8: Configure Data Paths and Parameters

In `conf/base/parameters.yml`, specify the path to your raw data:

```yaml
raw_data_path: data/01_raw/your_dataset.csv
```

#### Step 9: Run the Pipeline

Navigate back to the root of your Kedro project and run the pipeline:

```bash
kedro run
```

#### Step 10: Review the Results

Check the outputs in the `data` directory under the appropriate processing folder to see the results of your pipeline.
This tutorial outlines the basic setup and steps to get started with a data preprocessing pipeline using Kedro. You can expand this pipeline by adding more complex data transformations, integrating different data sources, or creating additional pipelines for data analysis and modeling.
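
Side note: with the 0.18.x project template, `find_pipelines()` auto-discovers the pipeline created in Step 5, and you can run it in isolation with the standard `--pipeline` flag, which is handy when checking its outputs:

```bash
kedro run --pipeline data_preprocessing
```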

Looks like, to this day, ChatGPT still doesn't understand Kedro very well.
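
For example, idiomatic Kedro wouldn't use a `load_data` node that reads a file path from parameters; the raw CSV would be declared in `conf/base/catalog.yml` and loaded by the framework, and the intermediate outputs above would only persist if they were also declared there. A minimal sketch of the more idiomatic version, assuming a hypothetical dataset name `telecom_churn` and hypothetical filepaths:

```yaml
# conf/base/catalog.yml -- dataset names and filepaths are illustrative assumptions
telecom_churn:
  type: pandas.CSVDataSet
  filepath: data/01_raw/telecom_churn.csv

cleaned_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/cleaned_data.csv

scaled_data:
  type: pandas.CSVDataSet
  filepath: data/03_primary/scaled_data.csv
```

```python
# src/telecom_churn_analysis/pipelines/data_preprocessing/pipeline.py
# Sketch: the load_data node disappears entirely; the catalog supplies the
# input DataFrame and persists every output declared above.
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import clean_data, scale_features


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=clean_data,
                inputs="telecom_churn",  # loaded by the catalog, not by a node
                outputs="cleaned_data",
                name="clean_data_node",
            ),
            node(
                func=scale_features,
                inputs="cleaned_data",
                outputs="scaled_data",
                name="scale_features_node",
            ),
        ]
    )
```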