Rishika Randev's Pandas Descriptive Script for IDS706 Week 3
☑️ Requirements (Mini Project 2 & Individual Project 1):
- Jupyter notebook performing descriptive statistics & tested with nbval plugin
- Python script for statistics and generating one data visualization
- Summary pdf or markdown file
- Makefile that installs required packages, formats, lints, and tests
- requirements.txt
- Python testing scripts
- Successful CI/CD badges
☑️ The Dataset
The dataset used in this project is a free, synthetic dataset from Kaggle called Student Performance Factors. Its columns describe factors that could affect student performance on exams, such as hours studied, hours slept, class attendance, tutoring sessions, and family income. The full list of columns can be viewed on the dataset's Kaggle page.
☑️ Steps
- Prepare the necessary configuration files: the Dockerfile, devcontainer.json, Makefile, requirements.txt, and main.yml for GitHub Actions integration. Ensure that requirements.txt lists all required packages (for example, matplotlib for visualization and pandas for data manipulation).
- Create a main.py script with two functions:
  - generate_summary_stats(csv): reads the CSV file passed to it into a pandas DataFrame and generates summary statistics (mean, median, mode, standard deviation) for its columns.
  - generate_data_viz(csv): reads the CSV file, creates a scatterplot of Hours Studied vs. Exam Score using matplotlib, and saves it as a PNG file (performance.png).
- Create a test_main.py script with two functions--
  - test_generate_summary_stats(csv): calls generate_summary_stats() using the student performance factors csv file to validate a few of the sample statistics generated by this function.
- Create a Jupyter Notebook with the same code as the main.py script to easily show the outputs of the descriptive statistics and data visualization.
- Using the main.yml file, set up a GitHub Actions workflow so that every push to the repository runs all of the Makefile commands, ensuring that new code is formatted with Black, linted with Ruff, and tested with Pytest. A PDF or Markdown summary file can also be generated through GitHub Actions, or created manually by converting the Jupyter notebook to HTML or PDF and pushing it to the repository.
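
As a rough illustration of the main.py step above, the two functions might look something like the sketch below. This is not the repository's actual code. It assumes the CSV uses the column names `Hours_Studied` and `Exam_Score`, restricts the statistics to numeric columns, and adds a hypothetical `out_path` parameter so the output location can be changed; none of these details are confirmed by this README.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script also runs in CI

import matplotlib.pyplot as plt
import pandas as pd


def generate_summary_stats(csv):
    """Read a CSV file and return mean, median, mode, and std per numeric column."""
    df = pd.read_csv(csv)
    # Assumption: only numeric columns are summarized.
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "mean": numeric.mean(),
            "median": numeric.median(),
            "mode": numeric.mode().iloc[0],  # first mode when several values tie
            "std": numeric.std(),  # sample standard deviation (ddof=1)
        }
    )


def generate_data_viz(csv, out_path="performance.png"):
    """Save a scatterplot of hours studied vs. exam score and return the file path."""
    df = pd.read_csv(csv)
    fig, ax = plt.subplots()
    # Assumption: these are the column names in the CSV.
    ax.scatter(df["Hours_Studied"], df["Exam_Score"])
    ax.set_xlabel("Hours Studied")
    ax.set_ylabel("Exam Score")
    ax.set_title("Hours Studied vs. Exam Score")
    fig.savefig(out_path)
    plt.close(fig)  # release the figure so repeated calls do not accumulate memory
    return out_path
```

Returning the statistics as a DataFrame, rather than only printing them, is one way to let a test such as test_generate_summary_stats compare individual values, for example `stats.loc["Exam_Score", "mean"]`, against expected numbers.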
☑️ Summary File
The outputs of the descriptive statistics and visualization showing Hours Studied vs. Exam Scores are captured in this pdf file.