rgbkrk / genai

What if GPT could help you notebook?
BSD 3-Clause "New" or "Revised" License
351 stars 36 forks

Enhanced Summarization for DataFrames and Series #75

Closed rgbkrk closed 1 year ago

rgbkrk commented 1 year ago

This PR improves how we summarize DataFrames and Series so that GPT can better understand the structure and content of the data. The summarized format takes up roughly 1,600 tokens, depending largely on how big the fields in the sample are. Long text fields still pose a problem, but that is no different from the current implementation! One more thing that will hopefully help GPT: we state both the level of sampling we're doing and the real size of the data.
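
To illustrate the idea of reporting the real size alongside the sample size, here is a minimal sketch. It is not the code in this PR: `sized_sample`, `max_rows`, and `seed` are hypothetical names chosen for the example, and the toy DataFrame is made up.

```python
import pandas as pd


def sized_sample(df: pd.DataFrame, max_rows: int = 5, seed: int = 0) -> str:
    """Describe the real size of df, then show a small labeled sample.

    Hypothetical helper for illustration only.
    """
    # Never ask for more rows than the frame actually has.
    sample = df.sample(n=min(max_rows, len(df)), random_state=seed)
    lines = [
        f"Number of Rows: {len(df)}",          # real size
        f"Number of Columns: {df.shape[1]}",   # real size
        f"Sample Data ({sample.shape[0]}x{sample.shape[1]})",  # sampling level
        sample.to_string(),
    ]
    return "\n".join(lines)


# Toy data, invented for this example.
df = pd.DataFrame({"name": list("abcdefg"), "score": range(7)})
print(sized_sample(df))
```

Stating both sizes lets the model tell a five-row sample apart from a five-row dataset.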

Key Changes:


Example output

Dataframe Summary

Number of Rows: 806

Number of Columns: 14

Column Information

| | Column Name | Data Type | Missing Values | % Missing |
|---|---|---|---|---|
| 0 | w3alcd | object | 0 | 0 |
| 1 | doing_business_as | object | 0 | 0 |
| 2 | restaurant_address | object | 0 | 0 |
| 3 | inspection_date | datetime64[ns] | 117 | 14.5161 |
| 4 | major_violation_improper_holding_temperature | int64 | 0 | 0 |
| 5 | minor_violation_improper_holding_temperature | int64 | 0 | 0 |
| 6 | major_violation_inadequate_cooking | int64 | 0 | 0 |
| 7 | minor_violation_inadequate_cooking | int64 | 0 | 0 |
| 8 | major_violation_personal_hygiene | int64 | 0 | 0 |
| 9 | minor_violation_personal_hygiene | int64 | 0 | 0 |
| 10 | major_violation_contaminated_equipment | int64 | 0 | 0 |
| 11 | minor_violation_contaminated_equipment | int64 | 0 | 0 |
| 12 | major_violation_unsafe_food_source | int64 | 0 | 0 |
| 13 | minor_violation_unsafe_food_source | int64 | 0 | 0 |
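
For readers wondering how a table like Column Information could be computed, the following is a rough sketch with pandas. It is an assumption about the approach, not the PR's actual implementation; `column_information` is a hypothetical name and the toy data is invented. The last line checks the arithmetic behind the `inspection_date` row: 117 missing values out of 806 rows.

```python
import pandas as pd


def column_information(df: pd.DataFrame) -> pd.DataFrame:
    """Build one row per column: dtype, missing count, and percent missing.

    Hypothetical helper for illustration only.
    """
    missing = df.isna().sum()  # missing values per column
    return pd.DataFrame(
        {
            "Column Name": df.columns,
            "Data Type": df.dtypes.astype(str).values,
            "Missing Values": missing.values,
            "% Missing": (missing / len(df) * 100).values,
        }
    )


# Toy data, invented for this example: one of four dates is missing.
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "inspection_date": pd.to_datetime(
            ["2019-02-28", None, "2016-05-18", "2016-12-20"]
        ),
    }
)
info = column_information(toy)
print(info.to_string())

# Arithmetic for the inspection_date row in the table above.
pct_missing = round(117 / 806 * 100, 4)
print(pct_missing)
```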

Numerical Summary

| | Column Name | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 0 | major_violation_improper_holding_temperature | 806 | 0.0111663 | 0.105144 | 0 | 0 | 0 | 0 | 1 |
| 1 | minor_violation_improper_holding_temperature | 806 | 0.10794 | 0.310498 | 0 | 0 | 0 | 0 | 1 |
| 2 | major_violation_inadequate_cooking | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | minor_violation_inadequate_cooking | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | major_violation_personal_hygiene | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | minor_violation_personal_hygiene | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | major_violation_contaminated_equipment | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | minor_violation_contaminated_equipment | 806 | 0.0694789 | 0.254425 | 0 | 0 | 0 | 0 | 1 |
| 8 | major_violation_unsafe_food_source | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | minor_violation_unsafe_food_source | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
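
The statistics in the Numerical Summary (count, mean, std, min, quartiles, max) are the ones pandas' `DataFrame.describe()` reports for numeric columns, shown with one row per column. A minimal sketch of producing that shape follows. This assumes the summary is built from `describe()`, which the PR text does not confirm; `numerical_summary` is a hypothetical name and the toy data is invented.

```python
import pandas as pd


def numerical_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Describe only the numeric columns, one row per column.

    Hypothetical helper for illustration only.
    """
    numeric = df.select_dtypes(include="number")
    # describe() puts statistics on the rows; transpose so each
    # original column becomes a row, as in the table above.
    return numeric.describe().T


# Toy data, invented for this example: a 0/1 violation flag.
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "minor_violation": [0, 0, 0, 1],
    }
)
summary = numerical_summary(toy)
print(summary.to_string())
```

For a 0/1 flag column, the mean is the fraction of rows where the flag is set, which is how the small means in the table above can be read.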

Categorical Summary

| | Column Name | count | unique | top | freq | first | last |
|---|---|---|---|---|---|---|---|
| 0 | inspection_date | 689 | 550 | 2018-10-10 00:00:00 | 4 | 2011-01-23 00:00:00 | 2023-01-17 00:00:00 |

Sample Data (5x14)

| | doing_business_as | w3alcd | inspection_date | minor_violation_unsafe_food_source | restaurant_address | minor_violation_inadequate_cooking | major_violation_personal_hygiene | major_violation_contaminated_equipment | major_violation_improper_holding_temperature | minor_violation_personal_hygiene | major_violation_unsafe_food_source | minor_violation_contaminated_equipment | minor_violation_improper_holding_temperature | major_violation_inadequate_cooking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | AFC SUSHI @ SAFEWAY #691 | FA0001354 | 2019-02-28 00:00:00 | 0 | 1444 SHATTUCK AVE, BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 769 | VIK'S CHAAT CORNER | FA0000567 | 2016-05-18 00:00:00 | 0 | 2390 FOURTH ST , BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 325 | GYPSY'S TRATTORIA ITALIANO | FA0000674 | 2016-12-20 00:00:00 | 0 | 2519-A DURANT AVE, BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 220 | CVS PHARMACY | FA0001247 | 2018-10-26 00:00:00 | 0 | 2655 TELEGRAPH AVE, BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 419 | LE BATEAU IVRE/DRUNKEN BOAT | FA0000547 | 2022-08-26 00:00:00 | 0 | 2629 TELEGRAPH AVE , BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |