This PR improves how we summarize DataFrames and Series so that GPT can better understand the structure and content of the data. The summarized format takes up roughly 1,600 tokens, depending largely on how big the fields in the sample are. Long text fields do pose a problem, but that is no different from the current implementation! Another thing that should help GPT: we state what level of sampling we're doing as well as the real size of the data.
Key Changes:
- Added summarize_dataframe and summarize_series functions to generate detailed summaries for DataFrames and Series respectively.
- The summary for DataFrames includes:
  - Number of rows and columns
  - Column information (name, data type, missing values, and percentage of missing values)
  - Basic summary statistics for numerical and categorical columns
  - A sample of the data (configurable number of rows and columns)
- The summary for Series includes:
  - Number of values
  - Data type
  - Missing values and their percentage
  - Summary statistics (based on the data type)
  - A sample of the data (configurable number of values)
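For reviewers who want a feel for the shape of these summaries, here is a minimal sketch of how the two functions could be written with pandas. It is not the code in this PR: the parameter names (sample_rows, sample_cols, sample_size), the exact section labels, and the choice to treat every non-numeric column as categorical are all assumptions made for illustration.

```python
import pandas as pd


def summarize_dataframe(df: pd.DataFrame, sample_rows: int = 5, sample_cols: int = 14) -> str:
    """Sketch of a plain-text DataFrame summary (hypothetical layout and parameters)."""
    n_rows, n_cols = df.shape

    # Per-column name, dtype, missing count, and missing percentage.
    col_info = pd.DataFrame({
        "Data Type": df.dtypes.astype(str),
        "Missing Values": df.isna().sum(),
        "Missing Values (%)": (df.isna().mean() * 100).round(2),
    })

    lines = [
        "Dataframe Summary",
        f"Number of Rows: {n_rows}",
        f"Number of Columns: {n_cols}",
        "Column Information",
        col_info.to_string(),
    ]

    # Assumption: numeric columns get describe() statistics,
    # everything else is treated as categorical.
    numeric = df.select_dtypes(include="number")
    if not numeric.empty:
        lines += ["Numerical Summary", numeric.describe().to_string()]
    categorical = df.select_dtypes(exclude="number")
    if not categorical.empty:
        lines += ["Categorical Summary", categorical.describe().to_string()]

    # State the sample size next to the real size so the reader knows what was cut.
    sample = df.iloc[:sample_rows, :sample_cols]
    lines += [f"Sample Data ({sample.shape[0]}x{sample.shape[1]})", sample.to_string()]
    return "\n".join(lines)


def summarize_series(series: pd.Series, sample_size: int = 5) -> str:
    """Sketch of a plain-text Series summary (hypothetical layout and parameters)."""
    n = len(series)
    missing = int(series.isna().sum())
    pct = missing / n * 100 if n else 0.0
    sample = series.head(sample_size)
    lines = [
        "Series Summary",
        f"Number of Values: {n}",
        f"Data Type: {series.dtype}",
        f"Missing Values: {missing} ({pct:.2f}%)",
        "Summary Statistics",
        series.describe().to_string(),  # describe() adapts to the dtype
        f"Sample Data ({len(sample)} of {n})",
        sample.to_string(),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = pd.DataFrame({"age": [30, None, 41], "city": ["Oslo", "Lima", None]})
    print(summarize_dataframe(demo, sample_rows=2))
    print(summarize_series(demo["age"]))
```

Putting the sample dimensions in the section header, next to the full row and column counts, is what tells GPT how much of the data it is actually seeing.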
Example output
Dataframe Summary
Number of Rows: 806
Number of Columns: 14
Column Information
Numerical Summary
Categorical Summary
Sample Data (5x14)