ouseful-PR / nbval

A py.test plugin to validate Jupyter notebooks

Add support to compare pandas dataframe structures #1

Open psychemedia opened 2 years ago

psychemedia commented 2 years ago

Some cells may return a dataframe whose structure is fixed but whose content varies between runs. It would be useful to be able to check that the structure of a returned dataframe matches the structure of the dataframe returned on a previous run.
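As a rough illustration of what "matching structure" could mean, the sketch below compares two dataframes by shape, column names and dtypes while ignoring the cell values. The helper names `frame_structure` and `same_structure` are hypothetical and are not part of nbval or pandas; which properties count as "structure" is an assumption made here.

```python
import pandas as pd


def frame_structure(df):
    """Summarise a dataframe's structure, ignoring the cell values."""
    return {
        "shape": df.shape,
        "columns": df.columns.to_list(),
        "dtypes": [str(dtype) for dtype in df.dtypes],
    }


def same_structure(left, right):
    """Return True when both dataframes share shape, column names and dtypes."""
    return frame_structure(left) == frame_structure(right)


# Two runs with the same layout but different content (illustrative values).
first_run = pd.DataFrame({"course_code": ["TM351", "TU100"], "points": [30, 60]})
second_run = pd.DataFrame({"course_code": ["M269", "TM351"], "points": [30, 30]})

# A run whose column names differ from the reference run.
renamed_run = second_run.rename(columns={"points": "credits"})

print(same_structure(first_run, second_run))   # structures match
print(same_structure(first_run, renamed_run))  # column names differ
```

A stricter or looser check could be obtained by adding or dropping keys in the summary dictionary, for example ignoring dtypes when only the column layout matters.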

For example, consider the dataframe:

import pandas as pd

course_dict = {'course_code': ['TM351', 'TU100', 'M269'],
              'points': [30, 60, 30],
              'study_level': ['3', '1', '2']
              }

course_df = pd.DataFrame(course_dict)
course_df

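A dataframe displayed in a notebook is stored in the cell output as HTML. The sample below comes from a different, smaller dataframe (columns alpha, num1 and num2) and is assigned to df_html:

df_html = """
"<div>\n",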
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>alpha</th>\n",
       "      <th>num1</th>\n",
       "      <th>num2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
"""

We can read this HTML back into a dataframe and then check properties such as its size, shape and column names:

parsed_df = pd.read_html(df_html)[0]  # parse the HTML once and reuse the result
parsed_df.size, parsed_df.shape, parsed_df.columns.to_list()
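`pd.read_html` relies on an HTML parsing backend such as lxml being installed. Where that is not wanted, the same structural information could be pulled out of the stored HTML with the standard library alone. The following is a minimal sketch, not an nbval feature: `TableStructureParser` and `html_table_structure` are hypothetical names, and the code assumes the output contains a single table laid out the way pandas renders it (column labels in `<thead>`, one `<tr>` per data row in `<tbody>`).

```python
from html.parser import HTMLParser


class TableStructureParser(HTMLParser):
    """Collect header labels and count data rows of an HTML table."""

    def __init__(self):
        super().__init__()
        self.columns = []
        self.n_rows = 0
        self._in_thead = False
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "th" and self._in_thead:
            # Start a new header label; an empty <th> stays as "".
            self._in_th = True
            self.columns.append("")
        elif tag == "tr" and not self._in_thead:
            # Rows outside <thead> are counted as data rows.
            self.n_rows += 1

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.columns[-1] += data.strip()


def html_table_structure(html):
    """Return (number of data rows, list of header labels)."""
    parser = TableStructureParser()
    parser.feed(html)
    return parser.n_rows, parser.columns


# A cleaned-up copy of the sample table, without the notebook JSON quoting.
sample_html = """
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>alpha</th>
      <th>num1</th>
      <th>num2</th>
    </tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>a</td><td>1</td><td>2</td></tr>
    <tr><th>1</th><td>b</td><td>2</td><td>3</td></tr>
  </tbody>
</table>
"""

print(html_table_structure(sample_html))
```

Two outputs could then be treated as structurally equal when `html_table_structure` returns the same tuple for both, regardless of the values in the cells. The empty first label corresponds to the index column that pandas renders without a header.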