unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Titles, aliases, and description for SchemaModels #331

Open jdb78 opened 3 years ago

jdb78 commented 3 years ago

Let me first say how much I love pandera and appreciate the effort to maintain and write such a great package!

Is your feature request related to a problem? Please describe.

Errors can be difficult to debug without extra context. Say a check fails: the only information the user gets is that the column failed that particular check. Now, would it not be better for the user to understand why the column should pass this check? A very simple way to do this is to give more context, i.e. more information about what the column actually contains, where it comes from, etc. This can also help the user understand why the issue appears and how it can be debugged.

Describe the solution you'd like

pydantic allows titles, descriptions, aliases, etc. in its Field object. I wonder if this could be added to pandera. Together with a smart sphinx plugin, you could even turn this information into a full, automatically generated data dictionary (see https://sphinx-pydantic.readthedocs.io/). Maybe it is possible to make Fields subclasses of the pydantic fields and schemas subclasses of BaseModel (not sure I am considering all the challenges)?

cosmicBboy commented 3 years ago

Let me first say how much I love pandera and appreciate the effort to maintain and write such a great package!

thanks @jdb78!

I think this would be a fantastic addition to pandera! I think before delving more into implementation details, I'd like to understand a little bit better what additional context would be helpful.

I understand the reasoning behind title and description: basically they provide additional annotation capabilities that give the user more human-readable metadata. However,

Now, would it not be better for the user to understand why the column should pass this check?

implies something more, and I'm not sure how adding a title and description to fields would solve it... or am I missing something? A concrete code sketch/example with realistic (but toy) data often helps with these kinds of discussions... basically I think it would help illustrate what the limitations of the current state of pandera are and how the solution would address them.

Currently, pandera reports the failure cases that caused a check to fail (when the check returns a boolean vector), and also gives the user access to a dataframe of all the failure cases across all the checks via lazy validation.

P.S. #329 adds support for field aliases via the class-based API

dustindall commented 3 years ago

I am also interested in this enhancement. In my case, I'm looking to pair it with the to and from yaml capabilities of pandera. The extra metadata supporting a title and/or a description will turn the produced yaml files into a full-fledged data dictionary.

I would like to be able to read something like the YAML file below (note that I also think adding a title/description to the DataFrame schema itself is a good idea).

schema_type: dataframe
version: {pa.__version__}
title: Example Schema
description: This is a description of the DataFrame schema ....
columns:
  column_1:
    description: Column 1 represents ....
    pandas_dtype: int
    nullable: false
    checks:
      greater_than: 0
    allow_duplicates: true
    coerce: false
    required: true
    regex: false
  column_2:
    description: Column 2 represents ....
    pandas_dtype: float
  ....
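
To make the data-dictionary idea a bit more concrete, here is a minimal sketch, assuming a YAML file like the one above has already been loaded into a plain Python dict. The function name `to_data_dictionary` is made up for illustration, and the `title`/`description` keys are the proposal under discussion, not something pandera is confirmed to write at this point in the thread.

```python
def to_data_dictionary(schema: dict) -> str:
    """Render a loaded schema dict as a plain-text data dictionary.

    Hypothetical helper: assumes the proposed `title`/`description` keys.
    """
    lines = [schema.get("title", "(untitled schema)")]
    if schema.get("description"):
        lines.append(schema["description"])
    for name, spec in schema.get("columns", {}).items():
        dtype = spec.get("pandas_dtype", "unknown")
        desc = spec.get("description", "(no description)")
        lines.append(f"- {name} ({dtype}): {desc}")
    return "\n".join(lines)


# Toy stand-in for the parsed YAML; the values are invented for illustration.
example_schema = {
    "schema_type": "dataframe",
    "title": "Example Schema",
    "description": "Toy schema for illustration.",
    "columns": {
        "column_1": {"description": "A positive count.", "pandas_dtype": "int"},
        "column_2": {"pandas_dtype": "float"},
    },
}

data_dictionary = to_data_dictionary(example_schema)
print(data_dictionary)
```

Columns without a description still show up in the output, which makes gaps in the documentation easy to spot.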

I will have capacity to submit a pull request, but before I go about generating one, I wanted to pick your brain on what you were thinking.

As field attribute -

...
columns:
  column_1:
    field:
      title: Column 1 Title
      description: Column 1 represents ....
    pandas_dtype: int
    nullable: false
    checks:
      greater_than: 0
    allow_duplicates: true
    coerce: false
    required: true
    regex: false

As standalone attributes -

columns:
  column_1:
    title: Column 1 Title
    description: Column 1 represents ....
    pandas_dtype: int
    nullable: false
    checks:
      greater_than: 0
    allow_duplicates: true
    coerce: false
    required: true
    regex: false
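
For comparison, a reader for these files could treat the two layouts as equivalent. The following is only a sketch over plain dicts (as if the YAML were already parsed); `column_metadata` is a hypothetical helper, not part of pandera, and the description strings are invented.

```python
def column_metadata(spec: dict) -> dict:
    """Return title/description regardless of which layout was used.

    Hypothetical helper: prefers the nested `field:` mapping and falls
    back to standalone keys on the column itself.
    """
    nested = spec.get("field", {})
    return {
        key: nested.get(key, spec.get(key))
        for key in ("title", "description")
    }


# Layout 1: title/description nested under a `field` attribute.
as_field_attribute = {
    "field": {
        "title": "Column 1 Title",
        "description": "Column 1 represents a count.",
    },
    "pandas_dtype": "int",
}
# Layout 2: title/description as standalone attributes.
as_standalone_attributes = {
    "title": "Column 1 Title",
    "description": "Column 1 represents a count.",
    "pandas_dtype": "int",
}

same = column_metadata(as_field_attribute) == column_metadata(as_standalone_attributes)
print(same)
```

Both layouts carry the same information, so the choice mostly comes down to whether the metadata should be grouped or sit next to the other column attributes.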

Also, thank you for the time you've put toward pandera!

Dustin

cosmicBboy commented 3 years ago

Awesome, thanks @dustindall!

Are you thinking about having a field object that holds both the title and description like pydantic or were you open to adding the title and description as attributes under the Column/Index?

I think all schemas and schema components should have a title and description, so that would be DataFrameSchema, SeriesSchema, Column, Index, and MultiIndex in the object-based API.

title and description should probably also be available in the model.SchemaModel and model.components.Field in the class-based API.

The implementation should be fairly straightforward since these attributes shouldn't really affect any part of the validation process. The three parts I can think of are:

  1. error reporting: when pandera raises a SchemaError or SchemaErrors exception, the title and description should probably be part of the error message somehow.
  2. __str__/__repr__ methods: sort of related to (1), but these dunder methods should also include title and description
  3. yaml/python script IO: as you mentioned, the new attributes should be added when calling the to_script and to_yaml methods.
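
The three parts above can be sketched with a small stand-in class. This is a rough illustration only: `ColumnSketch` and `error_message` are hypothetical names, not pandera's Column or its error-handling code, and the message format is just one possible choice.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ColumnSketch:
    """Hypothetical stand-in for a schema component; not pandera's Column."""

    name: str
    title: Optional[str] = None
    description: Optional[str] = None

    def error_message(self, check: str) -> str:
        # (1) error reporting: append the human-readable context if present
        message = f"column '{self.name}' failed check '{check}'"
        if self.title:
            message += f" [{self.title}]"
        if self.description:
            message += f": {self.description}"
        return message


column = ColumnSketch(
    "column_1", title="Column 1 Title", description="Must be positive."
)
print(column.error_message("greater_than(0)"))
# (2) the dataclass-generated __repr__ already includes title and description
print(repr(column))
# (3) IO: asdict() yields a plain mapping that a YAML writer could dump
print(asdict(column))
```

Since title and description default to None and are only read when building messages or serializing, nothing in this sketch touches the validation path.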

I will have capacity to submit a pull request

That would be awesome! It might make sense if you added the title and description attributes to all relevant class definitions, and maybe part (3), as you're interested in this part of the enhancement. If you're up for it, (2) would also be fairly straightforward... (1) might be a little bit of a pain, as the way error-handling is done in the code-base is a little confusing and would stand to be improved from a developer perspective. Let me know what you feel like doing!

Also feel free to ping me here if you need any help re: setting up your dev environment.

jeffzi commented 3 years ago

Thanks @dustindall for being willing to submit a PR :) Some pointers for the SchemaModel compatibility:

Maybe it's just me, but I feel like the distinction between DataFrameSchema.name (technical name) and DataFrameSchema.title (reporting name) is very thin. I can see how it could be useful, but the difference should be clearly explained in the docs.

cosmicBboy commented 3 years ago

but I feel like the distinction between DataFrameSchema.name (technical name) and DataFrameSchema.title (reporting name) is very thin

agreed. Is there a good reason (besides consistency) to have a title and name at the dataframe level? I think semantically they're different for columns/indexes since name maps onto the column/index name so having a human-readable title makes sense.

AFAIK, a dataframe only has a name attribute in a Groupby.apply function, where name is the discrete groupby value.

One solution would be to deprecate name in favor of title to represent a "human-readable label for the dataframe schema", and raise a DeprecationWarning when name is provided at the dataframe-level.
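
A minimal sketch of what that deprecation path could look like, using only the standard `warnings` module. `SchemaSketch` is a hypothetical stand-in, not the real DataFrameSchema constructor, and falling back to the old `name` value when no title is given is an assumption made here for illustration.

```python
import warnings
from typing import Optional


class SchemaSketch:
    """Hypothetical stand-in for a dataframe-level schema."""

    def __init__(self, title: Optional[str] = None, name: Optional[str] = None):
        if name is not None:
            warnings.warn(
                "`name` is deprecated at the dataframe level; use `title` instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Assumption: reuse the old value only if no title was given.
            if title is None:
                title = name
        self.title = title


# Record warnings so the deprecation can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    schema = SchemaSketch(name="orders")

print(schema.title)
print([w.category.__name__ for w in caught])
```

Existing code that passes name would keep working while the warning nudges users toward title.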

cosmicBboy commented 3 years ago

Hey @tfwillems I saw you submitted #440, let me know if you're willing to make a PR for this :)

It should be pretty straightforward to add these properties to all the relevant objects, since they don't touch the validation parts of the library.