Open jdb78 opened 3 years ago
Let me first say how much I love pandera and appreciate the effort to maintain and write such a great package!
thanks @jdb78!
I think this would be a fantastic addition to pandera! I think before delving more into implementation details, I'd like to understand a little bit better what additional context would be helpful.
I understand the reasoning behind title and description: basically they provide additional annotation capabilities that give the user more human-readable metadata. However,
> Now, would it not be better for the user to understand why the column should pass this check?
implies something more that I'm not sure how adding a title and description to fields would solve... or am I missing something? A concrete code sketch/example with realistic (but toy) data often helps with these kinds of discussions... basically I think it would help illustrate what the limitations are of the current state of pandera and how the solution would address them.
Currently, pandera reports the failure cases that were the reason behind the failed check (in the case that the check returns a boolean vector), e.g. here, and also gives the user access to a dataframe of all the failure cases across all the checks via lazy validation, see here.
P.S. #329 adds support for field aliases via the class-based API
I am also interested in this enhancement. In my case, I'm looking to pair it with the to and from yaml capabilities of pandera. The extra metadata supporting a title and/or a description will turn the produced yaml files into a full-fledged data dictionary.
I would like to read something like the yaml file below (note that I also think adding a title/description to the DataFrame schema itself is a good idea).
    schema_type: dataframe
    version: {pa.__version__}
    title: Example Schema
    description: This is a description of the DataFrame schema ....
    columns:
      column_1:
        description: Column 1 represents ....
        pandas_dtype: int
        nullable: false
        checks:
          greater_than: 0
        allow_duplicates: true
        coerce: false
        required: true
        regex: false
      column_2:
        description: Column 2 represents ....
        pandas_dtype: float
        ....
I will have capacity to submit a pull request, but before I go about generating one, I wanted to pick your brain on what you were thinking.
As field attribute -
    ...
    columns:
      column_1:
        field:
          title: Column 1 Title
          description: Column 1 represents ....
        pandas_dtype: int
        nullable: false
        checks:
          greater_than: 0
        allow_duplicates: true
        coerce: false
        required: true
        regex: false
As standalone attributes -
    columns:
      column_1:
        title: Column 1 Title
        description: Column 1 represents ....
        pandas_dtype: int
        nullable: false
        checks:
          greater_than: 0
        allow_duplicates: true
        coerce: false
        required: true
        regex: false
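To compare the two layouts side by side, here is a minimal Python sketch that builds both as plain dictionaries, i.e. the structure that would end up in the yaml file. ColumnSpec, as_standalone and as_field_attribute are hypothetical names made up for illustration; none of them are part of pandera, and only two of the column settings are modelled.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ColumnSpec:
    """Hypothetical stand-in for a column's serializable settings."""
    title: Optional[str] = None
    description: Optional[str] = None
    pandas_dtype: str = "int"
    nullable: bool = False


def as_standalone(spec: ColumnSpec) -> dict:
    # title/description sit next to the other column attributes
    return asdict(spec)


def as_field_attribute(spec: ColumnSpec) -> dict:
    # title/description are grouped under a nested "field" key
    data = asdict(spec)
    field = {"title": data.pop("title"), "description": data.pop("description")}
    return {"field": field, **data}


spec = ColumnSpec(title="Column 1 Title", description="Column 1 represents ....")
standalone = as_standalone(spec)
nested = as_field_attribute(spec)
```

The standalone layout keeps the yaml flat and easy to scan, while the nested one keeps all purely descriptive metadata in a single place.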
Also, thank you for the time you've put toward pandera!
Dustin
Awesome, thanks @dustindall!
> Are you thinking about having a field object that holds both the title and description like pydantic or were you open to adding the title and description as attributes under the Column/Index?
I think all schemas and schema components should have a title and description, so that would be DataFrameSchema, SeriesSchema, Column, Index, and MultiIndex in the object-based API. title and description should probably also be available in model.SchemaModel and model.components.Field in the class-based API.
The implementation should be fairly straightforward since these attributes shouldn't really affect any part of the validation process. The three parts I can think of are:
1. SchemaError or SchemaErrors exception: the title and description should probably be part of the error message somehow.
2. __str__/__repr__ methods: sort of related to (1), but these dunder methods should also include title and description
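As a rough illustration of the error-message and __repr__ parts, here is a minimal sketch of how a component could carry title and description into both. SketchColumn, SketchSchemaError and check_positive are hypothetical stand-ins written for this example; they are not pandera's Column or SchemaError, and the message format is only an assumption.

```python
from typing import Optional


class SketchSchemaError(Exception):
    """Hypothetical stand-in for pandera's SchemaError."""


class SketchColumn:
    """Hypothetical column component; not pandera's Column."""

    def __init__(self, name: str, title: Optional[str] = None,
                 description: Optional[str] = None):
        self.name = name
        self.title = title
        self.description = description

    def __repr__(self) -> str:
        # include the human-readable metadata in the repr
        return (f"SketchColumn(name={self.name!r}, title={self.title!r}, "
                f"description={self.description!r})")

    def check_positive(self, values):
        # toy check: every value must be greater than 0
        failures = [v for v in values if v <= 0]
        if failures:
            label = self.title or self.name
            msg = f"column '{label}' failed check greater_than(0): {failures}"
            if self.description:
                # append the description so the user sees what the column means
                msg += f" ({self.description})"
            raise SketchSchemaError(msg)
        return values


col = SketchColumn("column_1", title="Column 1 Title",
                   description="Column 1 represents ....")
try:
    col.check_positive([1, -2, 3])
except SketchSchemaError as exc:
    error_message = str(exc)
```

Since the metadata is only read when building strings, the check logic itself stays unchanged.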
> I will have capacity to submit a pull request
That would be awesome! It might make sense if you added the title and description attributes to all relevant class definitions, and maybe part (3) as you're interested in this part of the enhancement. If you're up for it, (2) would also be fairly straightforward... (1) might be a little bit of a pain, as the way error-handling is done in the code-base is a little confusing and would stand to be improved from a developer perspective. Let me know what you feel like doing!
Also feel free to ping me here if you need any help re: setting up your dev environment.
Thanks @dustindall for being willing to submit a PR :) Some pointers for the SchemaModel compatibility:
- Field is just a function that generates a FieldInfo which constructs a Column/Index for SchemaModel.to_schema(). You could add title and description to Field/FieldInfo and pass that info when constructing the column/index.
- We need to add a title/description to Config
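The Field to FieldInfo to Column hand-off described above could look roughly like the following sketch. All names here (sketch_field, SketchFieldInfo, SketchColumn, to_column) are hypothetical and only mirror the shape of the flow; pandera's real Field and FieldInfo take many more arguments and have different method names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchColumn:
    """Hypothetical schema component that receives the metadata."""
    name: str
    title: Optional[str] = None
    description: Optional[str] = None


@dataclass
class SketchFieldInfo:
    """Hypothetical holder for what the user passed to the field function."""
    title: Optional[str] = None
    description: Optional[str] = None

    def to_column(self, name: str) -> SketchColumn:
        # forward the metadata when the column is constructed
        return SketchColumn(name, title=self.title, description=self.description)


def sketch_field(*, title: Optional[str] = None,
                 description: Optional[str] = None) -> SketchFieldInfo:
    # the field function only packages its arguments into a FieldInfo-like object
    return SketchFieldInfo(title=title, description=description)


info = sketch_field(title="Column 1 Title", description="Column 1 represents ....")
column = info.to_column("column_1")
```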
Maybe it's just me, but I feel like the distinction between DataFrameSchema.name (technical name) and DataFrameSchema.title (reporting name) is very thin. I can see how it could be useful but the difference should be clearly explained in the doc.
> but I feel like the distinction between DataFrameSchema.name (technical name) and DataFrameSchema.title (reporting name) is very thin
agreed. Is there a good reason (besides consistency) to have a title and name at the dataframe level? I think semantically they're different for columns/indexes since name maps onto the column/index name so having a human-readable title makes sense.
AFAIK, a dataframe only has a name attribute in a Groupby.apply function, where name is the discrete groupby value.
One solution would be to deprecate name in favor of title to represent a "human-readable label for the dataframe schema", and raise a DeprecationWarning when name is provided at the dataframe-level.
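A minimal sketch of that deprecation path, using only the standard warnings module. SketchDataFrameSchema is a hypothetical class, not pandera's DataFrameSchema, and falling back to name when no title is given is an assumption of this sketch rather than something decided in this thread.

```python
import warnings
from typing import Optional


class SketchDataFrameSchema:
    """Hypothetical schema class; not pandera's DataFrameSchema."""

    def __init__(self, name: Optional[str] = None, title: Optional[str] = None):
        if name is not None:
            warnings.warn(
                "'name' is deprecated at the dataframe level; use 'title' instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # assumption: reuse the old name as the title if none was given
            if title is None:
                title = name
        self.title = title


# record warnings so the deprecation can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    schema = SketchDataFrameSchema(name="my_schema")
```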
Hey @tfwillems I saw you submitted #440, let me know if you're willing to make a PR for this :)
It should be pretty straight-forward to add these properties to all the relevant objects, since they don't touch the validation parts of the library.
Is your feature request related to a problem? Please describe.
Errors can be difficult to debug without extra context. Say a check fails: the only information given to the user is that the column failed that particular check. Now, would it not be better for the user to understand why the column should pass this check? A very simple way to do this is to give more context, i.e. more information about what this column actually contains, where it comes from, etc. This can also help to actually understand why the issue appears and how it can be debugged.
Describe the solution you'd like
pydantic allows titles and descriptions, aliases, etc. in their Field object. I wonder if this could be added to pandera. Together with some smart sphinx plugin, you could even turn this information into a full and automatic data dictionary (see https://sphinx-pydantic.readthedocs.io/). Maybe it is possible to make Fields subclasses of the pydantic fields and schemas subclasses of BaseModel (not sure I am considering all challenges)?