unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

How to Avoid Pandera Doc Injection? #1564

Open kernelpernel opened 1 month ago

kernelpernel commented 1 month ago

Question about pandera

We use pandera at work for our dataframe schemas, and Sphinx to generate docs for our Python libraries. Unfortunately, the documentation for pandera.api.pandas.container.DataFrameSchema keeps getting injected into our Sphinx-generated documentation. As a workaround, we have made most of our schema classes private to keep them out of the generated docs.

We have also tried writing decorators for our own classes to sanitize the docs, but this has been challenging as well. Looking at the entire attribute stack for a class that inherits from pa.DataFrameSchema, most of the __doc__ attributes appear empty. When we try to scrub docs from pandera modules, we lose all of our own documentation and are left with only the cat, dog, duck example from pa.DataFrameSchema.
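A small sketch of how one might find the attribute that actually carries the extra text. It walks the MRO and reports every class whose own __doc__, __init__, or __new__ has a non-empty docstring. The Base and Child classes are hypothetical stand-ins rather than pandera classes, and docstring_sources is a made-up helper name; the same function could be pointed at a real schema class.

```python
def docstring_sources(cls):
    """Return (class name, attribute name) pairs that define a docstring.

    Only attributes defined directly on each class in the MRO are checked,
    so an inherited docstring is attributed to the class that owns it.
    """
    found = []
    for klass in cls.__mro__:
        if klass is object:
            continue  # skip the built-in docstrings on object
        namespace = vars(klass)
        if namespace.get("__doc__"):
            found.append((klass.__name__, "__doc__"))
        for name in ("__init__", "__new__"):
            attr = namespace.get(name)
            # __new__ is stored wrapped in staticmethod; unwrap it if so
            func = getattr(attr, "__func__", attr)
            if func is not None and getattr(func, "__doc__", None):
                found.append((klass.__name__, name))
    return found


# Hypothetical stand-ins for a library base class and a user schema.
class Base:
    def __new__(cls, *args, **kwargs):
        """Long docstring that lives on __new__, not on the class."""
        return super().__new__(cls)


class Child(Base):
    """Schema to demonstrate doc injection."""


print(docstring_sources(Child))
# [('Child', '__doc__'), ('Base', '__new__')]
```

In this toy case the class-level __doc__ of Base is empty, which would match the "most attributes appear empty" observation, while the text sits on a method instead.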

Is this a pandera bug? If not, is there a way that we could suppress the doc injection without removing our own documentation?

TL;DR: pandera is injecting documentation into our own documentation (especially from pandera.api.pandas.container.DataFrameSchema). Is there a way to prevent this from happening?

cosmicBboy commented 1 month ago

Thanks for bringing this up @kernelpernel. Would it be possible to provide some screenshots and a minimal reproducible example? I don't quite understand what you mean by docs being injected.

kernelpernel commented 1 month ago

No screenshots due to possible IP conflicts, but I put together this quick example. If I write this class:

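import pandera as pa

# sc is not defined in this snippet; it appears to be an in-house helper
# module providing column type aliases and field factories.
class ExampleSchema(pa.SchemaModel):
    """Schema to demonstrate doc injection."""

    Column1: sc.Integer = sc.IntegerF()
    Column2: sc.Str = sc.StrF()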

I get this output for the sphinx-generated docs:

class jane_dev.options.utils.doc_testing.ExampleSchema(*args, **kwargs)

   Bases: "pandera.api.pandas.model.DataFrameModel"

   Schema to demonstrate doc injection.

   Check if all columns in a dataframe have a column in the Schema.

   Parameters:
      * **check_obj** (*pd.DataFrame*) -- the dataframe to be
        validated.

      * **head** -- validate the first n rows. Rows overlapping with
        "tail" or "sample" are de-duplicated.

      * **tail** -- validate the last n rows. Rows overlapping with
        "head" or "sample" are de-duplicated.

      * **sample** -- validate a random sample of n rows. Rows
        overlapping with "head" or "tail" are de-duplicated.

      * **random_state** -- random seed for the "sample" argument.

      * **lazy** -- if True, lazily evaluates dataframe against all
        validation checks and raises a "SchemaErrors". Otherwise,
        raise "SchemaError" as soon as one occurs.

      * **inplace** -- if True, applies coercion to the object of
        validation, otherwise creates a copy of the data.

   Returns:
      validated "DataFrame"

   Raises:
      **SchemaError** -- when "DataFrame" violates built-in or custom
      checks.

   Example:
   Calling "schema.validate" returns the dataframe.

   >>> import pandas as pd
   >>> import pandera as pa
   >>>
   >>> df = pd.DataFrame({
   ...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
   ...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"]
   ... })
   >>>
   >>> schema_withchecks = pa.DataFrameSchema({
   ...     "probability": pa.Column(
   ...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
   ...
   ...     # check that the "category" column contains a few discrete
   ...     # values, and the majority of the entries are dogs.
   ...     "category": pa.Column(
   ...         str, [
   ...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
   ...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
   ...         ]),
   ... })
   >>>
   >>> schema_withchecks.validate(df)[["probability", "category"]]
      probability category
   0         0.10      dog
   1         0.40      dog
   2         0.52      cat
   3         0.23     duck
   4         0.80      dog
   5         0.76      dog

   Column1: pandera.typing.pandas.Series[pandas.core.arrays.integer.Int64Dtype] = 'Column1'

   Column2: pandera.typing.pandas.Series[str] = 'Column2'

   class Config

      Bases: "pandera.api.pandas.model_config.BaseConfig"

      name: str | None = 'ExampleSchema'

         name of schema

I would expect to see only this:

class jane_dev.options.utils.doc_testing.ExampleSchema(*args, **kwargs)

   Bases: "pandera.api.pandas.model.DataFrameModel"

   Schema to demonstrate doc injection.

   Column1: pandera.typing.pandas.Series[pandas.core.arrays.integer.Int64Dtype] = 'Column1'

   Column2: pandera.typing.pandas.Series[str] = 'Column2'

The injected text appears to be the same as the docs here: Pandera Docs

kernelpernel commented 1 month ago

Thanks for the quick response @cosmicBboy !

cosmicBboy commented 1 month ago

It's probably because of the __new__ method: https://github.com/unionai-oss/pandera/blob/main/pandera/api/dataframe/model.py#L127-L132

Can you try overriding that method and seeing if it still happens?
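A minimal sketch of what such an override could look like, using a hypothetical Base class in place of the pandera base (untested against pandera itself). The override only delegates to the parent, so behavior should stay the same. It sets an explicit empty docstring rather than leaving one out, because inspect.getdoc falls back to an inherited docstring when __doc__ is None, and doc tools that inherit docstrings may do the same.

```python
import inspect


# Hypothetical stand-in for the library base class whose __new__ is
# suspected of carrying the long docstring.
class Base:
    def __new__(cls, *args, **kwargs):
        """Long docstring that a doc tool may pick up."""
        return super().__new__(cls)


class ExampleSchema(Base):
    """Schema to demonstrate doc injection."""

    def __new__(cls, *args, **kwargs):
        # Behavior is unchanged: delegate straight to the parent.
        return super().__new__(cls, *args, **kwargs)

    # Explicit empty string so that no inherited docstring is looked up.
    __new__.__doc__ = ""


print(repr(inspect.getdoc(ExampleSchema.__new__)))  # ''
print(inspect.getdoc(ExampleSchema))  # Schema to demonstrate doc injection.
```

If the extra text disappears from the generated page with an override like this, that would support the __new__ explanation.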

cosmicBboy commented 6 days ago

@kernelpernel any updates on this issue?