multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0

Question: Validation with Dataframe iterators #48

Closed: bhavaniravi closed this issue 3 years ago

bhavaniravi commented 3 years ago

When reading large files with pandas, we often set iterator=True and a chunksize:

pd.read_csv(file_object, sep=self.sep, header=None, iterator=True, chunksize=CHUNK_SIZE)

I have a use case where I have to discard the whole file if even one value is corrupt. I am leaning towards implementing the validator as a generator, like the following:

def validate(df_iterator):
    for df in df_iterator:
        errors = schema.validate(df)
        if errors:
            raise DataInvalidError(f"Invalid data found, {errors}")
        yield df

The problem with the above is that if there is an error in the 3rd chunk, the first 2 chunks have already been yielded and processed.

How to handle this?

multimeric commented 3 years ago

I think your current approach is a good one. I'm not sure how you could retrospectively validate data you've already yielded in any situation. Even in the simplest case where you are using a for loop (and not using this library at all), you will have to either validate all the data first and then return it as one big list, or use a generator and risk returning data which you later find isn't valid:

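from typing import Generator, List

class DataInvalidError(Exception):
    """Placeholder exception, defined here so the example runs."""

def generator_validate(data: List[int]) -> Generator[int, None, None]:
    # Lazy: valid items are yielded before a later invalid item is seen
    for item in data:
        if item % 3 == 0:
            raise DataInvalidError()
        else:
            yield item

def list_validate(data: List[int]) -> List[int]:
    # Eager: nothing is returned unless every item is valid
    ret = []
    for item in data:
        if item % 3 == 0:
            raise DataInvalidError()
        else:
            ret.append(item)
    return ret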

Unless I am understanding incorrectly?
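
To make the trade-off concrete, here is a minimal sketch that runs both strategies on the same input and records what a consumer has processed by the time validation fails. The DataInvalidError class, the sample data, and the processed_* lists are assumptions made for illustration; they are not part of PandasSchema.

```python
from typing import Generator, List


class DataInvalidError(Exception):
    """Hypothetical exception type; the thread does not define it."""


def generator_validate(data: List[int]) -> Generator[int, None, None]:
    # Lazy validation: each item is handed out as soon as it passes.
    for item in data:
        if item % 3 == 0:
            raise DataInvalidError(f"invalid item: {item}")
        yield item


def list_validate(data: List[int]) -> List[int]:
    # Eager validation: the full list is checked before anything is returned.
    ret = []
    for item in data:
        if item % 3 == 0:
            raise DataInvalidError(f"invalid item: {item}")
        ret.append(item)
    return ret


data = [1, 2, 3, 4]  # the third item is invalid

processed_lazily: List[int] = []
try:
    for item in generator_validate(data):
        processed_lazily.append(item)  # stands in for real processing
except DataInvalidError:
    pass

processed_eagerly: List[int] = []
try:
    for item in list_validate(data):
        processed_eagerly.append(item)
except DataInvalidError:
    pass

print(processed_lazily)   # [1, 2]  (work done before the error surfaced)
print(processed_eagerly)  # []      (nothing processed, but all data held in memory)
```

The generator keeps memory use flat but lets the first two items through; the list version lets nothing through but has to hold everything at once, which is what chunked reading is meant to avoid.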

bhavaniravi commented 3 years ago

You're right, @TMiguelT. Thankfully, my only job after validation is to insert into a DB, so I am using one atomic transaction for all inserts. That should do for now.

multimeric commented 3 years ago

If you are using transactions, then could you not just roll back as soon as you hit a validation warning?

bhavaniravi commented 3 years ago

Yup, that's the plan.
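
For anyone landing here later, a rough sketch of that plan, using the standard-library sqlite3 module as a stand-in for the real database. Everything specific here is an assumption for illustration: the readings table, the "negative values are invalid" rule (which takes the place of schema.validate(df)), and plain lists in place of DataFrame chunks.

```python
import sqlite3
from typing import Generator, Iterable, List


class DataInvalidError(Exception):
    """Hypothetical exception type; the thread does not define it."""


def validate(chunks: Iterable[List[int]]) -> Generator[List[int], None, None]:
    # Stand-in for schema.validate(df): a value counts as invalid if negative.
    for chunk in chunks:
        errors = [value for value in chunk if value < 0]
        if errors:
            raise DataInvalidError(f"Invalid data found, {errors}")
        yield chunk


def load(conn: sqlite3.Connection, chunks: Iterable[List[int]]) -> bool:
    """Insert all chunks in one transaction; return False if any chunk is invalid."""
    try:
        # As a context manager, the connection commits if the block finishes
        # and rolls back if an exception escapes it.
        with conn:
            for chunk in validate(chunks):
                conn.executemany(
                    "INSERT INTO readings (value) VALUES (?)",
                    [(value,) for value in chunk],
                )
    except DataInvalidError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value INTEGER)")
conn.commit()

# The third chunk is invalid, so the first two inserts are rolled back.
ok_bad = load(conn, [[1, 2], [3, 4], [5, -1]])
count_after_bad = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

# A fully valid file is committed as a whole.
ok_good = load(conn, [[1, 2], [3, 4]])
count_after_good = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

print(ok_bad, count_after_bad)    # False 0
print(ok_good, count_after_good)  # True 4
```

This keeps the generator-based validator unchanged: chunks that were already yielded do get inserted, but the enclosing transaction undoes them when a later chunk fails.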