multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0

Validation output is different in Mac and Linux #27

Closed deercoder closed 3 years ago

deercoder commented 4 years ago

I'm using the same code and schema, but the validation output differs considerably between operating systems (Mac and Linux). I suspect this is a bug somewhere in the code. Could you please give some suggestions? Thanks!

Mac output:
['row: 2, column: "person_id" does not contain a valid value\n',
 'row: 3, column: "end_date" does not contain a valid value\n',
 'row: 4, column: "end_date" does not contain a valid value\n',
 'row: 5, column: "end_date" does not contain a valid value\n',
 'row: 6, column: "end_date" does not contain a valid value\n']

Linux output:
['row: 2, column: "person_id" does not contain a valid value\n',
 'row: 2, column: "start_date" does not contain a valid value\n',
 'row: 3, column: "start_date" does not contain a valid value\n',
 'row: 3, column: "end_date" does not contain a valid value\n',
 'row: 4, column: "start_date" does not contain a valid value\n',
 'row: 4, column: "end_date" does not contain a valid value\n',
 'row: 5, column: "end_date" does not contain a valid value\n',
 'row: 6, column: "start_date" does not contain a valid value\n',
 'row: 6, column: "end_date" does not contain a valid value\n']

In the Linux output, start_date is flagged as invalid on almost every row, while on Mac it never is. Both machines use the same version, pandas_schema==0.3.4.

multimeric commented 4 years ago

Hmm that's very strange. I wonder if it relates to pandas being compiled differently or something related. I can't do much with just the warnings, but would you mind finding the minimal code (including the DataFrame you're validating) that causes these errors? Ideally add them as tests, then I'll add OSX as a travis test so it picks up stuff like this.

deercoder commented 4 years ago

Thanks for your reply. I spent some time on it and found that it's related to the datetime validation. I use the rule below to accept either an ISO 8601 timestamp or a plain date. It works perfectly on Mac but not on Linux: on Mac every ISO 8601 value passes validation, while on Linux they all fail.
Column('start_date', [DateFormatValidation('%Y-%m-%dT%H:%M:%S.%f%z') | DateFormatValidation('%Y-%m-%d')]),

After I changed the rule to the one below, replacing %z with a literal Z, both platforms pass the validation. I'm not sure why this happens:
Column('start_date', [DateFormatValidation('%Y-%m-%dT%H:%M:%S.%fZ') | DateFormatValidation('%Y-%m-%d')]),
Some of the test data looks like 2013-06-14T04:00:00.000Z and 2014-06-29T04:00:00.000Z. The other validation rules are simple ones, and the problem only shows up with this date validation.
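The difference between the two format strings can be checked outside PandasSchema with plain datetime.strptime. The following is a minimal sketch; the helper name matches_format is made up for illustration, and it assumes that DateFormatValidation ultimately relies on strptime, which the thread implies but does not show. One thing worth checking: the Python documentation states that %z only accepts a literal Z since Python 3.7, so a difference in interpreter version between the two machines could also produce this behaviour. The thread does not say which Python versions were in use.

```python
import sys
from datetime import datetime

# Sample value taken from the thread.
SAMPLE = "2013-06-14T04:00:00.000Z"

FORMAT_WITH_OFFSET = "%Y-%m-%dT%H:%M:%S.%f%z"    # original rule, uses %z
FORMAT_WITH_LITERAL_Z = "%Y-%m-%dT%H:%M:%S.%fZ"  # working rule, literal "Z"


def matches_format(value, date_format):
    """Return True if strptime can parse value with date_format.

    Hypothetical helper for this example, not part of PandasSchema.
    """
    try:
        datetime.strptime(value, date_format)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    # Print the environment so results from two machines can be compared.
    print("python:", sys.version_info[:3], "platform:", sys.platform)
    print("literal Z:", matches_format(SAMPLE, FORMAT_WITH_LITERAL_Z))
    # The result of this line is the one that may differ between environments.
    print("%z       :", matches_format(SAMPLE, FORMAT_WITH_OFFSET))
```

Running this on both machines and comparing the printed Python version alongside the two results should show whether the operating system or the interpreter version is the deciding factor.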

multimeric commented 4 years ago

Sorry, somehow I didn't get a notification that you replied. Thanks for the examples, they should make tracking this down much easier.

multimeric commented 4 years ago

Upon further inspection, this is an issue throughout the Python ecosystem, because strptime depends on the platform's C library implementation. This is discussed here. However, I'm not certain which field in the format string is causing this exact issue. If you work it out, you might be able to make a DateFormatValidation subclass that uses some platform-specific behaviour to work around it. I don't think it's something I'll fix in PandasSchema, though, because it's really a Python bug, and I'm hesitant to pull in a date parsing library just for this. I could possibly be convinced to, though.
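As a rough illustration of a workaround that avoids both platform checks and an extra date parsing library, the sketch below normalises a trailing Z to +0000 before parsing, so that %z only ever sees a numeric offset. This is a different technique from the subclass suggested above, not the library's own approach, and the function name is_valid_date is hypothetical. It uses only the standard library.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"
DATE_FORMAT = "%Y-%m-%d"


def is_valid_date(value):
    """Return True for an ISO 8601 timestamp with offset, or a plain date.

    Hypothetical helper: a trailing "Z" is rewritten to "+0000" so the
    check does not depend on how %z treats the letter "Z".
    """
    if not isinstance(value, str):
        return False
    candidate = value[:-1] + "+0000" if value.endswith("Z") else value
    for date_format in (ISO_FORMAT, DATE_FORMAT):
        try:
            datetime.strptime(candidate, date_format)
            return True
        except ValueError:
            continue
    return False


if __name__ == "__main__":
    for sample in ("2013-06-14T04:00:00.000Z", "2014-06-29", "14/06/2013"):
        print(sample, is_valid_date(sample))
```

A function like this could probably be plugged into a schema through the library's CustomElementValidation, which takes a callable and a message, in place of the two DateFormatValidation rules; that wiring is untested here.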