reubano / meza

A Python toolkit for processing tabular data
MIT License
416 stars 32 forks

type casting assumes 'month first' for ambiguous dates #16

Closed amirouche closed 2 years ago

amirouche commented 6 years ago

Right now, type detection infers date, datetime, or time types without taking into account that 01/02/2002 can be either February 1st or January 2nd, depending on whether the date format is DD/MM/YYYY or MM/DD/YYYY, respectively.

This might be undecidable in some rare cases, but in general it's possible, given enough values, to decide between the two formats: a value such as 13/02/2002 can only be DD/MM/YYYY, because 13 is not a valid month.

One possible way to handle this in meza is to use a higher-level datatype to represent the type of a field, replacing the current string representation. For instance:

from collections import namedtuple
datetime_type = namedtuple('DateTimeType', ['format'])

Basically, use a representation that takes optional extra information about the type.
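To make that concrete, here is a minimal sketch, assuming the extra information is a strptime-style format string. The `format` field comes from the snippet above; the `cast` helper is hypothetical and not part of meza.

```python
from collections import namedtuple
from datetime import datetime

# Type descriptor that carries the detected format alongside the type itself.
DateTimeType = namedtuple('DateTimeType', ['format'])


def cast(value, field_type):
    """Hypothetical helper: parse `value` with the format stored on the type."""
    return datetime.strptime(value, field_type.format)


day_first = DateTimeType(format='%d/%m/%Y')
month_first = DateTimeType(format='%m/%d/%Y')

print(cast('01/02/2002', day_first))    # 2002-02-01 00:00:00
print(cast('01/02/2002', month_first))  # 2002-01-02 00:00:00
```

The same string yields two different dates depending only on the format attached to the field type, which is the information the current string representation cannot carry.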

reubano commented 6 years ago

Good point. In fact, the date format detection could be implemented in a way similar to that of type detection. I.e., test dates record by record until a given confidence threshold has been crossed.
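A rough sketch of what such a record-by-record check could look like, using only the standard library. The function name `detect_dayfirst`, the `threshold` and `min_votes` parameters, and their defaults are all assumptions made for illustration; none of this exists in meza.

```python
import re

# Two 1-2 digit fields followed by a year, e.g. 01/02/2002 or 1-2-02.
DATE_RE = re.compile(r'\s*(\d{1,2})\D(\d{1,2})\D\d{2,4}')


def detect_dayfirst(values, threshold=0.95, min_votes=3):
    """Return True (day first), False (month first), or None if undecided.

    Hypothetical sketch: each unambiguous value votes for one ordering, and
    scanning stops as soon as one ordering holds at least `threshold` of the
    votes after `min_votes` unambiguous values have been seen.
    """
    day_votes = month_votes = 0

    for value in values:
        match = DATE_RE.match(value)
        if not match:
            continue

        first, second = int(match.group(1)), int(match.group(2))

        if first > 12 >= second:
            day_votes += 1    # e.g. 13/02/2002 can only be DD/MM
        elif second > 12 >= first:
            month_votes += 1  # e.g. 02/13/2002 can only be MM/DD
        else:
            continue          # ambiguous (or invalid), so it carries no vote

        total = day_votes + month_votes

        if total >= min_votes:
            if day_votes / total >= threshold:
                return True
            if month_votes / total >= threshold:
                return False

    return None


print(detect_dayfirst(['01/02/2002', '13/02/2002', '25/12/2001', '31/01/2003']))  # True
print(detect_dayfirst(['01/02/2002', '03/04/2002']))  # None
```

Returning None for the undecided case leaves the caller free to fall back to a configured default.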

reubano commented 6 years ago

To clarify, process.type_cast calls convert.to_date and convert.to_datetime, and those functions then eventually call dateutil.parser.parse. parse takes an optional dayfirst parameter which defaults to False. This should be configurable, and optionally detected by a new function.
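For reference, this is roughly how the dayfirst flag changes the result. The `parse_date` wrapper is hypothetical and only illustrates forwarding a configurable flag; it is not meza's actual convert.to_date signature. The sketch requires the python-dateutil package.

```python
from dateutil.parser import parse


def parse_date(content, dayfirst=False):
    """Hypothetical wrapper: forward a configurable `dayfirst` to dateutil."""
    return parse(content, dayfirst=dayfirst).date()


print(parse_date('01/02/2002'))                 # 2002-01-02 (month first, the default)
print(parse_date('01/02/2002', dayfirst=True))  # 2002-02-01
print(parse_date('13/02/2002'))                 # 2002-02-13 (13 cannot be a month)
```

Only ambiguous values are affected by the flag, which is why a wrong default can go unnoticed on data that happens to contain unambiguous dates.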

amirouche commented 2 years ago

:100: