Infer fixed width format using the whole file

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License


Infer fixed width format using the whole file #21970

Open aplavin opened 6 years ago

aplavin commented 6 years ago

As of now, read_fwf infers the field positions using only the first 100 rows of the file, and this number is not easily modifiable. However, if a field has values for only a few rare objects, it will be missed completely! So it would be great if pandas used many more rows by default (100 is quite a small number) - why not something like 10000? Or at least provide a way to increase this number and/or a parameter like infer_using_whole_file=False.
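To make the failure mode concrete, here is a minimal pure-Python sketch of sample-based column inference. It is a simplified illustration, not pandas' actual implementation: the function name infer_colspecs and the three sample lines are hypothetical, and whitespace is assumed to be the only delimiter.

```python
import re


def infer_colspecs(lines, infer_nrows=100):
    """Guess (start, end) column spans from the first `infer_nrows` lines.

    Every character position holding non-blank text in any sampled line is
    marked as filled; each run of filled positions becomes one column.
    """
    sample = lines[:infer_nrows]
    width = max(len(line) for line in sample)
    mask = [False] * (width + 1)  # trailing False closes the last column
    for line in sample:
        for match in re.finditer(r"\S+", line):
            for pos in range(match.start(), match.end()):
                mask[pos] = True

    colspecs, start = [], None
    for pos, filled in enumerate(mask):
        if filled and start is None:
            start = pos  # a column begins here
        elif not filled and start is not None:
            colspecs.append((start, pos))  # the column ended just before pos
            start = None
    return colspecs


# Hypothetical data: only the last row has a value in the third field.
lines = ["aaa   1", "bbb   2", "ccc   3 rare"]
print(infer_colspecs(lines, infer_nrows=2))  # sample ends before the rare value
print(infer_colspecs(lines, infer_nrows=3))  # sample covers every row
```

With the two-row sample the third field never shows up in the mask, so no column is created for it. That is the situation described above when a rare value first appears after row 100.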

If anyone finds this issue and needs an immediate solution - I personally use monkey-patching:

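import pandas as pd
# Wrap the original method so that inference samples 100000 rows instead of the default 100.
_detect_colspecs = pd.io.parsers.FixedWidthReader.detect_colspecs
pd.io.parsers.FixedWidthReader.detect_colspecs = lambda self, n=100000, skiprows=None: _detect_colspecs(self, n, skiprows)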

gfyoung commented 6 years ago

Indeed! Would be nice to parameterize how many rows you need to infer field positions.

aldanor commented 5 years ago

Yes, this messes things up quite often: columns where strings get longer towards the end of the data may get truncated.

amotl commented 1 year ago

Hi there,

first things first: Thanks a stack for outlining this important detail, it just saved our implementation. 🌻

"Provide a way to increase this number and/or a parameter like infer_using_whole_file=False."

"Would be nice to parameterize how many rows you need to infer field positions."

On this matter, we wanted to report that the infer_nrows argument has since been added to read_fwf. While it still does not satisfy the need for an infer_using_whole_file flag, the original gist can now be implemented as read_fwf(colspecs="infer", infer_nrows=100000), which may be more convenient than the monkey patch, depending on the scenario.
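A minimal sketch of that call, assuming a pandas version that provides infer_nrows. The in-memory sample data is made up for illustration: only its last row carries a value in the third field.

```python
import io

import pandas as pd

# Hypothetical fixed-width data; the third field is blank in the first two rows.
data = "aaa   1\nbbb   2\nccc   3 rare\n"

# Sampling only two rows: the inferred layout cannot see the third field.
narrow = pd.read_fwf(io.StringIO(data), header=None, colspecs="infer", infer_nrows=2)

# Sampling far more rows than the file contains: all fields are detected.
wide = pd.read_fwf(io.StringIO(data), header=None, colspecs="infer", infer_nrows=100000)

print(narrow.shape)
print(wide.shape)
```

The first frame should come back with two columns and the second with three. For real files, infer_nrows only needs to be at least as large as the number of rows for the whole file to be taken into account.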

With kind regards, Andreas.

amotl commented 1 year ago

"the need for an infer_using_whole_file flag"

It looks like using infer_nrows=np.Infinity works well (np.inf is the same value, and the only spelling still available since NumPy 2.0), see https://github.com/earthobservations/wetterdienst/commit/a4b15125d.