pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Feature Request: "Skiprows" by a condition or set of conditions #32072

Open devrimcavusoglu opened 4 years ago

devrimcavusoglu commented 4 years ago

Code Sample, a copy-pastable example if possible

# data file path: foo/bar/data.csv
  col_1 col_2  ...  col_n
0   940    45  ...  0.023
1  1040    52  ...  0.430
2  1530    45  ...  0.302
3   753    43  ...  0.450
4   890    32  ...  0.023

schema = {
    "col_1": int,
    ...,
    "col_n": float,
}
# Proposed (not yet existing) syntax: a conditional mask expressed
# over column names, before any DataFrame object exists.
mask = (col_1 >= 1000)
df = pd.read_csv('foo/bar/data.csv', skiprows=mask, dtype=schema)

Problem description

Pandas read methods currently support skipping rows by index via the skiprows parameter. It would be a useful extension if skiprows could also take conditional statements, just as many pandas objects do. This feature may not be useful to everyone, but it would certainly ease the pain of people dealing with many (large) files, and in some situations it could drastically reduce memory usage.

I am opening this issue hoping for a welcome on the dev side; not because it looked fancy at first glance, but because I think it could improve many people's workflows. It might later serve as a schema reference as well, similar to schema validation of a data set.

NOTE: I am uncertain how this could be achieved, or whether it can be done at all; it may not fit the pandas module. Please also consider this as brainstorming.

Expected Output

This requires passing column names & dtype (schema) before reading the file. I am not sure how best to convey the mask for skiprows; it should be similar to pandas.DataFrame conditions & masks, but we can only refer to column names, since no pandas.DataFrame object exists yet.

  col_1 col_2  ...  col_n
0  1040    52  ...  0.430
1  1530    45  ...  0.302
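
One way to picture the requested behaviour is a predicate evaluated against each parsed row, analogous to a DataFrame boolean mask. The sketch below is purely hypothetical; skiprows accepts nothing like this today, and the row-dict callable is an assumption made only for illustration:

# Hypothetical sketch only: read_csv does not accept this today.
# A predicate over parsed row contents decides which rows to skip.
df = pd.read_csv(
    'foo/bar/data.csv',
    dtype=schema,
    skiprows=lambda row: row['col_1'] < 1000,  # skip rows below the cutoff
)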

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.18.0
pytz : 2019.2
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.12
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.0
lxml.etree : 4.3.4
html5lib : None
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader : None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.0
MarcoGorelli commented 4 years ago

Thanks @devrimcavusoglu

Could you show an example with input, an example of a mask, and your expected output?

TomAugspurger commented 4 years ago

This kind of "predicate pushdown" seems out of scope for pandas. What do @WillAyd and @gfyoung think?

@devrimcavusoglu read_parquet has some support for this with the "filters" argument.
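
For illustration, a hedged sketch of what that looks like (the parquet path here is hypothetical; pandas forwards filters to the pyarrow engine):

import pandas as pd

# Predicate pushdown at read time: pyarrow applies the filter while
# scanning, so filtered-out rows never reach the DataFrame.
df = pd.read_parquet('foo/bar/data.parquet',
                     filters=[('col_1', '>=', 1000)])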

jreback commented 4 years ago

it’s probably very easy to do this; i thought we actually already supported this

gfyoung commented 4 years ago

Yes, we should already support something like this (we document it pretty clearly).

TomAugspurger commented 4 years ago

Are you referring to skiprows? IIUC, this is requesting we skip based on the result of a callable applied to the parsed value in a column.
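
For context, skiprows does already accept a callable, but it is evaluated against the row index, not the parsed contents. A minimal sketch of today's behaviour:

import pandas as pd

# The callable receives the row *index* and returns True to skip;
# this keeps the header (row 0) and drops every other data row,
# without ever looking at the row's contents.
df = pd.read_csv('foo/bar/data.csv',
                 skiprows=lambda i: i > 0 and i % 2 == 0)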

devrimcavusoglu commented 4 years ago

I've added the examples of input and expected output to make it more clear.

gfyoung commented 4 years ago

Ah, okay, I see (thanks for the clarification @devrimcavusoglu). Semantically, this is outside the scope of skiprows, which filters based on row number (and not on the row contents).

That being said, I don't want to necessarily shut this conversation down simply because it currently is out of scope. I'm open to having a discussion about expanding it.

devrimcavusoglu commented 4 years ago

> Ah, okay, I see (thanks for the clarification @devrimcavusoglu). Semantically, this is outside the scope of skiprows, which filters based on row number (and not on the row contents).
>
> That being said, I don't want to necessarily shut this conversation down simply because it currently is out of scope. I'm open to having a discussion about expanding it.

Thanks for the comment @gfyoung. My initial motivation was to improve performance (reducing memory pressure) and to offer a more user-friendly way to trim out part of the data while reading.

I'd also like to expand the discussion on what can and cannot be done. Perhaps it's possible to implement something similar, or something else may emerge from this idea. I am a very active pandas user, but I'm not that familiar with pandas core :). As the discussion goes on, I'll dive deeper into the source code and start exploring it.

gfyoung commented 4 years ago

> My initial motivation was to improve performance (reducing memory pressure) and to offer a more user-friendly way to trim out part of the data while reading.

Those are certainly things you could contribute without any objections! That will also give you a chance to take a look at how we implement skiprows (across engines).

TomAugspurger commented 4 years ago

Jumping out of the C parser to execute a Python callable will kill performance, right?

@jorisvandenbossche do you know if the Arrow CSV reader plans to add support for filters like parquet?

Liam3851 commented 4 years ago

I've often had the use case, but we always do something like:

# Parse in chunks (keeping the parsing in C), filter each chunk,
# then concatenate only the surviving rows.
gen = pd.read_csv('foo/bar/data.csv', dtype=schema, chunksize=10_000_000)
df = pd.concat((x.query("col_1 >= 1000") for x in gen), ignore_index=True)

I think this gets at the main goals: it keeps the parsing in C, and it pushes the predicate down to each chunk for subsetting before running the concat. The only downside is that choosing a chunksize that fits in RAM given the data is left to the user, whereas with full predicate pushdown that wouldn't be necessary (though that would come, I think, at potentially a lot of complexity to reach C speed).

jreback commented 4 years ago

@Liam3851 nice example

would you do a PR to the read_csv doc-string?

Liam3851 commented 4 years ago

@jreback Happy to. That said, the read_csv docstring is already super-long just covering parameter usage. Perhaps the user guide or cookbook would be more appropriate?

jreback commented 4 years ago

yeah either of those are also ok

Liam3851 commented 4 years ago

Huh. It turns out the cookbook already has an entry for "Reading only certain rows of a CSV chunk by chunk", but it links to a Stack Overflow answer from 6 years ago that uses the defunct .ix syntax, and the entry's name could use a change to reflect what we want to do (read a subset of data based on its contents) rather than how to do it (use chunks). I'll add a more modern example to the cookbook in that slot.

devrimcavusoglu commented 4 years ago

@jreback @Liam3851 We could also update and expand the Stack Overflow answer itself: make the changes, update the links, and add an example of query-style filtering. That way the cookbook would still link to the answer in its updated form. What do you think?

Liam3851 commented 4 years ago

take

WillySong commented 4 years ago

Hey @Liam3851, are you still working on this issue? If not, could you please release it, as my team is interested in implementing it.

WillySong commented 4 years ago

take

jorisvandenbossche commented 4 years ago

> @jorisvandenbossche do you know if the Arrow CSV reader plans to add support for filters like parquet?

Yes, this is coming to pyarrow (it will probably not get into the 0.17 release next week, but certainly the next one).

So I think the way to go here is to work on enabling the pyarrow engine in read_csv, so that a filter can be passed to pyarrow to actually filter while reading (https://github.com/pandas-dev/pandas/pull/31817).
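
For reference, a hedged sketch of what that could look like once the pyarrow dataset API handles CSV (the exact API here is assumed from the parquet dataset interface):

import pyarrow.dataset as ds

# Filter while scanning the CSV, before materializing a DataFrame.
dataset = ds.dataset('foo/bar/data.csv', format='csv')
table = dataset.to_table(filter=ds.field('col_1') >= 1000)
df = table.to_pandas()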