Open devrimcavusoglu opened 4 years ago
Thanks @devrimcavusoglu
Could you show an example with input, an example of a mask, and your expected output?
This kind of "predicate pushdown" seems out of scope for pandas. What do @WillAyd and @gfyoung think?
@devrimcavusoglu read_parquet has some support for this with the "filters" argument.
it’s probably very easy to do this; i thought we actually already supported this
Yes, we should already support something like this (we document it pretty clearly).
Are you referring to skiprows? IIUC, this is requesting that we skip based on the result of a callable applied to the parsed value in a column.
I've added the examples of input and expected output to make it more clear.
Ah, okay, I see (thanks for the clarification @devrimcavusoglu). Semantically, this is outside the scope of skiprows, which filters based on row number (and not on the row contents).
That being said, I don't want to necessarily shut this conversation down simply because it currently is out of scope. I'm open to having a discussion about expanding it.
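To make the distinction concrete, here is a small sketch of what a callable skiprows does today: it receives the row *number*, never the row contents:

```python
# Current skiprows behavior: the callable gets the 0-based row index
# (the header is row 0), so filtering is purely positional.
import io

import pandas as pd

data = io.StringIO("a,b\n1,x\n2,y\n3,z\n4,w\n")

# Keep the header, then skip every second data row by index.
df = pd.read_csv(data, skiprows=lambda i: i > 0 and i % 2 == 0)
```

The callable never sees the parsed values, which is why the request here (filtering on column contents) is a different feature.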
Thanks for the comment @gfyoung. My first intention was to improve performance (reducing memory pressure) and to offer a more user-friendly convention for interaction (making it easier to trim out part of the data).
I'd also like to expand the discussion of what can and cannot be done. Maybe it's possible to implement something similar, or something else will emerge from this idea. I am a very active pandas user, but I am not that familiar with pandas core :). As the discussion goes on, I'll dive deeper into the source code and start exploring it.
> My first intention was to improve performance (reducing memory pressure) and to offer a more user-friendly convention for interaction (making it easier to trim out part of the data).

Those are certainly things you could contribute without any objections! That will also give you a chance to take a look at how we implement skiprows (across engines).
Jumping out of the C parser to execute a Python callable will kill performance, right?
@jorisvandenbossche do you know if the Arrow CSV reader plans to add support for filters like parquet?
I've often had the use case, but we always do something like:
gen = pd.read_csv('foo/bar/data.csv', dtype=schema, chunksize=10000000)
df = pd.concat((x.query("col_1 >= 1000") for x in gen), ignore_index=True)
I think this gets at the main goals: it keeps the parsing in C, and it pushes the predicate down to each chunk for subsetting before running the concat. The only downside is that choosing a chunksize that fits in the user's RAM is left to the user; if the predicate were pushed down fully that wouldn't be necessary (but getting that to C speed would, I think, add a lot of complexity).
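A self-contained version of the chunked pattern above, using an in-memory buffer in place of 'foo/bar/data.csv' and a small chunksize for illustration:

```python
# Chunked read + per-chunk query: the parser stays in C, and each chunk is
# filtered before the concat, so peak memory is bounded by the chunksize
# rather than by the whole file.
import io

import pandas as pd

csv = "col_1,col_2\n" + "\n".join(f"{i},{i * 10}" for i in range(2000))

gen = pd.read_csv(io.StringIO(csv), chunksize=500)
df = pd.concat((x.query("col_1 >= 1000") for x in gen), ignore_index=True)
```

The same pattern works with a dtype/schema argument as in the original snippet; it is omitted here only for brevity.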
@Liam3851 nice example
would you do a PR to the read_csv doc-string?
@jreback Happy to, that said the read_csv docstring is super-long just holding the parameter usage. Perhaps the user guide or cookbook would be more appropriate?
yeah either of those are also ok
Huh. It turns out the cookbook already has an entry for "Reading only certain rows of a CSV chunk by chunk"; but it's a link to a Stack Overflow answer from 6 years ago that uses the defunct .ix syntax, and the entry's name could perhaps be changed to describe what we want to do (read a subset of data based on its contents) rather than how to do it (use chunks). I'll add a more modern example to the cookbook in that slot.
@jreback @Liam3851 We may also update and expand the answer on Stack Overflow by making the changes, updating the links, and adding an example usage of a query-like filter. That way the cookbook would still link to the answer in an updated form. What do you think?
take
Hey, @Liam3851 are you still working on this issue? If not, could you please release it, as my team is interested in implementing it.
take
> @jorisvandenbossche do you know if the Arrow CSV reader plans to add support for filters like parquet?
Yes, this is coming to pyarrow (it will probably not get into the 0.17 release next week, but certainly the next one).
So I think the way to go here is to work on enabling the pyarrow engine in read_csv, so that a filter can be passed to pyarrow to actually filter while reading (https://github.com/pandas-dev/pandas/pull/31817).
Code Sample, a copy-pastable example if possible
Problem description
Pandas read methods currently support skipping rows by index with the parameter skiprows. It would be a good feature if skiprows were extended so that it can take conditional statements, just like many pandas objects. I anticipate that this feature may not be useful to everyone, but it would certainly ease the pain of people dealing with many (large) files, and I think memory usage would drop drastically in some situations.
I am opening this issue hoping for a welcome on the dev side; I'm not opening it because it seemed like a fancy request at first glance, but because I think it may improve many people's workflows. It might later serve as a schema reference as well, similar to schema validation of a data set.
NOTE: I am uncertain how this can be achieved, or whether it can be done at all; it may not be applicable to pandas. Consider this also as brainstorming.
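To illustrate the request with concrete input and output, here is what the filtering amounts to today (the filter runs after the full parse, which is exactly the memory cost the proposal wants to avoid; column names are illustrative):

```python
# Today's behavior: parse everything, then drop rows by a content-based mask.
import io

import pandas as pd

data = io.StringIO("name,score\nann,10\nbob,3\ncat,7\n")

df = pd.read_csv(data)      # all rows are parsed into memory first
df = df[df["score"] >= 5]   # the condition is applied only afterwards
```

The proposal is for the second step to happen during the first, so rows failing the condition never occupy memory.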
Expected Output
This requires passing column names & dtype (schema) before reading the file, but I am not sure how to convey the mask for skiprows. It should be similar to pandas.DataFrame conditions & masks, but we may only pass column names since there is no pandas.DataFrame object.
Output of pd.show_versions()
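One way the "mask conveyed by column name" idea could be emulated today is a small helper (hypothetical, not a pandas API) that maps column names to callables and applies them chunk by chunk:

```python
# Hypothetical helper: convey per-column masks by name, applied per chunk
# so that only matching rows are accumulated. read_csv_where and its
# signature are illustrative, not part of pandas.
import io

import pandas as pd


def read_csv_where(buf, masks, chunksize=2, **kwargs):
    """masks: dict mapping column name -> callable on that column's Series."""
    def keep(chunk):
        mask = pd.Series(True, index=chunk.index)
        for col, fn in masks.items():
            mask &= fn(chunk[col])
        return chunk[mask]

    gen = pd.read_csv(buf, chunksize=chunksize, **kwargs)
    return pd.concat((keep(c) for c in gen), ignore_index=True)


data = io.StringIO("a,b\n1,5\n2,6\n3,7\n4,8\n")
df = read_csv_where(data, {"a": lambda s: s >= 2, "b": lambda s: s < 8})
```

This only bounds memory by the chunksize; true predicate pushdown inside the C parser would remove even that knob.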