Open tanzeyy opened 1 month ago
Hey, I am trying to contribute to Ray and I am willing to try this as my first issue. I wanted to understand what you exactly want with the output or what exactly you are looking to solve from this issue. As far as I understand you want Karen to group by and then provide a sorted list? Correct me if I am wrong
import ray
import pandas as pd
data = [
{"tag": "kitty", "number": 12},
{"tag": "karen", "number": 1},
{"tag": "karen", "number": 3},
{"tag": "times", "number": 2},
]
ds = ray.data.from_items(data)
def process_data(df):
return (df.groupby("tag")
.agg({"number": "sum"})
.reset_index()
.sort_values("number", ascending=False))
results = (
ds.to_pandas()
.pipe(process_data)
)
print(results)
This code should give you the output of groupby
as well as sorted
Hey, I am trying to contribute to Ray and I am willing to try this as my first issue. I wanted to understand what you exactly want with the output or what exactly you are looking to solve from this issue. As far as I understand you want Karen to group by and then provide a sorted list? Correct me if I am wrong
import ray import pandas as pd data = [ {"tag": "kitty", "number": 12}, {"tag": "karen", "number": 1}, {"tag": "karen", "number": 3}, {"tag": "times", "number": 2}, ] ds = ray.data.from_items(data) def process_data(df): return (df.groupby("tag") .agg({"number": "sum"}) .reset_index() .sort_values("number", ascending=False)) results = ( ds.to_pandas() .pipe(process_data) ) print(results)
This code should give you the output of
groupby
as well assorted
Thank you for your reply!
I need to process data at a scale of hundreds of gigabytes, so the initial ds.to_pandas()
approach might not be feasible for me. Also, my workflow involves complex processing on the grouped data, which is why I intended to use map_groups
.
Additionally, I managed to work around this issue by returning an arrow table in the function provided to map_groups
, but it should be a bug related to the data structures used in sort
operation I think.
Thank you for the acknowledgment. I am glad you found a workaround. If you are working on a large dataset instead of pandas you can use Modin as well to scale the processing. https://modin.readthedocs.io/en/latest/
Hi @scottjlee, could I take this one? I think we probably we miss some code to handle panda.Dataframe in sort function
I did some initial research on this, here are my thoughts:
File "/Users/xingyulong/Desktop/virtualenvs/ray_dev/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py", line 145, in reduce
return BlockAccessor.for_block(mapper_outputs[0]).merge_sorted_blocks(
File "/Users/xingyulong/Desktop/virtualenvs/ray_dev/lib/python3.9/site-packages/ray/data/_internal/arrow_block.py", line 602, in merge_sorted_blocks
blocks = [b for b in blocks if b.num_rows > 0]
File "/Users/xingyulong/Desktop/virtualenvs/ray_dev/lib/python3.9/site-packages/ray/data/_internal/arrow_block.py", line 602, in <listcomp>
blocks = [b for b in blocks if b.num_rows > 0]
File "/Users/xingyulong/Desktop/virtualenvs/ray_dev/lib/python3.9/site-packages/pandas/core/generic.py", line 5902, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'num_rows'
mapper_outputs
is the list of pyarrow.Table
blocks (which is why we use arrow_block.py afterwards) and somehow, the inside data was DataFrame and block type was pyarrow.Table
, so we called arrow_block.py.
Instead, we should invoke code at here https://github.com/ray-project/ray/blob/a1be06346e532e656c128ffd2230f06a4d72679e/python/ray/data/_internal/pandas_block.py#L504-L518
I have a few questions:
results = (
ds.groupby("tag")
.map_groups(lambda g: g, batch_format="pandas")
@scottjlee could you give some thoughts on this?
mapper_outputs is the list of pyarrow.Table blocks (which is why we use arrow_block.py afterwards) and somehow, the inside data was DataFrame and block type was pyarrow.Table, so we called arrow_block.py.
Yes, this sounds like the right analysis of the bug here. The core problem is that in this line, we assume that all of the block types are the same, and we grab the first block to get its corresponding BlockAccessor
.
My initial thoughts for the fix is to add a block normalization call like:
blocks = TableBlockAccessor.normalize_block_types(blocks, format)
to the beginning of SortTaskSpec
, where format
would be the batch/block format from its previous operator.
- what's the data format would be with following code
results = ( ds.groupby("tag") .map_groups(lambda g: g, batch_format="pandas")
In this case, the output block format would be pandas
DataFrames.
- if we add sort, do we have data format info along the way? (I didn't find it in codebase)
Sort
method shouldn't be specifying any other specific data type, it should use the data type from previous operator.
Thanks @xingyu-long !
@scottjlee Thanks for the clarification! Could you assign this to me? I can give it a try. Thanks!
Hi @scottjlee, after spending more time on this issue, I found out after map_groups, we will create empty pyarrow tables and later we use apply udf (https://github.com/ray-project/ray/blob/ee320aa6670514b85e77fce8d6903affcf883cc4/python/ray/data/grouped_data.py#L223) with batch data (so we don't update empty table). so one place we may update beforehand is the beginning of the map_groups(...)
function with TableBlockAccessor.normalize_block_types(blocks, format)
https://github.com/ray-project/ray/blob/ee320aa6670514b85e77fce8d6903affcf883cc4/python/ray/data/grouped_data.py#L192
However, I have trouble to get all blocks there, and don't know how to replace the blocks within the Dataset
.
Could you give some hints on this? or if it's not the place you wanted, could you elaborate it a little more? Thanks!
What happened + What you expected to happen
I am working on handling grouped data and sorting on certain column, and I want to use pandas operations to process the grouped data. However, it seems
sort
operation doesn't work withmap_groups(..., batch_format='pandas')
.Reproduction process:
Run above script will produce
AttributeError: 'DataFrame' object has no attribute 'num_rows'
:After changing
batch_format
topyarrow
, above code would work properly:Moreover, remove
sort
and it will also work:Versions / Dependencies
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.