**Famok** opened 1 month ago
If I understand correctly, your proposed API would result in the following data being logged: `mydataframe/x` with index `timestamps` and a component with `df["x"]` as content, and `mydataframe/y` with index `timestamps` and a component with `df["y"]` as content, both on the `mytimeline` timeline. Is that correct?
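That reading can be sketched in plain Python. This is illustration only, not Rerun's API; the names `mydataframe`, `mytimeline`, and the columns are taken from the example above, and `fan_out` is a made-up helper:

```python
# Hypothetical illustration (not a Rerun API) of how one dataframe could
# fan out into per-column "sub-entities" that share a single time index.

def fan_out(base_path, timeline, timestamps, columns):
    """Split a columnar table into one record per column, keyed by entity path."""
    return {
        f"{base_path}/{name}": {"timeline": timeline, "index": timestamps, "values": values}
        for name, values in columns.items()
    }

records = fan_out(
    "mydataframe",
    "mytimeline",
    timestamps=[0.0, 0.1, 0.2],
    columns={"x": [1, 2, 3], "y": [4, 5, 6]},
)
# One stream per column: mydataframe/x and mydataframe/y, both indexed
# by the same timestamps on the mytimeline timeline.
```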
In general, having a dataframe-based API is a very good fit for our new columnar stuff. I see at least two points here:

- If a `send_dataframe` API ends up logging to multiple "sub-entities" (as I think you suggest here), there would be little performance gain w.r.t. separate `send_columns` calls. Chunks (our new fundamental data structure) always apply to a single entity, so multiple chunks would need to be emitted here in any case. (This is not to say that a convenience API wouldn't be useful.)
- If a `send_dataframe` API logs columns to a single entity but to different components, then we'd need to figure out a mapping from the Python-side column dtype/label to a component type (with the restriction that each component of a single entity must have a unique type). In particular, your example seems ambiguous as to what component type should be used.

Creating sub-entities seems to be the easiest way.
I can't see how the second option would work; I don't know enough about the inner workings of Rerun.

But maybe there is a third option if there was a dataframe entity type? Or is that against the design principles?
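For what it's worth, the mapping problem in the second option can be sketched like this. The dtype table and the component names below are made-up placeholders, not Rerun's actual component types:

```python
# Hypothetical dtype -> component-type mapping; placeholders only.
DTYPE_TO_COMPONENT = {
    "float64": "Scalar",
    "int64": "Scalar",
    "object": "Text",
}

def map_columns(dtypes):
    """Assign one component type per column, enforcing the one-type-per-entity rule."""
    assigned = {}
    for column, dtype in dtypes.items():
        component = DTYPE_TO_COMPONENT[dtype]
        if component in assigned:
            # Two float columns on one entity both map to "Scalar":
            # this is exactly the ambiguity described in the second option.
            raise ValueError(
                f"columns {assigned[component]!r} and {column!r} both map to "
                f"component {component!r}, but an entity may hold only one"
            )
        assigned[component] = column
    return assigned
```

With a single entity, `{"x": "float64", "y": "float64"}` collides immediately, which is why sub-entities (one entity per column) sidestep the problem.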
**Describe the solution you'd like**
I'd like to send dataframes (e.g. pandas and/or Arrow) at once. They have the same timeline but multiple columns (e.g. time, x, y, z), where most often the index is the time, either in µs, seconds, or a `pd.TimedeltaIndex`. Great would be something like:
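The snippet from the original issue is not reproduced here. A hypothetical shape for such a call might look like the following, where `send_dataframe`, its parameters, and the dict-of-lists standing in for the dataframe are all assumptions for illustration, not an existing Rerun API:

```python
# Hypothetical convenience API; nothing here is a real Rerun call.
# A dict of lists stands in for the dataframe, `index` for df.index.

def send_dataframe(entity_path, timeline, index, columns):
    """Log every column under `entity_path`, all sharing the same time index."""
    return [
        (f"{entity_path}/{name}", timeline, list(index), list(values))
        for name, values in columns.items()
    ]

logged = send_dataframe(
    "mydataframe",
    timeline="mytimeline",
    index=[0.0, 0.5, 1.0],  # seconds; could equally be µs or a pd.TimedeltaIndex
    columns={"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]},
)
```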
**Describe alternatives you've considered**
Sending each column in separate calls. This works but might generate more overhead than necessary.