drobison00 opened 1 year ago
I think I understand how it would translate, but I'd really like to see an example framed around a stage, specifically the map operation of the stage (where by "map" I mean `self.process_message` in `mrc.operators.map(self.process_message)`).
@cwharris I think I might need a bit more context on your question. In the context of usage in a stage, here is a snippet from the dfp_training code and how it would be adapted to produce the same behavior as before:
Before

```python
while (control_message.has_task("training")):
    control_message.remove_task("training")
    user_id = control_message.get_metadata("user_id")

    message_meta = control_message.payload()
    with message_meta.mutable_dataframe() as dfm:
        final_df = dfm.to_pandas()

    model = AutoEncoder(**model_kwargs)

    # Only train on the feature columns
    train_df = final_df[final_df.columns.intersection(feature_columns)]

    model.fit(train_df)
```
After

```python
while (control_message.has_task("training")):
    control_message.remove_task("training")
    user_id = control_message.get_metadata("user_id")

    final_df = control_message.payload().read()

    model = AutoEncoder(**model_kwargs)

    # Only train on the feature columns
    train_df = final_df[final_df.columns.intersection(feature_columns)]

    model.fit(train_df)
```
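For illustration only, here is a minimal sketch of what a payload object with a `read()` method might look like. The class name `InMemoryPayload` is hypothetical and not part of the actual API; it just shows the shape of the call site above.

```python
import pandas as pd


class InMemoryPayload:
    """Hypothetical payload wrapper; read() materializes the full DataFrame.

    The name and signature are illustrative, not the real Morpheus API.
    """

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def read(self) -> pd.DataFrame:
        # For an in-memory record this is a trivial return; a backed record
        # would load/deserialize here instead.
        return self._df


payload = InMemoryPayload(pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}))
feature_columns = ["a", "b"]

final_df = payload.read()
train_df = final_df[final_df.columns.intersection(feature_columns)]
```

The point is that the caller no longer needs the `mutable_dataframe()` context manager; materialization is hidden behind `read()`.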
In the context of having a very large, backed data source, something like this would probably be more appropriate:
After -- as_data_loader

```python
while (control_message.has_task("training")):
    control_message.remove_task("training")
    user_id = control_message.get_metadata("user_id")

    # Note, this isn't in the pseudo code.
    df_loader = control_message.payload().as_data_loader(feature_columns)

    model = AutoEncoder(**model_kwargs)

    # Only train on the feature columns
    model.fit(df_loader)
```
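One way `as_data_loader` could work for a large, backed source is to yield column-filtered chunks rather than materializing the whole DataFrame. The sketch below stands in a plain DataFrame for the backed storage; the class name, `chunk_size` parameter, and generator shape are all assumptions, not a committed design.

```python
import pandas as pd


class BackedPayload:
    """Hypothetical backed payload; as_data_loader yields filtered chunks.

    A real implementation would page data in from disk or object storage
    instead of slicing an in-memory DataFrame.
    """

    def __init__(self, df: pd.DataFrame, chunk_size: int = 2):
        self._df = df  # stand-in for the backed data source
        self._chunk_size = chunk_size

    def as_data_loader(self, feature_columns):
        # Restrict to the feature columns once, then yield row chunks.
        cols = self._df.columns.intersection(feature_columns)
        for start in range(0, len(self._df), self._chunk_size):
            yield self._df.iloc[start:start + self._chunk_size][cols]


payload = BackedPayload(pd.DataFrame({"a": range(5), "b": range(5), "c": range(5)}))
chunks = list(payload.as_data_loader(["a", "b"]))
```

A model whose `fit` accepts an iterable of batches could then consume the loader directly, so peak memory is bounded by `chunk_size` rather than the full dataset.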
Question: Since reading/writing might require IO, would it be beneficial to have async versions of the CRUD operations? On the C++ side, I imagine we'd release the gil. That allows other threads to continue processing, but without an async API in Python, we're blocking the current thread from reading more messages while any CRUD is in-flight.
@cwharris Yes, for most operations we'll likely want to drop the GIL or perform them asynchronously. I won't say this is 100% true, because for some record managers we may utilize something like fsspec via pybind for the initial implementation.
In the context of a Python call, other Python threads should get a chance to grab the GIL and start executing as soon as we drop it in the C++ function. Similarly, when the C++ function issues a syscall for something like a file or network read, it should yield as well and let another thread execute.
If I'm understanding the context correctly, the only thing we're likely to be blocked on is processing more messages within the current Python node; I don't know of any current situations where this would be a problem.
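Until a native async API exists, blocking CRUD calls can still be overlapped from Python by offloading them to a thread pool; because a GIL-dropping C++ read releases the interpreter during the IO, the worker threads genuinely run concurrently. A minimal sketch (the `blocking_read` function is a stand-in, not a real API):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_read(record_id: str) -> str:
    # Stand-in for a CRUD call that performs IO (and would drop the GIL
    # on the C++ side while the read is in flight).
    time.sleep(0.05)
    return f"data:{record_id}"


async def read_async(executor: ThreadPoolExecutor, record_id: str) -> str:
    # Wrap the blocking call so the event loop stays free to schedule
    # other work (e.g. pulling more messages) while the read runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_read, record_id)


async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The four reads overlap instead of serializing on one thread.
        return await asyncio.gather(*(read_async(pool, str(i)) for i in range(4)))


results = asyncio.run(main())
```

This is only a workaround sketch; a first-class async CRUD API would avoid the executor indirection entirely.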
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
High
Please provide a clear description of problem this feature solves
This feature will support three primary goals:
Describe your ideal solution
Pseudocode
- Data Manager Object
- Management Object
- Base DataRecord class
- Derived in-memory class
- Derived on-disk record class
- Payload Manager Object
This example assumes we're working with DataFrames; that does not have to be the case long term, but it serves to illustrate the process.
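As one possible reading of the record hierarchy listed above, here is a hedged sketch of a base `DataRecord` with in-memory and on-disk derivations. All names and signatures are illustrative assumptions, not the actual design.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class DataRecord(ABC):
    """Hypothetical base record: one stored object keyed by data_object_id."""

    def __init__(self, data_object_id: str):
        self.data_object_id = data_object_id

    @abstractmethod
    def read(self) -> bytes: ...

    @abstractmethod
    def write(self, data: bytes) -> None: ...


class InMemoryRecord(DataRecord):
    """Derived in-memory record: bytes held directly on the object."""

    def __init__(self, data_object_id: str):
        super().__init__(data_object_id)
        self._data = b""

    def read(self) -> bytes:
        return self._data

    def write(self, data: bytes) -> None:
        self._data = data


class OnDiskRecord(DataRecord):
    """Derived on-disk record: bytes backed by a file under a directory."""

    def __init__(self, data_object_id: str, directory: str):
        super().__init__(data_object_id)
        self._path = os.path.join(directory, data_object_id)

    def read(self) -> bytes:
        with open(self._path, "rb") as f:
            return f.read()

    def write(self, data: bytes) -> None:
        with open(self._path, "wb") as f:
            f.write(data)


mem = InMemoryRecord("rec-1")
mem.write(b"hello")

with tempfile.TemporaryDirectory() as tmp:
    disk = OnDiskRecord("rec-2", tmp)
    disk.write(b"world")
    round_trip = disk.read()
```

The value of the shared base class is that a manager can hold a collection of `DataRecord`s and swap storage media per record without changing caller code.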
Additional context
Example use case(s)
Suppose we take our existing ControlMessage object and update it to use our PayloadManager structure.
Single object payload with no backing medium -- we don't need to supply a data_object_id
Single object payload with a backing medium -- here we supply a data_object_id
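The optional-id behavior described above could look like the following sketch: a manager that generates a `data_object_id` when the caller omits one. The class and method names are placeholders, not the proposed API.

```python
import uuid


class PayloadManager:
    """Hypothetical manager: stores payload objects keyed by data_object_id.

    When no id is supplied (the single-object case), one is generated.
    """

    def __init__(self):
        self._records = {}

    def put(self, data, data_object_id=None) -> str:
        # Generate an id for single-object payloads where none is given.
        data_object_id = data_object_id or uuid.uuid4().hex
        self._records[data_object_id] = data
        return data_object_id

    def get(self, data_object_id):
        return self._records[data_object_id]


pm = PayloadManager()
auto_id = pm.put({"x": 1})              # id generated for us
named_id = pm.put({"y": 2}, "obj-001")  # caller-supplied id
```

Returning the id from `put` lets a ControlMessage record which object to fetch later, whether the id was supplied or generated.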