openedx / platform-plugin-aspects

Aspects plugins for the Open edX LMS and Studio

Data handling in dump_data_to_clickhouse #108

Open anfbermudezme opened 2 days ago

anfbermudezme commented 2 days ago

Current Behavior

The dump_data_to_clickhouse method in the Aspects plugin currently queries the database to retrieve the object before sending it to ClickHouse. This approach ensures the process is fully asynchronous and doesn’t block the original request.
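For illustration, the flow I'm describing looks roughly like this; the task name, row shape, table, and URL below are placeholders I made up, not the plugin's actual code:

```python
# Hypothetical sketch of the "send the primary key, re-query in the task"
# pattern described above; names and the target table are placeholders.
import json

import requests
from celery import shared_task
from django.apps import apps

CLICKHOUSE_URL = "http://clickhouse:8123"  # placeholder endpoint


@shared_task
def send_model_to_clickhouse(app_label, model_name, object_id):
    """Re-fetch the object by primary key, serialize it, and insert it."""
    model = apps.get_model(app_label, model_name)
    instance = model.objects.get(pk=object_id)  # the extra query discussed below

    row = {"id": instance.pk, "modified": str(getattr(instance, "modified", ""))}
    requests.post(
        CLICKHOUSE_URL,
        params={"query": "INSERT INTO event_sink.my_table FORMAT JSONEachRow"},
        data=json.dumps(row),
        timeout=10,
    )
```

The caller only hands the task an identifier (e.g. `send_model_to_clickhouse.delay(app_label, model_name, instance.pk)`), which is what keeps the original request non-blocking but also what forces the re-query.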

While this design makes sense to prevent blocking, it introduces some potential challenges:

  1. Incomplete transactions: If the transaction saving the object hasn’t been committed yet, the task might fail because the object isn’t available in the database, leading to inconsistencies.
  2. Additional queries: Every save operation triggers an extra query to the database to fetch the object, even if this data is already accessible when dump_data_to_clickhouse is called. This could result in unnecessary overhead.
  3. Responsibility of data handling: This implementation assigns Aspects the responsibility of ensuring the data is ready and properly serialized, which might not align with its role as a sink.

Points for Consideration

While the current implementation aims to avoid breaking the original request in case of serialization or connectivity issues, I wanted to explore whether there's room for optimization. Here are some ideas to consider:

  1. Retries: Introducing a retry mechanism for the task to handle scenarios where the transaction hasn't committed yet. This would improve consistency while maintaining the current asynchronous behavior.
  2. Directly sending serialized data: Modifying the behavior so dump_data_to_clickhouse accepts serialized data and the target table, rather than querying the database for the object. This would reduce the need for additional queries but would require changes to how sinks are currently integrated.

Questions

  1. Is the current behavior considered a core part of the plugin's design, or is there interest in exploring alternative approaches?
  2. Would the suggested adjustments align with the plugin's goals, and what would be the implications for existing integrations?

The current implementation works, but it may be worth discussing whether adjustments like retries or a shift in responsibility could improve its efficiency and maintainability. Thank you for considering this topic and for the work behind this plugin.

bmtcril commented 2 days ago

Hi @anfbermudezme, thank you for taking the time to write up this really thorough set of thoughts. There is definitely a lot we can do to improve the event sinks. I have always hoped that this implementation would be a temporary step and that we could eventually use the event bus for this; however, adoption of the event bus has been slow, and we don't want to force the community to adopt it just for this purpose.

I'll try to address some of the reasoning behind why things are the way they are below, but would love to continue the conversation and find ways to improve the project.

Potential challenges

Incomplete transactions: If the transaction saving the object hasn’t been committed yet, the task might fail because the object isn’t available in the database, leading to inconsistencies.

We have mitigations for this that delay sending the task until the transaction is committed. This should be the default for all sinks triggered inside MySQL transactions (see here for an example), but it currently is not, so that is one clear way things can be improved. Deferring until commit also prevents the opposite problem, where data arrives in ClickHouse and then the transaction rolls back, leading to inconsistencies.
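In Django terms the mitigation is roughly this shape; a minimal sketch reusing the hypothetical task from the sketch above, not our exact code:

```python
# Minimal sketch: defer the Celery dispatch until the surrounding MySQL
# transaction commits. If the transaction rolls back, the task never fires,
# so nothing stale or missing lands in ClickHouse.
# send_model_to_clickhouse is the hypothetical task from the earlier sketch.
from django.db import transaction
from django.db.models.signals import post_save
from django.dispatch import receiver


@receiver(post_save)  # in practice this would be scoped to specific models
def queue_sink_on_save(sender, instance, **kwargs):
    # Calling .delay() here directly can race the commit: the worker may run
    # before the row is visible. on_commit defers dispatch until it is.
    transaction.on_commit(
        lambda: send_model_to_clickhouse.delay(
            sender._meta.app_label, sender._meta.model_name, instance.pk
        )
    )
```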

Not all sinks are tied directly to MySQL transactions, however, and things such as handling course data on publish can have more complicated failure states as Mongo is involved as well.

Additional queries: Every save operation triggers an extra query to the database to fetch the object, even if this data is already accessible when dump_data_to_clickhouse is called. This could result in unnecessary overhead.

This is true; however, in most cases we're talking about a primary-key lookup in MySQL, which is extremely efficient, so it's generally not a performance concern compared to everything else happening in the LMS. There are two other reasons it happens this way:

1- Some objects are too large to be stored in the Celery task table and would be truncated or lost if we tried to store the full serialized object. We could work around this by storing the serialized data somewhere else, but that would incur a lot of complexity and new race conditions.

2- Speaking of race conditions, the Celery tasks we use to send this data do not have a guaranteed order. By sending only the primary keys we lose some granularity in the ClickHouse data when someone saves multiple times in quick succession, but losing those intermediate rows seems worth it if we are guaranteed to end up with the most recent rows in ClickHouse, instead of potentially having an intermediate row accidentally sent last and overwriting the state with out-of-date data.
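To make that trade-off concrete, here is a toy, framework-free illustration (not plugin code) of the two options when tasks arrive out of order:

```python
# Toy illustration of the ordering argument: two "saves" update a row, and
# the matching tasks are delivered in reverse order. PK-only tasks converge
# on the latest row; serialized payloads let a stale snapshot win.
database = {}    # stands in for MySQL: pk -> row
clickhouse = {}  # stands in for the ClickHouse table: pk -> row


def save(pk, value):
    database[pk] = {"pk": pk, "value": value}


def task_with_pk(pk):
    clickhouse[pk] = dict(database[pk])  # re-read at execution time


def task_with_payload(row):
    clickhouse[row["pk"]] = dict(row)    # frozen snapshot, may be stale


save(1, "first")
snapshot_a = dict(database[1])
save(1, "second")
snapshot_b = dict(database[1])

# Current approach: both tasks re-read, so the final state is always current.
task_with_pk(1)
task_with_pk(1)
assert clickhouse[1]["value"] == "second"

# Serialized payloads delivered out of order: the stale snapshot lands last.
task_with_payload(snapshot_b)
task_with_payload(snapshot_a)
assert clickhouse[1]["value"] == "first"
```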

Responsibility of data handling: This implementation assigns Aspects the responsibility of ensuring the data is ready and properly serialized, which might not align with its role as a sink.

This is true as well, and the separation of concerns is a difficult one to untangle. Since Aspects is not a required or even default component of the Open edX stack, it did not seem appropriate for other systems to have any knowledge of ClickHouse or the opinionated serialization needed to store their data there. Since Aspects controls the ClickHouse schema, it has seemed to make more sense for it to take on the burden of making the serialization choices it needs for efficient storage.
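As a purely hypothetical illustration of that split (not our actual sink API), the sink owns both the target table and the row shape, so the source models never need to know anything about ClickHouse:

```python
# Hypothetical sink: the table name and the opinionated serialization live
# here, next to the ClickHouse schema Aspects controls, keeping the source
# model ClickHouse-agnostic. Field names are placeholders.
import json


class UserProfileSink:
    """Owns the target ClickHouse table and the row shape written to it."""

    clickhouse_table = "event_sink.user_profile"  # placeholder table name

    def serialize(self, instance):
        # Field choices and formats follow the schema the sink itself defines.
        return json.dumps(
            {
                "id": instance.pk,
                "user_id": instance.user_id,
                "name": instance.name,
                "time_last_dumped": str(instance.modified),
            }
        )
```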

Points for Consideration

Retries: Introducing a retry mechanism for the task to handle scenarios where the transaction hasn’t committed yet. This would improve consistency while maintaining the current asynchronous behavior.

I think that for transaction scenarios our best bet is to make sure every sink that can be tied to one is set up to only trigger on commit. Retries in general are definitely desirable. We have shied away from leaning on Celery retries out of concern for scenarios like ClickHouse going down, Celery backing up on retries for many events, and other core services that share Celery with us failing or degrading because of that. This is an area where we could definitely improve, however!
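If we did revisit this, one bounded approach might be backoff-limited retries, so a prolonged ClickHouse outage drops events rather than backing up the queues shared with other services. This is only a sketch with hypothetical names (push_row is a made-up helper):

```python
# Sketch only: exponential backoff with a hard cap on attempts, so a long
# ClickHouse outage does not pile retries onto the shared Celery queues.
import requests
from celery import shared_task
from django.apps import apps


@shared_task(
    bind=True,
    autoretry_for=(requests.RequestException,),  # connectivity failures only
    retry_backoff=True,      # 1s, 2s, 4s, ... between attempts
    retry_backoff_max=300,   # never wait more than five minutes
    retry_jitter=True,
    max_retries=5,           # give up after a handful of attempts
)
def send_model_to_clickhouse(self, app_label, model_name, object_id):
    model = apps.get_model(app_label, model_name)
    try:
        instance = model.objects.get(pk=object_id)
    except model.DoesNotExist:
        # Covers the uncommitted-transaction race with a short, bounded
        # retry instead of retrying forever.
        raise self.retry(countdown=5)
    push_row(instance)  # hypothetical helper doing the actual HTTP insert
```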

Directly sending serialized data: Modifying the behavior so dump_data_to_clickhouse accepts serialized data and the target table, rather than querying the database for the object. This would reduce the need for additional queries but would require changes to how sinks are currently integrated.

I think I covered this above, but am happy to talk about it more.
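For concreteness, the shift being proposed would change the task boundary roughly like this (both signatures are hypothetical, not the plugin's actual API):

```python
# Hypothetical contrast of the two task boundaries under discussion.
from celery import shared_task


@shared_task
def dump_by_id(app_label, model_name, object_id):
    """Current shape: a tiny payload; the row is re-read at execution time,
    so the last task to run always reflects the latest committed state."""
    ...


@shared_task
def dump_serialized(target_table, serialized_row):
    """Proposed shape: no extra query, but the (possibly large) payload is
    stored in the broker and a stale snapshot can arrive last."""
    ...
```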

Questions

Is the current behavior considered a core part of the plugin’s design, or is there interest in exploring alternative approaches?

I'd definitely be happy to explore alternatives to the way things currently work!

Would the suggested adjustments align with the plugin’s goals, and what would be the implications for existing integrations?

I don't think that storing serialized data for every save is something we would want to pursue, for the reasons above. I think that examining ways we could handle retries better, and making sure that all MySQL-based models execute on commit, would be good wins for us.

Many of these challenges may go away in an event bus implementation (and new ones added, no doubt), but we still don't seem very close to that.

I'm happy to continue the conversation here, or do a video call, or even carry this forward in a data working group meeting if you like. Thanks again!