sillsdev / machine

Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages.
MIT License

Clean up inconsistent states in MongoDB #158

Closed johnml1135 closed 4 months ago

johnml1135 commented 9 months ago

MongoDB or the job server can crash at any time, which can leave build states stuck. We should have a periodic job that cleans up these inconsistent states.

johnml1135 commented 7 months ago

Here is a plan:

ddaspit commented 7 months ago

I don't think we can switch all running jobs to pending on Machine startup. For example, a job can be running just fine on ClearML even if Machine restarts. I would really like to avoid adding any gRPC endpoints to deal with this issue. It isn't the responsibility of Serval to deal with the inconsistencies. It should be the sole responsibility of the engine. There are many ways to deal with this issue, depending on how the engine is implemented. I want to give the engine the freedom to handle the inconsistencies in the best way possible.

johnml1135 commented 6 months ago

I updated the original description to reference machine-job and to keep jobs running if it can. Only if that restarts should there really be an issue.

As for another gRPC endpoint, Serval may not know that a job is complete (even though Machine tried to reach it). As far as Machine knows, the job is complete, but when a user asks for the status, Serval doesn't check with Machine; it just returns its own incorrect status. Do you have another way of resolving or syncing these two data sources?

ddaspit commented 6 months ago

The inconsistency can occur because, when a job completes, Machine needs to update the database and also update Serval over gRPC. These two operations need to be atomic, but they aren't. This is a common issue for distributed systems. Luckily, there is a pattern to handle this issue, called the transactional outbox pattern. Basically, we perform the database update and write the gRPC message to a database outbox in a single transaction. There is a separate process that monitors the database outbox and actually sends the message to Serval. This guarantees that the message is sent eventually even if Serval is down.
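
To make the idea concrete, here is a rough sketch of the pattern. It is not Machine's code: it uses Python with SQLite as a stand-in for the real .NET/MongoDB stack, and the table names (`builds`, `outbox`), the `BuildCompleted` message name, and the `send` callback are all hypothetical. The only point is that the state change and the outgoing message commit together, and a separate delivery step retries until Serval accepts the message.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) builds and outbox tables in ONE database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS builds (id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "method TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def complete_build(conn, build_id):
    """Update the build state and queue the message in a single transaction."""
    with conn:  # commits on success, rolls back both writes on any exception
        conn.execute("UPDATE builds SET state = 'Completed' WHERE id = ?", (build_id,))
        conn.execute(
            "INSERT INTO outbox (method, payload) VALUES (?, ?)",
            ("BuildCompleted", json.dumps({"build_id": build_id})),
        )


def deliver_outbox(conn, send):
    """Send queued messages in order; stop at the first failure and retry later.

    `send` stands in for the gRPC call to Serval and is assumed to raise
    ConnectionError when Serval is unreachable.
    """
    delivered = 0
    rows = conn.execute("SELECT id, method, payload FROM outbox ORDER BY id").fetchall()
    for msg_id, method, payload in rows:
        try:
            send(method, json.loads(payload))
        except ConnectionError:
            break  # leave this message and the rest queued for the next run
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
        delivered += 1
    return delivered
```

Running `deliver_outbox` on a timer (or in a background loop) gives at-least-once delivery: if the process dies after `send` succeeds but before the row is deleted, the message is sent again on the next run, so the receiving side has to tolerate duplicates.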

johnml1135 commented 6 months ago

Yes, the transactional outbox looks a bit more elegant. I'll work on implementing it instead of the other gRPC endpoint.

ddaspit commented 6 months ago

I thought about using Hangfire a bit more and realized that it won't work. We want to be able to update the outbox and the model in the same transaction. If we use Hangfire, the outbox would be the Hangfire queue, which is stored in a separate database. So there is no way to update the Machine database and the Hangfire database in a single transaction. I think this means that we will need to implement our own transactional outbox.

ddaspit commented 6 months ago

Here is a sample implementation of the transactional outbox pattern for .NET.