mskcc / smile-server


Properly Handle MetaDB Publishing Failures #24

Closed: n1zea144 closed this issue 3 years ago

n1zea144 commented 3 years ago

How should we handle messages that are published while the NATS server is down? What happens in that case? What are our procedures for making updates to the NATS server, the CMO MetaDB app, etc.? One option is to add our own log file for messages that failed to publish (LIMS): for example, log the request ID to a specific file that is shared with some tool, possibly our standalone publisher tool.
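A minimal sketch of the "log the request ID to a specific file" idea, in Python. Everything here is hypothetical: the file name `failed_requests.log` and the helpers `log_failed_request` / `read_failed_requests` are illustrations, not part of smile-server or any NATS client. The publisher appends a tab-separated timestamp and request ID per line when a publish fails, and a tool (e.g. the standalone publisher) reads the IDs back, de-duplicated, to decide what to re-publish.

```python
import os
from datetime import datetime, timezone

# Hypothetical shared log location; not an existing smile-server setting.
FAILED_REQUESTS_LOG = "failed_requests.log"


def log_failed_request(request_id: str, path: str = FAILED_REQUESTS_LOG) -> None:
    """Append '<UTC timestamp>\t<request id>' as one line to the failure log."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{timestamp}\t{request_id}\n")
        log.flush()
        os.fsync(log.fileno())  # make sure the entry survives a crash right after logging


def read_failed_requests(path: str = FAILED_REQUESTS_LOG) -> list[str]:
    """Return logged request IDs in first-seen order, without duplicates."""
    if not os.path.exists(path):
        return []
    seen: dict[str, None] = {}  # dict preserves insertion order
    with open(path, encoding="utf-8") as log:
        for line in log:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2 and parts[1]:
                seen.setdefault(parts[1], None)
    return list(seen)
```

In the publisher this would sit in the error path: wrap the publish call in `try`/`except` and call `log_failed_request(request_id)` on failure. A plain append-only text file is trivial to share with other tools, at the cost of needing a separate step to mark entries as replayed.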

ao508 commented 3 years ago

Started looking into file stores and SQL stores. These would facilitate recovery if the MetaDB app is down but NATS itself is still running.

The notes below aren't super well organized; I've just copied and pasted them from my Evernote:

What are best practices for handling instances where clients attempt to publish to a NATS server that is not online?

What should we do if the server is down and a client attempts to publish?

How should we recover messages that failed to publish? Store the request IDs somewhere and use the standalone tool? Or should the whole request JSON be stored somewhere and then imported via file loading?
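If the whole request JSON is kept instead of just the ID, replay no longer depends on re-fetching the request from LIMS. A minimal sketch, assuming each request dict carries a `requestId` key (an assumption about the payload), with a hypothetical spool directory and hypothetical helpers `spool_request` and `replay_spooled_requests`; `publish` stands in for whatever actually sends to NATS:

```python
import json
import os
import tempfile
from typing import Callable


def spool_request(request: dict, spool_dir: str) -> str:
    """Save the full request JSON as <requestId>.json in spool_dir and return its path."""
    os.makedirs(spool_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, f"{request['requestId']}.json")  # assumed key
    # Write to a temp file in the same directory, then rename atomically,
    # so a replay never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(request, tmp)
    os.replace(tmp_path, final_path)
    return final_path


def replay_spooled_requests(spool_dir: str, publish: Callable[[dict], None]) -> int:
    """Re-publish spooled requests in file-name order; delete each only after it publishes."""
    if not os.path.isdir(spool_dir):
        return 0
    replayed = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue  # skip in-progress .tmp files
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as f:
            request = json.load(f)
        publish(request)  # if this raises, the file stays for the next run
        os.remove(path)
        replayed += 1
    return replayed
```

Because a file is removed only after `publish` returns, a failure partway through the replay leaves the remaining requests on disk, and re-spooling the same request simply overwrites its file.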

* Might be able to use a file store or SQL store in the event that the connection is lost
* https://docs.nats.io/nats-streaming-server/configuring/persistence
* New published messages will be rejected with an "invalid publish request" error
    * Connection-lost handlers will be notified that the connection is lost
* Acknowledgements
    * Maybe turn the publishing event into a request-reply with the concept of an acknowledgement, such as a simple empty message (no payload)
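
To make the acknowledgement idea concrete without depending on a NATS client library, here is a sketch that simulates request-reply in-process. `deliver` is a hypothetical stand-in for the transport, and `make_subscriber` is a hypothetical helper whose subscriber replies with an empty message as the ack. The publisher waits up to a timeout and treats a missing ack as a failure. This illustrates the pattern only; it is not the NATS API.

```python
import queue
import threading
from typing import Callable

# Hypothetical transport: deliver(payload, reply_to) hands the message to a subscriber,
# which is expected to put an empty ack (b"") on reply_to.
Deliver = Callable[[bytes, queue.Queue], None]


def publish_with_ack(deliver: Deliver, payload: bytes, timeout: float) -> bool:
    """Send payload and wait up to `timeout` seconds for an empty ack message."""
    reply_to: queue.Queue = queue.Queue(maxsize=1)
    try:
        deliver(payload, reply_to)
    except ConnectionError:
        return False  # could not even hand the message off
    try:
        ack = reply_to.get(timeout=timeout)
    except queue.Empty:
        return False  # no ack in time: outcome unknown, treat as failed
    return ack == b""


def make_subscriber(handle: Callable[[bytes], None]) -> Deliver:
    """Simulated subscriber: process the message on a thread, then send an empty ack."""
    def deliver(payload: bytes, reply_to: queue.Queue) -> None:
        def run() -> None:
            handle(payload)
            reply_to.put(b"")  # the ack carries no payload
        threading.Thread(target=run, daemon=True).start()
    return deliver
```

On timeout, the publisher cannot tell whether the message was lost or only the ack was, so a retry may deliver the same request twice; the subscriber side would need to tolerate duplicates (e.g. by keying on the request ID).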

n1zea144 commented 3 years ago

I like the idea of using a store; it is another system to maintain, but maybe unavoidable. I'm trying to avoid a request-reply/synchronous exchange, and does it even address the problem? That is, if the request-reply does not return, don't we still end up in the same situation? Having said that, a request-reply to check that NATS is running before attempting to publish may make for clean handling of the case where NATS is down.
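A lighter alternative to a request-reply health check is to probe the server directly before publishing. This sketch swaps in a plain TCP check rather than request-reply: a NATS server sends an `INFO` line as soon as a client connects, so the probe connects to the host and port (4222 is the NATS default client port) and looks for that greeting. `publish_or_log` and its `publish`/`log_failure` callables are hypothetical. The probe only shows that the core NATS server is accepting connections, not that the streaming layer or the CMO MetaDB app is healthy.

```python
import socket
from typing import Callable

NATS_DEFAULT_PORT = 4222  # default NATS client port


def nats_is_reachable(host: str = "localhost", port: int = NATS_DEFAULT_PORT,
                      timeout: float = 1.0) -> bool:
    """True if host:port accepts a TCP connection and greets with a NATS INFO line."""
    try:
        # create_connection applies `timeout` to the connect and to later reads.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with sock.makefile("rb") as stream:
                greeting = stream.readline(4096)
    except OSError:  # refused, unreachable, DNS failure, or read timeout
        return False
    return greeting.startswith(b"INFO ")


def publish_or_log(publish: Callable[[bytes], None], request_id: str, payload: bytes,
                   log_failure: Callable[[str], None], host: str = "localhost",
                   port: int = NATS_DEFAULT_PORT) -> bool:
    """Publish if the server looks up; otherwise (or if publishing fails) log the request ID."""
    if nats_is_reachable(host, port):
        try:
            publish(payload)
            return True
        except OSError:  # ConnectionError and friends; other errors propagate
            pass
    log_failure(request_id)
    return False
```

The server can still go down between the check and the publish, so the check doesn't replace the failure path; it just lets the common "NATS is down" case go straight to logging without waiting on a publish that will fail.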