kevinkreiser / prime_server

non-blocking (web)server API for distributed computing and SOA based on zeromq
Other
59 stars 26 forks source link

Fix disappearing HTTP requests with the push/pull pipeline pattern in ZeroMQ #134

Closed raviqqe closed 3 months ago

raviqqe commented 3 months ago

This PR prevents HTTP requests to be absorbed into http_server_t by replacing pub/sub sockets with push/pull ones for HTTP server loopback on, for example, the initial launch of microservices built with prime_server.

This change should also allow us to remove the following fast-fail directives used in flaky tests in Valhalla. At least, I've never seen these tests' random failures after this change on my local machine and Circle CI.

https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/loki_service.cc#L611

https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/skadi_service.cc#L245

Questions

I'm not familiar with all the use cases of prime_server. In my understanding, now that we use a different type of sockets for HTTP request/response loopback, we can't send messages from random clients but only from workers with loopabck sockets properly configured to the HTTP server's loopback socket. Is that fine?

kevinkreiser commented 3 months ago

to be completely honest i never much cared about this problem. my tactic has always been to start processes in such as way as to avoid this issue cropping up. in practice the http process never crashes (ive not seen it) so as long as its up before workers start completing work you dont miss anything. this works pretty nicely in eg kubernetes and other similar systems in that we have added a status endpoint that a system should wait until it returns 200 before sending live traffic to the http server. since the workers and server have to be up and running to reply to /status you wont miss any responses that you care about if you dont send traffic until that endpoint returns 200.

anyway im not opposed to this if we are sure there aren't any drawbacks. the one you noted about not being able to recv from any random person publishing to the loopback is not something we need to worry about, i cant see how that would be a useful feature. however its been a long time since i read all the zmq documentation and i feel like i should familiarize myself again before we change the socket type. truthfully i wish i could remember why i didnt just use a push pull configuration from the beginning and opted to put that TODO in there about trying it. it was just too long ago to pull out of my memory..

raviqqe commented 3 months ago

@kevinkreiser Thank you for your insights! I'm gonna see if someone at @mapbox can help reviewing this PR.

kevinkreiser commented 3 months ago

i'll review it, no one at mapbox has push rights to this repo anyway

raviqqe commented 3 months ago

Thank you for your review! Should we do the same for interrupt too?

kevinkreiser commented 3 months ago

Thank you for your review! Should we do the same for interrupt too?

definitely not. the interrupt socket is used to cancel work when the client closes the connection before receiving a response. to cancel the work the server then needs to use the interupt socket to signal any workers to give up on the work if they are indeed working on the job that has been canceled. if you switched it from pub/sub (server/worker) to push/pull then only one of the workers would get the signal and we cannot gaurantee which worker it will be. maybe worker A is working on the request that is to be canceled but worker B is the one who gets the interrupt signal. in this case worker B would just ignore it because its not working on that job at the moment. in otherwords we need the broadcast functionality of the pub/sub relationship for this feature to work proprly.

kevinkreiser commented 3 months ago

macos has been broken for a little while and i wasnt able to fix it so ignoring that and merging anyway

raviqqe commented 3 months ago

Thank you so much for the maintenance work and detailed explanation! :partying_face: