Closed raviqqe closed 3 months ago
to be completely honest i never much cared about this problem. my tactic has always been to start processes in such a way as to avoid this issue cropping up. in practice the http process never crashes (ive not seen it), so as long as its up before workers start completing work you dont miss anything. this works pretty nicely in eg kubernetes and other similar systems: we have added a /status endpoint, and a system should wait until it returns 200 before sending live traffic to the http server. since the workers and server both have to be up and running to reply to /status, you wont miss any responses you care about if you dont send traffic until that endpoint returns 200.
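to make that gating concrete, here's a tiny sketch of a readiness poll in python. this is just an illustration of the pattern, not prime_server code; the endpoint name and timings are assumptions:

```python
# illustration only: block until a (hypothetical) /status endpoint
# answers 200 before routing live traffic to the server
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout=30.0, interval=0.5):
    """poll `url` until it returns 200, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    raise TimeoutError(f"{url} never returned 200")
```

an orchestrator (or a test harness) would call `wait_until_ready("http://host:port/status")` before sending any real requests, which is essentially what the kubernetes readiness probe does for you.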
anyway im not opposed to this if we are sure there aren't any drawbacks. the one you noted, about not being able to recv from any random person publishing to the loopback, is not something we need to worry about; i cant see how that would be a useful feature. however its been a long time since i read all the zmq documentation and i feel like i should familiarize myself again before we change the socket type. truthfully i wish i could remember why i didnt just use a push/pull configuration from the beginning and opted to put that TODO in there about trying it. it was just too long ago to pull out of my memory...
@kevinkreiser Thank you for your insights! I'm gonna see if someone at @mapbox can help reviewing this PR.
i'll review it, no one at mapbox has push rights to this repo anyway
Thank you for your review! Should we do the same for interrupt too?
definitely not. the interrupt socket is used to cancel work when the client closes the connection before receiving a response. to cancel the work the server then needs to use the interrupt socket to signal any workers to give up on the work if they are indeed working on the job that has been canceled. if you switched it from pub/sub (server/worker) to push/pull then only one of the workers would get the signal and we cannot guarantee which worker it will be. maybe worker A is working on the request that is to be canceled but worker B is the one who gets the interrupt signal. in this case worker B would just ignore it because its not working on that job at the moment. in other words we need the broadcast functionality of the pub/sub relationship for this feature to work properly.
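to illustrate, here's a toy in-process sketch (plain python, not libzmq, names made up) of the delivery difference: push hands the cancel signal to exactly one worker chosen round-robin, while pub broadcasts it to every worker so whichever one actually holds the job can act on it.

```python
# toy sketch, not libzmq: contrast PUSH (one receiver, round-robin)
# with PUB (every subscriber) for delivering a cancel signal
inboxes = {"worker_a": [], "worker_b": []}
_rr = 0


def push_send(msg):
    # PUSH/PULL: exactly one connected worker receives the message
    global _rr
    targets = list(inboxes.values())
    targets[_rr % len(targets)].append(msg)
    _rr += 1


def pub_send(msg):
    # PUB/SUB: every subscribed worker receives the message
    for inbox in inboxes.values():
        inbox.append(msg)


push_send("cancel job-42")
# only worker_a heard it; if worker_b is running job-42 the cancel is lost
print(inboxes)  # {'worker_a': ['cancel job-42'], 'worker_b': []}

for inbox in inboxes.values():
    inbox.clear()

pub_send("cancel job-42")
# both workers heard it; whichever holds job-42 can abandon the work
print(inboxes)  # {'worker_a': ['cancel job-42'], 'worker_b': ['cancel job-42']}
```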
macos has been broken for a little while and i wasnt able to fix it so ignoring that and merging anyway
Thank you so much for the maintenance work and detailed explanation! :partying_face:
This PR prevents HTTP requests from being absorbed into `http_server_t` by replacing pub/sub sockets with push/pull ones for the HTTP server loopback on, for example, the initial launch of microservices built with `prime_server`.

This change should also allow us to remove the following fast-fail directives used in flaky tests in Valhalla. At least, I've never seen those tests fail randomly after this change, either on my local machine or on Circle CI.
https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/loki_service.cc#L611
https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/skadi_service.cc#L245
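To illustrate the behavior difference behind this change, here is a toy in-process sketch (plain Python, not libzmq; the class names are made up) of the relevant ZeroMQ semantics: a pub socket silently drops messages sent before any subscriber has connected (the "slow joiner" case), while a push socket can hold them until a puller appears.

```python
# toy sketch of ZeroMQ delivery semantics, not real libzmq:
# PUB drops messages when no subscriber is connected yet,
# PUSH queues them until a puller connects
from collections import deque


class PubSocket:
    """broadcasts to current subscribers; with none connected, messages are lost"""

    def __init__(self):
        self.subscribers = []

    def send(self, msg):
        for sub in self.subscribers:  # slow joiners miss this entirely
            sub.append(msg)


class PushSocket:
    """round-robins to connected pullers, buffering while none are connected"""

    def __init__(self):
        self.pullers = []
        self.backlog = deque()
        self._next = 0

    def send(self, msg):
        if not self.pullers:
            self.backlog.append(msg)  # held, not dropped
            return
        self.pullers[self._next % len(self.pullers)].append(msg)
        self._next += 1

    def connect(self, puller):
        self.pullers.append(puller)
        while self.backlog:  # flush queued messages to the new puller
            self.send(self.backlog.popleft())


pub, push = PubSocket(), PushSocket()
pub.send("request-1")  # sent before anyone is listening
push.send("request-1")

sub_inbox, pull_inbox = [], []
pub.subscribers.append(sub_inbox)
push.connect(pull_inbox)

print(sub_inbox)   # [] -- the early message was absorbed
print(pull_inbox)  # ['request-1'] -- the early message was delivered
```

(Real ZeroMQ push sockets queue up to a high-water mark rather than unboundedly, but the contrast in who receives an early message is the point here.)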
Questions
I'm not familiar with all the use cases of `prime_server`. In my understanding, now that we use a different type of socket for the HTTP request/response loopback, we can't receive messages from random clients, but only from workers whose loopback sockets are properly connected to the HTTP server's loopback socket. Is that fine?