Closed Theodlz closed 1 month ago
@nabeelre just so you know, that issue is being addressed.
We might want to keep using a shorter timeout though (maybe 10 s) to try to minimize these concurrency issues when 2 alerts for the same object are being triggered on at the same time.
Create a no-retry version of the api_skyportal method, to avoid running into concurrency issues when the session retries sending a request to SkyPortal (because it took longer than the specified timeout to respond to the client) while the initial one is still being processed. Retrying is fine and can safely happen for pretty much everything, except follow-up requests.
If we tell SkyPortal to trigger an instrument and, after 5 seconds without a response (the default timeout), decide to resend that request while SkyPortal is still processing the first one (and may be almost done), we might end up with 2 requests being processed at the same time and create duplicates.
With this PR, we can use much longer timeouts (here we try 30 seconds) while avoiding any retries when sending a follow-up request.
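To illustrate the difference, here is a minimal, standard-library-only sketch rather than the actual Kowalski code. The names `call_api`, `call_api_no_retry`, and `SlowServer`, the retry count of 3, and the 8-second processing time are all assumptions made up for the example; only the 5 s default and 30 s timeouts come from the description above.

```python
from typing import Callable


def call_api(send: Callable[[float], dict], timeout: float = 5.0, max_retries: int = 3) -> dict:
    """Call send(timeout), retrying on TimeoutError up to max_retries times (hypothetical helper)."""
    attempts = max_retries + 1
    for attempt in range(attempts):
        try:
            return send(timeout)
        except TimeoutError:
            # Re-raise only once every allowed attempt has been used.
            if attempt == attempts - 1:
                raise


def call_api_no_retry(send: Callable[[float], dict], timeout: float = 30.0) -> dict:
    """Same call, but with a longer timeout and no retries (for follow-up requests)."""
    return call_api(send, timeout=timeout, max_retries=0)


class SlowServer:
    """Fake server that needs processing_time seconds and counts every request it receives."""

    def __init__(self, processing_time: float):
        self.processing_time = processing_time
        self.received = 0

    def handle(self, timeout: float) -> dict:
        # The server starts working on the request even if the client later gives up.
        self.received += 1
        if self.processing_time > timeout:
            raise TimeoutError("client gave up waiting")
        return {"status": "success"}


if __name__ == "__main__":
    retrying = SlowServer(processing_time=8.0)
    try:
        call_api(retrying.handle, timeout=5.0, max_retries=3)
    except TimeoutError:
        pass
    print("requests received with retries:", retrying.received)

    single = SlowServer(processing_time=8.0)
    call_api_no_retry(single.handle, timeout=30.0)
    print("requests received without retries:", single.received)
```

In this toy setup the retrying client makes the server start processing the same follow-up request several times, while the no-retry client with the longer timeout sends it exactly once.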
PS: We can still run into a concurrency issue of course (where 2 alerts of the same object get processed at the same time, and worker B tries to trigger on alert 2 at the same time as worker A is triggering on alert 1). We already have logic in SkyPortal to avoid that, but if the distant server that SkyPortal is sending the request to takes too long to answer, it becomes a risk. In a future set of SkyPortal+Kowalski+Fritz PRs, we can consider posting the request in the DB in a "processing" state as soon as possible, so that even before we start waiting for a distant server to answer, other processes can know that we are actively trying to send an identical request and that they should cancel sending anything.
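The "processing" state idea could look roughly like the sketch below. This is only an illustration: an in-memory dict guarded by a lock stands in for the database, and `RequestRegistry`, `try_claim`, `trigger_followup`, and the state names other than "processing" are hypothetical, not existing SkyPortal or Kowalski APIs.

```python
import threading
from typing import Callable, Dict, Tuple

Key = Tuple[str, int]  # (object id, instrument id), an assumed way to identify a request


class RequestRegistry:
    """In-memory stand-in for a DB table of follow-up requests and their states."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._states: Dict[Key, str] = {}

    def try_claim(self, key: Key) -> bool:
        """Atomically record the request as "processing"; return False if it already exists."""
        with self._lock:
            if key in self._states:
                return False
            self._states[key] = "processing"
            return True

    def mark(self, key: Key, state: str) -> None:
        with self._lock:
            self._states[key] = state

    def release(self, key: Key) -> None:
        with self._lock:
            self._states.pop(key, None)


def trigger_followup(registry: RequestRegistry, obj_id: str, instrument_id: int,
                     send: Callable[[], None]) -> str:
    """Claim the request before contacting the distant server; skip if already claimed."""
    key = (obj_id, instrument_id)
    if not registry.try_claim(key):
        return "skipped"
    try:
        send()  # potentially slow call to the distant server
    except Exception:
        registry.release(key)  # assumed policy: free the claim so a later alert can try again
        raise
    registry.mark(key, "submitted")
    return "submitted"
```

Because the claim is recorded before the slow call starts, a second worker handling another alert for the same object sees the "processing" entry and cancels instead of sending an identical request.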