oda-hub / dispatcher-app


live tests generate identical non-aliased job ids #710

Closed by volodymyrss 1 month ago

volodymyrss commented 2 months ago

See details in https://cdci.sentry.io/issues/5791157252/?project=1467382&query=is%3Aunresolved&referrer=issue-stream&statsPeriod=24h&stream_index=0

@burnout87 could you please have a look?

burnout87 commented 2 months ago

This already happened some time ago, and the cause has never been clear.

burnout87 commented 2 months ago

I inspected this job: in both instances the job is "done", but two non-aliased scratch dirs have been created.

volodymyrss commented 2 months ago

I suppose it could be a race condition between checking that the directory exists and creating it. Do the inspected directories have creation times close to each other?
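
To make the suspected race concrete, here is a minimal sketch. It is not the dispatcher's actual code: the function names and the scratch-dir path are hypothetical. The first function separates the existence check from the creation, so two processes can both pass the check before either creates the directory. The second relies on `os.mkdir`, which either creates the directory or raises `FileExistsError`, so only one caller can ever be the creator.

```python
import os


def create_scratch_dir_racy(path):
    """Hypothetical check-then-create pattern (not the dispatcher's code)."""
    # Window for the race: another process may create `path`
    # after this check and before the makedirs call below.
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)
        return True   # this caller believes it created the directory
    return False


def create_scratch_dir_atomic(path):
    """Single-step creation: exactly one caller gets True for a given path."""
    try:
        os.mkdir(path)  # fails if the directory already exists
        return True
    except FileExistsError:
        return False
```

With the atomic variant, the caller that gets `False` knows another request already owns the directory and can fall back to whatever the intended handling is (for example, creating an aliased directory).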

burnout87 commented 2 months ago

One is 1725364545.0859604 and the other is 1725364544.7069209. So yes, very close (about 0.38 s apart).

burnout87 commented 1 month ago

It happened again https://cdci.sentry.io/issues/5935424353/?notification_uuid=085601ab-b6ac-443a-b91e-b00c3f7d038b&project=1467382

And this time too, it looks like a consequence of the same type of race condition.

volodymyrss commented 1 month ago

Yeah, thanks for following up on it.

We need a way for the dispatcher to safely recover from this situation. Could you propose something?

burnout87 commented 1 month ago

One approach that could help is a retry mechanism, to handle the situation more "gracefully".

Alternatively, we could implement a lock-based approach: a lock file would ensure that only one process can create a directory at a time.

I just did some research and found that the standard-library fcntl module can be used for file locking.

What do you think?
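
A rough sketch of the lock-file idea, assuming a Unix host (fcntl is not available on Windows). The names `directory_creation_lock` and `create_dir_once` and the lock-file path are hypothetical and only illustrate the pattern; they are not taken from the dispatcher code. `fcntl.flock` takes an advisory exclusive lock, so it only protects against processes that use the same lock file.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def directory_creation_lock(lock_path):
    """Hold an exclusive advisory lock on `lock_path` for the with-block."""
    # The lock file carries no data; it only serves as a lock target.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def create_dir_once(path, lock_path):
    """Return True if this call created `path`, False if it already existed."""
    with directory_creation_lock(lock_path):
        # The check and the creation now happen under the same lock,
        # so no other cooperating process can slip in between them.
        if os.path.isdir(path):
            return False
        os.makedirs(path)
        return True
```

A caller would then do something like `create_dir_once(scratch_dir, scratch_dir + ".lock")` and treat a `False` result as "another request got there first". How advisory locks behave on network filesystems depends on the setup, so that would need checking for the deployment.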

burnout87 commented 1 month ago

Actually, I think a lock-based approach is going to be more effective.

volodymyrss commented 1 month ago

Lock sounds good to me, thanks.