thegenemyers / DALIGNER

Find all significant local alignments between reads

Race condition (probably) due to temp files #90

Closed a-ludi closed 4 years ago

a-ludi commented 4 years ago

A race condition occurs when I run several instances of daligner in parallel on a cluster. As a result, some of the .las files get corrupted (with various defects).

IMHO, the only explanation for this is the temporary files, because the instances share no resource other than the file system. I could not track down the bug, but here is what I would suggest in an attempt to fix it:

Use mkstemp instead of PID for single temporary files if any. For the merge step I would suggest using mkdtemp instead and placing the intermediate files below that directory. This makes the handling easier. Should this not fix the bug, we can be fairly sure it is not related to files.
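As a rough illustration of the suggestion above, here is a minimal sketch using Python's `tempfile` module, which provides `mkstemp` and `mkdtemp` functions with the same purpose as the C library calls. It is not daligner code; the helper name `make_private_workspace` and the `daligner.` prefix are made up for this example.

```python
import os
import shutil
import tempfile


def make_private_workspace(parent=None):
    """Create a uniquely named scratch directory with one temp file in it.

    Hypothetical helper for illustration; daligner itself is written in C.
    """
    # mkdtemp creates a new directory with a random suffix, readable only by
    # the current user, so two instances never get the same directory.
    workdir = tempfile.mkdtemp(prefix="daligner.", dir=parent)
    # mkstemp creates and opens a new file; it returns (descriptor, path).
    fd, path = tempfile.mkstemp(suffix=".tmp", dir=workdir)
    os.close(fd)
    return workdir, path


def remove_workspace(workdir):
    """Delete the scratch directory and everything below it."""
    shutil.rmtree(workdir)
```

Keeping all intermediate files below one directory means cleanup is a single recursive delete, which is the "easier handling" mentioned above.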

gt1 commented 4 years ago

The mkstemp and mkdtemp functions may work great on a local file system, but as you are on a cluster and see collisions, they may not be a good choice. These functions require some form of file system locks, which are often not well supported on network file systems. If you want to avoid file name collisions, I would suggest using the -P switch to set a suitable, unique temporary file location for each daligner instance.

a-ludi commented 4 years ago

This is definitely an important hint. I will try using -P to fix the race condition.

However, the race condition seems to occur between instances of daligner on the same node, meaning that the temporary files are stored locally. Thus, mkstemp and mkdtemp should work perfectly. At the same time, using PIDs locally should also work perfectly as far as I know.

In any case, I believe there is no better option (without explicit user interactions) to avoid race conditions with temporary files than these functions.

thegenemyers commented 4 years ago

Arne,

Daligner makes the temporary file from the block number and thread number of the process using the file. So I do not see how the same file names can be used by different processes. The only way this would break is if you are running jobs involving the same blocks on the same node. Please advise.

-- Gene

a-ludi commented 4 years ago

I (probably) found the bug in my own code:

I called daligner in symmetric mode (without -A) on a single database, but I also called it for both directions of each block pair. This leads to a race condition on the output files, e.g.:

daligner database.1 database.2
# -> database.1.database.2.las
# -> database.2.database.1.las

daligner database.2 database.1
# -> database.2.database.1.las
# -> database.1.database.2.las

This causes corrupted output when both instances start writing output to the same file.

The fix is, of course, to call daligner for one direction only!
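To make the collision concrete, here is a small Python sketch that models the output naming shown above and checks a job list for files written by more than one job. The function names are made up for this example, and treating a block compared against itself as producing a single file is an assumption, not something confirmed in this thread.

```python
from collections import Counter
from itertools import combinations_with_replacement, product


def outputs(a, b):
    """Files written by `daligner a b` in symmetric mode (without -A).

    Modeled on the example in this thread; for a == b the set holds one
    name (an assumption about the self-comparison case).
    """
    return {f"{a}.{b}.las", f"{b}.{a}.las"}


def colliding_outputs(jobs):
    """Return the output files that more than one job would write."""
    counts = Counter(name for a, b in jobs for name in outputs(a, b))
    return sorted(name for name, n in counts.items() if n > 1)


blocks = ["database.1", "database.2"]

# Buggy job list: every ordered pair, so both directions are submitted.
both_directions = list(product(blocks, repeat=2))

# Fixed job list: each unordered pair once.
one_direction = list(combinations_with_replacement(blocks, 2))

print(colliding_outputs(both_directions))
print(colliding_outputs(one_direction))
```

Running such a check on the generated job list before submitting it to the cluster would flag the two shared .las files in the first case and none in the second.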

thegenemyers commented 4 years ago

Very good, glad you found it. -- G
