thegenemyers / DALIGNER

Find all significant local alignments between reads
Other
138 stars 61 forks source link

Writing in /tmp #80

Closed pb-cdunn closed 6 years ago

pb-cdunn commented 6 years ago

Three problems with writing temporary files into /tmp:

  1. File names can collide.
    • You probably could use something like https://linux.die.net/man/3/mkdtemp, even when the user supplies -P. But remember to rmdir too.
  2. Daligner calls system() without checking return values.
    • The result is that problems from (1) can be unnoticed.
  3. Errors should go to stderr, not stdout.
    • I haven't verified this one myself, but that's what I'm told. It made (2) hard to debug.
thegenemyers commented 6 years ago

Chris,

The purpose of writing files to "/tmp" is so that on an HPC cluster, 

these files and their subsequent sort and merge do not require the transmission of said files across the cluster switch fabric to the distributed file system (DFS), their creation there, etc. That is, it is designed to reduced traffic between the nodes and the central DFS, and not in any way to "hide them" as true temporary files.

 Recall in an earlier version, these individual thread files were 

explicitly part of the process, but the observation was that the files were too small and too numerous and for many HPC DFS servers this caused problems because the DFS master was getting overwhelmed. On our systems, the local file store associated with each node is "/tmp" and this is true of many such systems. For those for which it is not true you can set the correct location with -P.

 The only way for file names to collide is for someone to use the 

same root name for the databases in different file directories, e.g. something dumb like "DB". For a given DB, all the intermediate file names are unique with respect to each other as the involving distinct combinations of block and thread numbers. So I have no idea how someone created 1. unless they used the same name for two distinct DB's -- I don't think that is a particularly safe thing to do in general for lots of reasons. That said, I would agree that 1 and 2 are worth doing, especially 2. As for errors going to stderr, I reviewed all the code and could not find such an instance. -v output goes to stdout but I don't think this is what is being talked about and it is where I want such output to go. So whomever reported that an error didn't go to stderr, please have them substantiate it.

I'm going to release a significant upgrade to all the modules in the 

next week or two and that update will have 1 and 2 in them.

Thanks,
-- Gene

On 4/17/18, 7:05 PM, Christopher Dunn wrote:

Three problems with writing temporary files into |/tmp|:

  1. File names can collide.
  2. Daligner calls |system()| without checking return values.
    • The result is that problems from (1) can be unnoticed.
  3. Errors should go to stderr, not stdout.
    • I haven't verified this one myself, but that's what I'm told. It made (2) hard to debug.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/80, or mute the thread https://github.com/notifications/unsubscribe-auth/AGkkNoOQgUBTzH653IpROp98IEeiH8t5ks5tpiDVgaJpZM4TYszv.

pb-cdunn commented 6 years ago

Sounds good, Gene.

Please note that the system-call check is needed in DAMASKER too. (Internally, we let DAMASKER depend on DALIGNER so LAsort and LAmerge become available.)

The only way for file names to collide is for someone to use the same root name for the databases in different file directories...

This happened in our Iso-Seq workflow, not Falcon. I think they do something multiple times. Renaming the DB might be a fair work-around, at least if they use symlinks.

As for errors going to stderr, I reviewed all the code and could not find such an instance.

Thanks for checking. I suspected that was a user error.

thegenemyers commented 6 years ago

Hi Chris,

I have modified daligner so it creates a temporary directory at -P 

that uses the process id of the job in it so as to be unique. All the intermediate files are then placed in that directory, and care is taken to remove everything, even if an error occurs mid-process.

I have also similarly  modified datander and damapper, which is what 

I think you are referring to when you say "DAMASKER".

These changes and several others that simplify the HPC scripts will 

be checked in to github soon -- I am having Martin give it a shake down to make sure no unexpected bugs have been introduced before doing so.

-- Gene

On 4/21/18, 4:47 PM, Christopher Dunn wrote:

Sounds good, Gene.

Please note that the system-call check is needed in DAMASKER too. (Internally, we let DAMASKER depend on DALIGNER so LAsort and LAmerge become available.)

The only way for file names to collide is for someone to use the
same root name for the databases in different file directories...

This happened in our Iso-Seq workflow, not Falcon. I think they do something multiple times. Renaming the DB might be a fair work-around, at least if they use symlinks.

As for errors going to stderr, I reviewed all the code and could
not find such an instance.

Thanks for checking. I suspected that was a user error.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/80#issuecomment-383302503, or mute the thread https://github.com/notifications/unsubscribe-auth/AGkkNi6_8PAM8ttaLY0aMzDgaogXCVL6ks5tq0ZxgaJpZM4TYszv.