harry-thorpe / piggy

Pipeline for analysing intergenic regions in bacteria
GNU General Public License v3.0

Large numbers of temporary files created #8

Closed johnlees closed 4 years ago

johnlees commented 7 years ago

Hi Harry, I got a run on 96 isolates with ~2000 IGRs each to work, and got some nice output!

However, when I tried it on a larger set of ~1800 isolates I ran into an issue: a temporary file is created for every unaligned IGR, which comes to around 4×10⁶ files. Our file system unfortunately can't cope with that many. Would it be possible to create fewer files and index them, or even keep the IGRs in memory rather than writing them to disk?

Thanks for your help, John

harry-thorpe commented 7 years ago

Hi John,

Glad it worked on the smaller dataset!

I know it has been run on up to 1000 isolates without problems, but yes, now that you have pointed it out, it is a bit of a limitation (and probably very bad design!). Unfortunately it does create a file for each IGR (I thought putting them in separate folders would protect against this, but obviously not). I'll look into a solution that produces fewer files, but it may take a bit of time.

Thanks,

Harry

harry-thorpe commented 7 years ago

Hi John,

I have written a quick fix for this (v1.1). I now create a file per isolate (rather than per IGR), which hugely reduces the number of files; I then search through that file to find the IGR of interest. This is obviously not efficient, so it is a bit slower. I will sort out a better solution at some point, but I am on holiday at the moment and wanted it to be usable! Could you try it again on the larger dataset and let me know how it goes?
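For anyone curious, the per-isolate lookup amounts to a linear scan over each isolate's multi-FASTA file. A minimal sketch of that approach in Perl (hypothetical file and IGR names, not the actual piggy code):

```perl
use strict;
use warnings;

# Look up one IGR sequence in a per-isolate multi-FASTA file by
# scanning from the top until the matching header is found.
sub get_igr_seq {
    my ($isolate_file, $igr_id) = @_;

    open my $fh, '<', $isolate_file
        or die "Cannot open $isolate_file: $!";

    my $in_record = 0;
    my $seq       = '';
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            last if $in_record;            # hit the next record; stop
            $in_record = ($1 eq $igr_id);  # start capturing on a match
        }
        elsif ($in_record) {
            $seq .= $line;
        }
    }
    close $fh;

    return $seq;  # empty string if the IGR is absent
}

# Hypothetical names for illustration only.
my $seq = get_igr_seq('isolate_01_igrs.fasta', 'IGR_0042');
```

Every call re-reads the file from the start, so each lookup costs O(file size); with thousands of IGRs per isolate that adds up, which is why this is slower than the old one-file-per-IGR scheme.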

Thanks,

Harry

johnlees commented 7 years ago

Sorry for the slow reply. We're currently having an issue with the file system that's stopping me from testing, but just to let you know, I'm getting a warning message at the start:

"my" variable $dh masks earlier declaration in same scope at /nfs/users/nfs_j/jl11/installations/piggy/piggy line 194.

I'll get back to you when I can run on the full set.

Cheers, John

harry-thorpe commented 7 years ago

Thanks John. I've fixed this now.

johnlees commented 7 years ago

Looks like it's working on the big set now using about 70000 files (~50 times fewer than before). Thanks for the quick fix!

harry-thorpe commented 6 years ago

Hi John,

Did this fix work for you on many genomes? I realised I only tested it on a few, and it was super slow when I went to run it on ~1000. I have now improved the fix in the latest version and it is much faster.
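The slow part of v1.1 was rescanning each isolate's file for every IGR. One way to avoid that (a hypothetical sketch of the general technique, not lifted from the actual piggy code) is to index the byte offset of each record once, then seek straight to it on each lookup:

```perl
use strict;
use warnings;

# Pass over the file once, recording the byte offset of each header.
sub index_igr_offsets {
    my ($isolate_file) = @_;
    my %offset;

    open my $fh, '<', $isolate_file
        or die "Cannot open $isolate_file: $!";

    my $pos = tell $fh;
    while (my $line = <$fh>) {
        $offset{$1} = $pos if $line =~ /^>(\S+)/;
        $pos = tell $fh;
    }
    close $fh;

    return \%offset;  # igr_id => offset of its header line
}

# Seek straight to the indexed record instead of rescanning the file.
sub get_igr_seq {
    my ($isolate_file, $offsets, $igr_id) = @_;
    return '' unless exists $offsets->{$igr_id};

    open my $fh, '<', $isolate_file
        or die "Cannot open $isolate_file: $!";
    seek $fh, $offsets->{$igr_id}, 0;

    <$fh>;  # consume the header line itself
    my $seq = '';
    while (my $line = <$fh>) {
        last if $line =~ /^>/;  # start of the next record
        chomp $line;
        $seq .= $line;
    }
    close $fh;

    return $seq;
}

# Hypothetical names for illustration only: one O(file size) indexing
# pass per isolate, then each lookup touches only the record it needs.
my $offsets = index_igr_offsets('isolate_01_igrs.fasta');
my $seq     = get_igr_seq('isolate_01_igrs.fasta', $offsets, 'IGR_0042');
```

An approach along these lines keeps the small file count of v1.1 while making each lookup cost only the size of the record it fetches.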