mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

How are results retrieved from workers #187

Open statquant opened 6 years ago

statquant commented 6 years ago

Hello, I came across this package and it seems fabulous. I have a question; if this is not the right channel, please let me know and I'll delete it. I use Slurm and have been using rslurm. The problem I found is that results on each worker are stored as an RData file and reloaded on the master. Even after changing this to fst files it is problematic and slow. Can I ask whether batchtools also uses files to retrieve results? Many thanks.

jakob-r commented 6 years ago

There has to be a commonly shared directory. All results (the return value of your algorithm function) are saved there, so when everything is finished the results are already in the shared directory, provided you stick to the batchtools conventions. If your results are too big, you should consider reducing what you return from the algorithm function. For example, you should not return any data, job, or instance variables, because they can be retrieved later, avoiding duplication. If the result cannot be reduced, then you simply have to deal with the big files.
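A minimal sketch of that convention, using the batchtools API (the `file.dir` path and the toy job function are illustrative):

```r
library(batchtools)

# All job state and results live under this directory, which must be
# visible to both the master and the workers (e.g. NFS on a Slurm cluster).
reg = makeRegistry(file.dir = "/shared/my-registry", seed = 1)

# Each job returns only the small summary we need, not the input data.
batchMap(function(n) {
  x = rnorm(n)
  data.frame(n = n, mean = mean(x), sd = sd(x))
}, n = c(1e4, 1e5))

submitJobs()
waitForJobs()

# The master reads the result files back from the shared directory.
results = reduceResultsList()
```

Because every result is a file in the registry's directory, retrieval works even if the master R session was restarted in between: a fresh session can call `loadRegistry("/shared/my-registry")` and collect the results.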

mllg commented 6 years ago

I'm not sure if I completely understood your question. On most systems there is only one way to communicate the results back to the master: the file system.

RData is arguably not the most efficient format for many results, but at least it works for arbitrary objects. If you can provide more details on what exactly should be communicated back in your use case, I can try to find a better solution.

statquant commented 6 years ago

Hello, many thanks for your answers @mllg.

My use case: I use a Slurm cluster on Linux (CentOS 7) boxes. Typically the output of the function sent to the workers is a data.table.

RData: it indeed works for all objects, as does rds (I would have expected the latter to be faster to load if used without compression, but I might be wrong). It is still very slow when it comes to data.frames (or data.tables). I have "overridden" the storage format to use fst binary files and gained a lot, and given that data.frames are likely to be the most common output, I think it would make sense to offer this as an easily accessible option.

What I was wishing for: I was kind of hoping that the results would be sent back to the master over the network through some flavor of IPC.
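One way to approximate the override described here is to write the table to an fst file on the worker and return only the path; `fst::write_fst()`, its `compress` argument (0–100), and `read_fst(..., as.data.table = TRUE)` are real API, while the paths and function names are illustrative:

```r
library(fst)
library(data.table)

# On the worker: write the big table to the shared file system as fst
# and return only the path, so batchtools serializes almost nothing.
worker_fun = function(id) {
  res = data.table(id = id, value = rnorm(1e6))
  path = file.path("/shared/results", sprintf("job-%i.fst", id))
  write_fst(res, path, compress = 50)  # compression level 0..100
  path
}

# On the master: read the fst files back instead of deserializing RData.
load_result = function(path) read_fst(path, as.data.table = TRUE)
```

fst also reads column subsets and row ranges without loading the whole file, which is part of why it beats RData/rds for tabular results.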

mllg commented 6 years ago

> RData: it indeed works for all objects, as does rds (I would have expected the latter to be faster to load if used without compression, but I might be wrong). It is still very slow when it comes to data.frames (or data.tables). I have "overridden" the storage format to use fst binary files and gained a lot, and given that data.frames are likely to be the most common output, I think it would make sense to offer this as an easily accessible option.

I will open a separate feature request to support fst files. On HPCs it is often faster to read compressed files (depending on the compression, of course). On network file systems the IO is usually the bottleneck, not the CPU.

> What I was wishing for: I was kind of hoping that the results would be sent back to the master over the network through some flavor of IPC.

I'm not aware of a standardized method to do this. Additionally, what to do if the master crashes? Are the results then lost?

statquant commented 6 years ago

> I will open a separate feature request to support fst files. On HPCs it is often faster to read compressed files (depending on the compression, of course). On network file systems the IO is usually the bottleneck, not the CPU.

fst ships with a compression/decompression library that is far more efficient; you can "compress" your fst file with a level from 0 to 100. I do not get this I/O consideration: the workers are on distant machines, so they will write to a network file system, and then the data has to transit through the network too, no?

> I'm not aware of a standardized method to do this. Additionally, what to do if the master crashes? Are the results then lost?

I have literally zero knowledge about IPC and networking in general, so excuse me if I am talking nonsense, but wouldn't you be able to communicate through TCP, for instance?

mllg commented 6 years ago

> fst ships with a compression/decompression library that is far more efficient; you can "compress" your fst file with a level from 0 to 100. I do not get this I/O consideration: the workers are on distant machines, so they will write to a network file system, and then the data has to transit through the network too, no?

Yes, that is completely right. But consider the following example: you have a 10GB result file, your CPU can compress/decompress at 2GB/s, but your network file system doesn't get faster than 1GB/s. You now have two options:

  1. Store the file uncompressed while your CPU mostly idles. Writing the file takes 10s.
  2. Use some CPU cycles to compress the file on-the-fly and thereby reduce the file size to 5GB. Writing the file takes 5s.

If you are writing to local SSDs it is the other way around: there, the CPU is usually the bottleneck.
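The back-of-the-envelope numbers above can be checked directly (sizes in GB, throughputs in GB/s; the 2:1 compression ratio is the example's assumption):

```r
size_gb  = 10    # uncompressed result size
cpu_gbps = 2     # compression throughput
net_gbps = 1     # network file system write throughput
ratio    = 0.5   # assumed compression ratio (10 GB -> 5 GB)

# Option 1: write uncompressed; bound entirely by the file system.
t_uncompressed = size_gb / net_gbps                 # 10 s

# Option 2: compress on the fly; compressing 10 GB at 2 GB/s takes 5 s
# and writing 5 GB at 1 GB/s takes 5 s, so the overlapped pipeline runs
# at the speed of its slower stage.
t_compressed = max(size_gb / cpu_gbps,
                   size_gb * ratio / net_gbps)      # 5 s
```

Swap in SSD-like numbers (say `net_gbps = 3` with the same CPU) and the `max()` flips: compression becomes the slower stage, which is the point made above about local disks.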

> I have literally zero knowledge about IPC and networking in general, so excuse me if I am talking nonsense, but wouldn't you be able to communicate through TCP, for instance?

Yes, that would in principle be possible. Still, you would need a running master process and would risk losing your results if the master crashes.
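For completeness, base R can do this kind of socket IPC with `socketConnection()` and `serialize()`/`unserialize()` — a toy sketch, not anything batchtools provides. The fragility described above is visible in it: the result exists only in transit and in the master's memory, so a master crash before `unserialize()` loses it (hostname and port are made up):

```r
# Master (must stay alive for the whole computation):
con = socketConnection(port = 11234, server = TRUE,
                       blocking = TRUE, open = "a+b")
result = unserialize(con)   # blocks until a worker sends its result
close(con)

# Worker:
con = socketConnection(host = "master.example.org", port = 11234,
                       blocking = TRUE, open = "a+b")
serialize(my_result, con)   # no file is ever written
close(con)
```

Compare this with the file-system approach: there, a crashed master simply re-attaches to the registry directory and the results are still on disk.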