markovmodel / adaptivemd

A Python framework to run adaptive Markov state model (MSM) simulations on HPC resources
GNU Lesser General Public License v2.1

API change Staging for generators is also a simple task #11

Open jhprinz opened 7 years ago

jhprinz commented 7 years ago

I think that makes sense in our case, but not necessarily for RP compatibility.

What is the problem?

Currently a generator (openmmengine, etc.) often requires the same files for every run. For example, OpenMM needs system.xml, a .pdb file, etc. Of course we do not create these files again every time; we store them once on the cluster and then use the staging_area, which is exactly this location. So the staging_area holds files that we need to copy once and then access often, but that are not results.

RP allows you to run a special stage_in command to push these files to the cluster before you run tasks. But we run into the problem that now every worker copies these files to make sure that they exist. We have already hit problems where the same file can potentially be created several times, and we need try-except clauses to make sure this works (see the sketch below).
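To illustrate the current situation, here is a minimal sketch (not the actual adaptivemd worker code) of the kind of guarded copy each worker has to perform so that concurrent staging of the same file does not fail:

```python
import os
import shutil


def stage_file(source, staging_dir):
    """Copy a shared input file (e.g. system.xml) into the staging area.

    Several workers may attempt this at the same time, so the copy has to
    tolerate the file already existing or appearing while we work.
    """
    os.makedirs(staging_dir, exist_ok=True)
    target = os.path.join(staging_dir, os.path.basename(source))
    if os.path.exists(target):
        return target  # another worker already staged it
    try:
        shutil.copy(source, target)
    except OSError:
        # a concurrent worker may have created the file in the meantime;
        # only re-raise if the file is really missing
        if not os.path.exists(target):
            raise
    return target
```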

There are two options to fix this.

  1. Forget about the staging area and always copy the files from the DB (which is already possible).

  2. Keep the staging_area, but treat staging as a separate task. In practice, the first task creates all staging files and all other tasks depend on it; the first running worker picks up the staging task, and once it is done we can continue with the regular tasks (see the sketch below). I still need to think about whether there are other problems, but this seems more consistent and simpler.
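To make option 2 concrete, here is a self-contained toy sketch of the dependency idea; `Task` and the scheduling loop are stand-ins for illustration, not the real adaptivemd API:

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy stand-in for a task with explicit dependencies."""
    name: str
    dependencies: list = field(default_factory=list)
    done: bool = False

    def ready(self):
        return all(dep.done for dep in self.dependencies)


# one staging task; every trajectory task depends on it
staging = Task('stage system.xml / initial.pdb')
trajectories = [Task(f'trajectory-{i}', dependencies=[staging]) for i in range(3)]

queue = [staging] + trajectories
while queue:
    # a worker only picks tasks whose dependencies are finished, so the first
    # free worker runs the staging task once and everything else follows
    task = next(t for t in queue if t.ready())
    print('running', task.name)
    task.done = True
    queue.remove(task)
```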

nsplattner commented 7 years ago

I'm in favour of option 1. Some system-specific files are always needed, but it's not a big overhead to copy them (they are usually small compared to the result files). This makes writing and debugging MD-code-specific scripts easier. It's not necessary to store these files in the end, so it's always just a temporary redundancy of information.

jhprinz commented 7 years ago

Yes and no. Imagine there are 1000 workers running on Titan, and every time they start a trajectory each one requests about 15 MB from the DB. If the DB is running here at FU (which is possible), then you need to transfer 15 GB before you can even start. I guess mechanism 2 would simply minimise the transfer to the cluster. Admittedly, 15 MB is at the very large end, but still.

I agree that 1. is much simpler and would be one transfer per job. Currently it is one transfer per worker, and option 2 would be one transfer per project.

nsplattner commented 7 years ago

I think we will have to test this on different remote resources to know whether it is a problem. In general, though, I think there will be other, more important problems, e.g. continuous access to the database being limited by restrictive access policies on some clusters.

If this turns out to be a problem, what could also help is a directory on the remote cluster from which workers can copy system files. This is probably the most basic functionality of a staging area. For this, the only change needed is that workers can get the path to that directory from the DB and copy all files from it into their working directory.
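A rough sketch of that basic functionality, assuming the worker can already look up the staging path (the lookup itself is left as a placeholder):

```python
import os
import shutil


def copy_staging_files(staging_dir, working_dir):
    """Copy every file from the shared staging directory into the worker's working directory."""
    os.makedirs(working_dir, exist_ok=True)
    for name in os.listdir(staging_dir):
        source = os.path.join(staging_dir, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(working_dir, name))


# staging_dir = get_staging_path(db)  # hypothetical lookup of the path stored in the DB
# copy_staging_files(staging_dir, os.getcwd())
```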

jhprinz commented 7 years ago

Well, the staging option already works and was there before the DB one. It is part of the original RP implementation, and to stay compatible we might still need it.

My main point was the question of whether we want to support a special "now move all the staging files" function and mechanism, or whether we just implement this as a simple task that only copies the files but does not run any code.

I guess if I had changed that, you would not have noticed, except that there would be one more task in the list.

nsplattner commented 7 years ago

I think a simple task is sufficient. Maybe just add an example of how this can be done.
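Such an example could look roughly like this (the file list and helper below are made up for illustration and are not the actual adaptivemd setup): a task whose only job is to copy the generator's static inputs into the staging area.

```python
# files a generator like the OpenMM engine typically needs for every run
# (names are placeholders)
STAGING_FILES = ['system.xml', 'initial.pdb']


def staging_commands(source_dir, staging_dir):
    """Return the shell commands a staging-only task would execute."""
    return [f'cp {source_dir}/{name} {staging_dir}/{name}' for name in STAGING_FILES]


print('\n'.join(staging_commands('$PROJECT', '$STAGING_AREA')))
```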