markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
311 stars 119 forks source link

PyEMMA config, logging and trajectory indexing cache #760

Closed franknoe closed 8 years ago

franknoe commented 8 years ago

Address issues/policies related to creation of files that log, write config or cache trajectories. Since these issues are closely related, I summarize them here. Feel free to split them into separate issues if more appropriate.

  1. Creation and setting of the PyEMMA configuration file: Optional. By default there is no PyEMMA configuration file and default settings should be used (for example, progress bars on if user is in an IPython environment, and otherwise off). Different configuration settings will be considered if the user has explicitly created a PyEMMA config file under ~/.pyemma/ or has explicitly requested to write that file by something like pyemma.config.save(). By default, PyEMMA should not try to create a PyEMMA directory or write a config file. In order to inform the user, the state of the configuration setting should be logged (at a level that is visible by default) when PyEMMA is imported. For example: PyEMMA v2.1. No configuration file found, defaults are used or PyEMMA v2.1. Using configuration found at ~/.pyemma/pyemma.cfg
  2. Trajectory index cache: By default trajectory caching is done in memory and therefore not permanent / only valid in this interpreter session. If config flag is set and sqllite is available (thread-safeness!), then persistent caching information will be created under ~/.pyemma. If either of these conditions are not met, or storage creation is unsuccessful, caching information will be be stored persistently during this session. We need a way to clear up the cache. Suggested solution: use a queue for cached entries and maximize the total cache size (e.g. 100 MB).
  3. Logfiles: File Logging is optional and should only be activated by setting a config flag.

The overall picture is that PyEMMA should only try to create files with useful information if explicitly demanded. Even in this case, creation of files should always be attempted with a fallback option, such that we don't generate crashes when we can't access the disk. However, in order to make it easy for the average user to change settings, I suggest a global "look and feel" switch such as pyemma.config.make_my_life_easy(default=False) (please find better name). If set to false, I suggest reporting that at import time (visibly), such that it's obvious how to activate easy life.

marscher commented 8 years ago

A comment to point 2: after reading https://www.sqlite.org/howtocorrupt.html I'm not sure whether it is a good idea to try storing this information on a shared file system.

franknoe commented 8 years ago

I'm very open to drop the database / persistent storage for now. If it can work robustly, it's nice to have of course. @clonker what do you think?

Am 14/04/16 um 21:23 schrieb Martin K. Scherer:

A comment to point 2: after reading https://www.sqlite.org/howtocorrupt.html I'm not sure whether it is a good idea to try storing this information on a shared file system.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/760#issuecomment-210109579


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

franknoe commented 8 years ago

BTW, Point 3: What does currently happen if multiple PyEMMA processes work in the same directory? I guess they won't write into the same logfile, will they?

clonker commented 8 years ago

Since newer versions of sqlite (sqlite3), it is supposed to be thread safe (see, e.g., here). When opening a connection to the database in python, there should be a named argument check_same_thread which should prevent corruption in a multithreaded environment. A simpler approach would maybe to have a small tool that generates such a database for a fixed set of trajectories which then can be loaded into pyemma..

marscher commented 8 years ago

Thread safety is not the crucial point here, but multi-process (write-) access. There you will most likely corrupt the database file because file locking is not well implemented on network filesystems like NFS. To point 3: Things would also be messed up here, since if the log file is in the same shared location, multiple processes will open the log file for writing. So we need to add a unique pattern to the log file name to avoid this.

marscher commented 8 years ago

All issues are now fixed by #774. I've turned of the persistence of the cache completely for the upcoming version. Before changing this again, we need to discuss whether it is better to have this handled by TrajDB.

franknoe commented 8 years ago

Sorry for being unclear - at least we should discuss this issue in some detail before pushing a release.

Am 18/04/16 um 20:07 schrieb Martin K. Scherer:

All issues are now fixed by #774 https://github.com/markovmodel/PyEMMA/pull/774. I've turned of the persistence of the cache completely for the upcoming version. Before changing this again, we need to discuss whether it is better to have this handled by TrajDB.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/760#issuecomment-211506392


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

marscher commented 8 years ago

Am 18.04.2016 um 20:26 schrieb Frank Noe:

Sorry for being unclear - at least we should discuss this issue in some detail before pushing a release. Right. I thought you want me to push it ASAP, since it fixes a critical problem.

https://github.com/markovmodel/PyEMMA/issues/775#issuecomment-211293446

  • So is the trajectory cache only in memory now, or is there some way to activate it? How does that behavior affect the calculations @gph82 mentioned? Yes, he will not suffer from exceptions due to cache access now since it is only stored in memory.
  • Is the new behavior even tested in practical use? Please clarify. We have a lot of test cases. The cache used to life in memory before persistence got into play. So it is just a switch back to the old behaviour - so the answer should be from my point of view yes. If you ask for a relase candidate cycle or whatever for a service release, which should be tested intensively, then please tell me in advance.
franknoe commented 8 years ago

Am 18/04/16 um 20:46 schrieb Martin K. Scherer:

Am 18.04.2016 um 20:26 schrieb Frank Noe:

Sorry for being unclear - at least we should discuss this issue in some detail before pushing a release. Right. I thought you want me to push it ASAP, since it fixes a critical problem.

https://github.com/markovmodel/PyEMMA/issues/775#issuecomment-211293446 I know, but I didn't mean to cut short on testing and discussion of issues - otherwise we just introduce new problems.

  • So is the trajectory cache only in memory now, or is there some way to activate it? How does that behavior affect the calculations @gph82 mentioned? Yes, he will not suffer from exceptions due to cache access now since it is only stored in memory. Right, but is there any way to make the cache persistent now, or is this switched off? If I understood Guille correctly, it was important for his application that he has the possibility to use the trajectory cache, and we had discussed to switch it on optionally.
  • Is the new behavior even tested in practical use? Please clarify. We have a lot of test cases. The cache used to life in memory before persistence got into play. So it is just a switch back to the old behaviour - so the answer should be from my point of view yes. If you ask for a relase candidate cycle or whatever for a service release, which should be tested intensively, then please tell me in advance. What I meant is - has anyone at least installed and worked with this version.

Let's play with it a bit before releasing this patch.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/760#issuecomment-211521979


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany