Closed: tixxit closed this 11 years ago
Review by @nuttycom
This looks useful, but as far as I can see it would do nothing to prevent the "too many open files" problem we had recently: those failures occurred during prolonged periods of continuous activity, when an inactivity timeout would never fire.
So I think we will need to add a bounded LRU cache so that there is a fixed upper bound on the number of open file handles. Otherwise, especially during reingest scenarios (which is where we encountered the problem), we will probably see the same issue again.
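A minimal sketch of the kind of bounded LRU cache suggested here, in plain JDK Java. The class name HandleCache, the opener function, and the maxOpen limit are all hypothetical and not part of NIHDB; the sketch only illustrates the technique of closing the least recently used handle once a fixed limit is exceeded.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded LRU cache of open handles (illustration only, not NIHDB code). */
final class HandleCache<K, H extends Closeable> {
    private final int maxOpen;
    private final Function<K, H> opener;
    private final LinkedHashMap<K, H> handles;

    HandleCache(int maxOpen, Function<K, H> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.handles = new LinkedHashMap<K, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, H> eldest) {
                if (size() <= HandleCache.this.maxOpen) {
                    return false;
                }
                // Over the limit: close the least recently used handle, then drop it.
                closeHandle(eldest.getValue());
                return true;
            }
        };
    }

    /** Returns the cached handle for key, opening it (and possibly evicting another) if needed. */
    synchronized H get(K key) {
        H handle = handles.get(key); // a hit also marks the entry as recently used
        if (handle == null) {
            handle = opener.apply(key);
            handles.put(key, handle); // may trigger removeEldestEntry
        }
        return handle;
    }

    /** Number of handles currently held open; never exceeds maxOpen after a get returns. */
    synchronized int openCount() {
        return handles.size();
    }

    private static void closeHandle(Closeable handle) {
        try {
            handle.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One caveat with this sketch: an evicted handle is closed even if a caller still holds a reference to it, so a real implementation would probably need reference counting or a lease around each use.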
This adds the capability for NIHDB to go into an inactive (quiescent) state. Basically, the DB remains open, but it releases most of its resources (open files, raw log, etc.). This state is entered explicitly by calling the NIHDB#quiesce method. Any subsequent method call automatically brings the resources back into existence.
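The quiesce-and-reopen behaviour described above can be sketched roughly as follows, in plain JDK Java. QuiescableStore, withResources, and isQuiescent are hypothetical names for illustration and do not reflect the actual NIHDB API. Unlike the real change, this sketch also starts in the quiescent state and opens its resources on first use.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical store that can release its resources and reacquire them lazily. */
final class QuiescableStore<R extends AutoCloseable> {
    private final Supplier<R> open;
    private R resources; // null while quiescent

    QuiescableStore(Supplier<R> open) {
        this.open = open;
    }

    /** Releases the resources; the store itself stays usable. Idempotent. */
    synchronized void quiesce() {
        if (resources == null) {
            return;
        }
        try {
            resources.close();
        } catch (Exception e) {
            throw new RuntimeException("failed to release resources", e);
        } finally {
            resources = null;
        }
    }

    synchronized boolean isQuiescent() {
        return resources == null;
    }

    /** Runs op against the resources, reopening them first if the store is quiescent. */
    synchronized <T> T withResources(Function<R, T> op) {
        if (resources == null) {
            resources = open.get();
        }
        return op.apply(resources);
    }
}
```

Routing every operation through a single method like withResources is one way to guarantee that no call path can observe the released state.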
Additionally, the VFS PathManagingActor now has an inactivity timeout that calls NIHDB#quiesce on all NIHDB versions after a configurable number of seconds (storage.quiescence_timeout, default 300). This is implemented as a receive timeout, a built-in Akka feature that sends a ReceiveTimeout message after a given period of inactivity.
I'm not sure how to unit test this. However, I did fire up the shard and put a resource through a cycle of quiescence and work, and everything worked fine: after 5 minutes (the default timeout) the NIHDB would quiesce, and subsequent queries and ingests would work just fine (then it would quiesce again, and so on).
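One possible way to make the inactivity logic unit-testable is to separate it from the actor and drive it with an injected clock. The sketch below is plain JDK Java and does not use Akka's ReceiveTimeout; IdleTracker and its methods are hypothetical names. It also assumes the idle callback should fire at most once per idle period, which is a choice made for this sketch and not something the change above specifies.

```java
import java.util.function.LongSupplier;

/** Hypothetical idle tracker that mimics an inactivity timeout without Akka. */
final class IdleTracker {
    private final long timeoutMillis;
    private final LongSupplier clock; // injected so tests can control time
    private final Runnable onIdle;    // e.g. something that would call quiesce
    private long lastActivity;
    private boolean fired;

    IdleTracker(long timeoutMillis, LongSupplier clock, Runnable onIdle) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.onIdle = onIdle;
        this.lastActivity = clock.getAsLong();
    }

    /** Records activity and re-arms the timeout. */
    synchronized void touch() {
        lastActivity = clock.getAsLong();
        fired = false;
    }

    /** Intended to be called periodically; runs onIdle once the timeout has elapsed. */
    synchronized void check() {
        if (!fired && clock.getAsLong() - lastActivity >= timeoutMillis) {
            fired = true;
            onIdle.run();
        }
    }
}
```

In a test, the clock can be a simple mutable counter, so the 300-second default can be exercised without waiting.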