Also, using DHAT profiling on this simple example:
let run = fun s =>
  let updateDict = fun dict char =>
    if records.hasField char dict then
      dict -$ char & {"#{char}" = dict."#{char}" + 1}
    else
      dict & {"#{char}" = 1} in
  lists.foldl updateDict {} (strings.chars s) in
run "abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcd"
we can see that most of the allocations come from the HashMap cloning induced by Thunk::into_closure.
To generate this DHAT profile, check out the branch bin_with_dhat, copy the example into a file "countLetters.ncl", and run:

cargo run --release --features=memprofile -- -f countLetters.ncl export
Otherwise, here is the JSON to load into dh_view:
dhat-heap.zip
Edit: for context, this profiling was done to understand why the records benchmark is OOM-killed right away, and the answer seems linked to this issue, so I'm posting it here.
Thanks for the detailed report! Indeed the current representation of environments is not optimal.
I don't have much time to look at this right now, but just wanted to make a bunch of quick remarks:
On most of those, the majority of the Nickel evaluation time is spent in the function Thunk::into_closure [...]
It must be noted that Thunk::into_closure is responsible for extracting a term from a thunk before evaluating it. Because the RichTerm type currently uses owned pointers (that is, Box), any shared term has to be deep-copied by Thunk::into_closure. Granted, it also has to copy the environment of the closure. I don't know which cost dominates; you say later that the DHAT profile shows a lot of hashmap cloning, so maybe it is the environment copying. I just wanted to note that it is possible that the cost of Thunk::into_closure is dominated by copying the term itself, in which case the representation of the environment is orthogonal to the problem, and the solution is to use Rc in place of Box (or, in the long term, a GCed pointer) in the RichTerm type. Even if it is not the dominating cost today, it is probably an important performance issue nonetheless.
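To make the Rc-versus-Box point concrete, here is a minimal sketch with hypothetical, simplified term types (not Nickel's actual RichTerm definition): cloning a Box-based term duplicates the whole subtree, while cloning an Rc-based term only bumps reference counts and shares the subterms.

use std::rc::Rc;

// Hypothetical, simplified AST types for illustration only.
#[derive(Clone)]
enum OwnedTerm {
    Num(f64),
    App(Box<OwnedTerm>, Box<OwnedTerm>), // clone() deep-copies both subterms
}

#[derive(Clone)]
enum SharedTerm {
    Num(f64),
    App(Rc<SharedTerm>, Rc<SharedTerm>), // clone() only increments refcounts
}

fn main() {
    let owned = OwnedTerm::App(Box::new(OwnedTerm::Num(1.0)), Box::new(OwnedTerm::Num(2.0)));
    let _deep_copy = owned.clone(); // cost proportional to the size of the term

    let shared = SharedTerm::App(Rc::new(SharedTerm::Num(1.0)), Rc::new(SharedTerm::Num(2.0)));
    let _alias = shared.clone(); // O(1), subterms are shared
}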
Environment could be a chained list of environments, with the outer ones shared; creating a new environment would add a new one at the head of the list, something like

struct Environment {
    current: HashMap<Ident, Thunk>,
    previous: Option<Rc<RefCell<Environment>>>,
}
In that situation, looking for an identifier up the chain could be a bit inefficient, because we don't know at which level a specific identifier sits. For example, querying for a top-level binding x in an environment with 5 levels goes like this: query the first hashmap, not found, go one level up, query the second hashmap, not found, go one level up, etc.
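As a rough sketch of what that representation could look like (with placeholder Ident and Thunk types standing in for Nickel's actual definitions): creating a child scope is O(1) because the parent is shared behind an Rc, while a lookup miss walks the whole chain.

use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// Placeholder types standing in for Nickel's Ident and Thunk.
type Ident = String;

#[derive(Clone)]
struct Thunk;

struct Environment {
    current: HashMap<Ident, Thunk>,
    previous: Option<Rc<RefCell<Environment>>>,
}

impl Environment {
    // O(1): a child scope shares its parent instead of cloning its hashmap.
    fn child_of(parent: Rc<RefCell<Environment>>) -> Environment {
        Environment {
            current: HashMap::new(),
            previous: Some(parent),
        }
    }

    // Walks up the chain of parents; in the worst case this visits every level,
    // which is the lookup cost discussed above.
    fn get(&self, id: &Ident) -> Option<Thunk> {
        if let Some(thunk) = self.current.get(id) {
            return Some(thunk.clone());
        }
        let mut parent = self.previous.clone();
        while let Some(rc_env) = parent {
            let env = rc_env.borrow();
            if let Some(thunk) = env.current.get(id) {
                return Some(thunk.clone());
            }
            parent = env.previous.clone();
        }
        None
    }
}

fn main() {
    let root = Rc::new(RefCell::new(Environment {
        current: HashMap::from([("x".to_string(), Thunk)]),
        previous: None,
    }));
    let child = Environment::child_of(root);
    assert!(child.get(&"x".to_string()).is_some()); // found by walking up one level
}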
Nix uses De Bruijn indices to solve this problem, so that variables are statically assigned a level and an index before evaluation. Currently the Nickel evaluator sometimes needs to generate terms dynamically during evaluation, which may mess up the levels and indices, so this is not totally trivial to adapt, although I think it is still doable using a few offsets here and there. Another solution is to try to get rid of these dynamically generated terms, for example by generating them at program transformation time already, if possible.
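Purely as an illustration of the idea (not how Nix or Nickel actually implement it), once variables are resolved to a (level, index) pair ahead of time, lookup becomes two indexing operations instead of a chain of hashmap queries:

// Hypothetical leveled environment: levels[0] is the outermost scope, and a
// variable compiled to (level, index) is fetched without hashing or chain walking.
struct LeveledEnv<T> {
    levels: Vec<Vec<T>>,
}

impl<T> LeveledEnv<T> {
    fn get(&self, level: usize, index: usize) -> Option<&T> {
        self.levels.get(level)?.get(index)
    }
}

fn main() {
    let env = LeveledEnv { levels: vec![vec!["top-level x"], vec!["local y"]] };
    assert_eq!(env.get(0, 0), Some(&"top-level x"));
}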
In any case, the type you propose (modulo some details: I'm not sure what the reason for the RefCell is?) may be a reasonable first step, given we define an environment trait generic enough to easily switch implementations.
For the record, just as a thing to explore one day (but not necessarily now), we could also try to implement the environment structure described in section 8 of this paper on environments and abstract machines: Okasaki's random-access lists, which provide list operations in constant time and random access/update in logarithmic time.
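For reference, here is a minimal, generic sketch of the idea behind Okasaki's random-access lists (a spine of complete binary trees whose sizes follow the skew-binary number system), with only cons and lookup and no attempt to tailor it to Nickel's environments:

use std::rc::Rc;

// A complete binary tree holding one element at every node.
enum Tree<T> {
    Leaf(T),
    Node(T, Rc<Tree<T>>, Rc<Tree<T>>),
}

// One spine cell: a tree of `size` elements followed by the rest of the list.
struct Cell<T> {
    size: usize,
    tree: Rc<Tree<T>>,
    rest: RaList<T>,
}

struct RaList<T>(Option<Rc<Cell<T>>>);

impl<T> Clone for RaList<T> {
    fn clone(&self) -> Self {
        RaList(self.0.clone())
    }
}

impl<T> RaList<T> {
    fn nil() -> Self {
        RaList(None)
    }

    // O(1) cons: if the two leading trees have equal sizes, fuse them under a
    // new root; otherwise push a singleton leaf.
    fn cons(&self, x: T) -> Self {
        if let Some(c1) = &self.0 {
            if let Some(c2) = &c1.rest.0 {
                if c1.size == c2.size {
                    return RaList(Some(Rc::new(Cell {
                        size: 1 + c1.size + c2.size,
                        tree: Rc::new(Tree::Node(x, c1.tree.clone(), c2.tree.clone())),
                        rest: c2.rest.clone(),
                    })));
                }
            }
        }
        RaList(Some(Rc::new(Cell {
            size: 1,
            tree: Rc::new(Tree::Leaf(x)),
            rest: self.clone(),
        })))
    }

    // O(log n) random access: skip whole trees along the spine, then descend.
    fn lookup(&self, mut i: usize) -> Option<&T> {
        let mut list = self;
        loop {
            let cell = list.0.as_deref()?;
            if i < cell.size {
                return Some(lookup_tree(&cell.tree, cell.size, i));
            }
            i -= cell.size;
            list = &cell.rest;
        }
    }
}

fn lookup_tree<T>(tree: &Tree<T>, size: usize, i: usize) -> &T {
    match tree {
        Tree::Leaf(x) => x,
        Tree::Node(x, left, right) => {
            let half = size / 2;
            if i == 0 {
                x
            } else if i <= half {
                lookup_tree(left, half, i - 1)
            } else {
                lookup_tree(right, half, i - 1 - half)
            }
        }
    }
}

fn main() {
    // Index 0 is always the most recently added element.
    let env = RaList::nil().cons("z").cons("y").cons("x");
    assert_eq!(env.lookup(0), Some(&"x"));
    assert_eq!(env.lookup(2), Some(&"z"));
    assert_eq!(env.lookup(3), None);
}

With such a structure, an environment indexed by De Bruijn indices could be extended in O(1) and queried in O(log n), while remaining fully persistent (shared between closures).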
Hey @yannham , thanks for the feedback. I agree, this is not an optimal replacement, since it just trades lookup cost for cloning cost. It just seemed like an easy change to try, to test whether optimizing the environment would give us better benchmark results. If not, we'll have to find something better. But following the results of the DHAT analysis, we really have no choice but to reduce the cloning as much as possible.
I'm confused by your remark on RefCell. Maybe it is because I am not completely familiar with the code yet, but won't we need a way to introduce mutability in the previous environments at some point?
Although the current code does remove bindings from the environment, it only does so because it can afford to (the environment is a local copy). I suspect we can get by without needing to mutate the previous environments themselves, unless we morally own them (all the previous Rc pointers have a reference count of 1), but in that case we don't need the additional RefCell.
Anyway, this is premature nitpicking. The proposal sounds reasonable and is well-delimited: let us make a first draft PR, see how the benchmarks go, and discuss the technical details there.
Closed with PR #381
@Acaccia the environment representation you proposed above has pros and cons. I see that this issue is now closed following the merge of #381. Have you documented in any issue or PR the before and after effect on performance and/or memory usage?
@mboes Sorry for the late answer; I missed this notification. Please see this comment for benchmarks before/after this change: https://github.com/tweag/nickel/pull/381#issuecomment-911715011
Is your feature request related to a problem? Please describe.
Using the benchmarks developed in the context of issue #332 and PR #333, I generated flamegraphs for each of those benchmarks. On most of them, the majority of the Nickel evaluation time is spent in the function Thunk::into_closure, which induces a huge amount of thunk cloning, as well as HashMap cloning in _evalclosures. Both of these seem related to the cloning of Environment.
Describe the solution you'd like
Environment could be a chained list of environments, with the outer ones shared; creating a new environment would add a new one at the head of the list, something like:

struct Environment {
    current: HashMap<Ident, Thunk>,
    previous: Option<Rc<RefCell<Environment>>>,
}
This should make cloning cheaper by sharing the outer environments. However, it will make lookups more expensive, so this solution should be benchmarked and compared against the current implementation.
Additional context
The flamegraphs can be generated from the branch benches_and_flamegraphs. Just use

cargo bench -- --profile-time=<time>

to generate them. For my flamegraphs, I used time = 5. (I do not think this branch will be merged, since it adds a new dependency on pprof and changes the Criterion profiler.) This issue can be seen in the flamegraphs of benches like functions/church 3 or lists/generate normal|deepseq 30, lists/map normal|deepseq 30, ...