During one of the customer interactions, I tried loading a ~700 MB CSV dataset
from an external dataset into an internal one, with a trivial transformation
between the two. With default settings, this worked poorly or not at all: on a
MacBook Air with 4 GB RAM, it took about 1754s vs. 148s for a bulk load, and on
my laptop the default settings would not even finish. Tweaking the config to
allow the memory component to grow past 32 MB let the workload finish, but it
was still about 6.5x slower (~823s vs. 127s).
The default settings let the memory component be about 32 MB per index. I
believe it was said that this might not be an optimal choice. I also recall
that all indices share the same settings, so there is an issue with making it
too big. However, if it is too small, it seems one can rather easily get into a
situation where more time is spent merging components than actually inserting
data (or at least this is how it seems). I do have a YourKit snapshot of both
of these scenarios, but they are too big to attach to the issue. If they are of
interest, please let me know and I can email them.
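
To make the suspected trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the system's code: the function names, the 256 MB alternative budget, and the toy merge policy (merge every disk component into one as soon as `max_components` of them exist) are all assumptions for illustration, and the real merge policy may behave quite differently.

```python
import math


def flush_count(data_mb, budget_mb):
    """Number of memory-component flushes needed to ingest data_mb."""
    return math.ceil(data_mb / budget_mb)


def merge_rewrite_mb(data_mb, budget_mb, max_components=5):
    """Total MB rewritten by merges under a toy (assumed) merge policy.

    Each flush writes one disk component of at most budget_mb. Whenever
    max_components disk components exist, they are all merged into one,
    which rewrites every byte they contain.
    """
    components = []
    rewritten = 0
    remaining = data_mb
    while remaining > 0:
        flushed = min(budget_mb, remaining)
        components.append(flushed)
        remaining -= flushed
        if len(components) >= max_components:
            merged = sum(components)
            rewritten += merged
            components = [merged]
    return rewritten


if __name__ == "__main__":
    for budget in (32, 256):  # 256 MB is a hypothetical larger setting
        print(budget, flush_count(700, budget), merge_rewrite_mb(700, budget))
```

Under these assumptions, a 700 MB load with a 32 MB budget flushes 22 times and rewrites far more data in merges than it ingests, while a larger budget flushes only a few times and may never trigger a merge. That is consistent with the behaviour described above, but it is only a model and not a measurement.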
Original issue reported on code.google.com by ima...@uci.edu on 11 Oct 2014 at 12:09