pferrel / template-scala-parallel-universal-recommendation


UnionRDD $plus$plus taking a long time #28

Open mamoit opened 7 years ago

mamoit commented 7 years ago
UnionRDD [64]
$plus$plus at URModel.scala:86

This stage is taking a very long time compared to everything else. I have a lot of new items coming in all the time, and this stage in particular (stage 35) gets stuck for 1h. I'm using a Spark cluster with 3 nodes, each with 4 cores and around 16GB of RAM, but it seems I can't speed this stage up (and it is the bottleneck for the whole training).

The training CPU usage looks something like this: [screenshot: screenshot_20170504_110902]. That plateau is stage 35, more precisely the $plus$plus at URModel.scala:86.

Any idea on how to get over this bottleneck?

pferrel commented 7 years ago

Welcome to big data ;-)

Spark requires that all data fit in memory across the cluster, so 16GB per node may be undersized. How big is your data when exported?

Also, the read and write operations often take the most time, and you can't always trust what the UI says about which code is being executed. The Spark GUI does its best to report what is slow, but it can only resolve down to a closure; inside there are many lines of code, some of which trigger lazy evaluation of many other things in the Spark DAG.
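As an illustration, here is a minimal standalone sketch of the "$plus$plus" pattern (it is not the template's actual code, just the same shape): ++ is an alias for union, which only concatenates partition lists, so the call itself is instant; the time appears later, when a shuffle or action forces the lazy lineage behind it, and the stage carries the union's call site as its label.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only -- not URModel code. The maps do almost all the work, yet
// the heavy stage is labelled with the ++ call site, because that is the
// last RDD before the shuffle.
object UnionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-sketch").setMaster("local[*]"))

    val a = sc.parallelize(1 to 1000000).map(i => (i % 1000, burnCpu(i)))  // lazy, nothing runs yet
    val b = sc.parallelize(1 to 1000000).map(i => (i % 1000, burnCpu(i)))  // lazy
    val merged = a ++ b                                                    // instant: metadata only

    // The groupByKey forces everything above; in the UI the stage that
    // does the heavy lifting shows up as "$plus$plus at ..." even though
    // the union itself costs nothing.
    println(merged.groupByKey().count())
    sc.stop()
  }

  // Stand-in for real per-record work
  private def burnCpu(i: Int): Int = {
    var x = i
    var k = 0
    while (k < 1000) { x = x * 31 + k; k += 1 }
    x
  }
}
```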

The spike may be a bottleneck, but since it completes quickly and (I assume) produces no errors, is it really a problem? The long time at low load can often be improved with better partitioning, which is included in UR v0.5.0 and PIO-0.10.0, the Apache versions of things.
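To show what the partitioning point looks like in practice, here is an illustrative sketch (again not UR code): if an RDD arrives with only a few partitions, one stage can sit on a couple of cores while the rest of a 3 x 4-core cluster idles, which looks exactly like a long plateau at low CPU. Repartitioning spreads the same work over more tasks. The partition count of 24 is only a rule-of-thumb assumption, not a UR default.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of how a narrow RDD can be spread over more tasks.
object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[*]"))

    val narrow = sc.parallelize(1 to 1000000, numSlices = 2)   // pretend the source handed us 2 partitions
    println(s"before: ${narrow.getNumPartitions} partitions")

    // Rule of thumb (an assumption to tune): 2-3 tasks per core,
    // so roughly 24-36 partitions for 3 nodes x 4 cores.
    val spread = narrow.repartition(24)
    println(s"after: ${spread.getNumPartitions} partitions")

    println(spread.map(_ * 2L).count())   // force the work across all partitions
    sc.stop()
  }
}
```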

FYI this repo is being deprecated since we merged our version of PIO with the Apache PIO. The UR template has moved to https://github.com/actionml/universal-recommender

mamoit commented 7 years ago

Thanks :smile:

I can't export the data right now, so I can't give a precise figure, but HBase reports 3.1GB used.

The bottleneck here really is the long time at low load, probably due to the unoptimized partitioning. It has been getting worse with the number of items that can be recommended, not with the number of events in general. We had a big inventory spike, which resulted in this plateau. The nodes' network is quite idle during that time, so it is not a problem of accessing HBase or ES.

Is there a way to improve the partitioning without changing PIO's version? I tried the Apache version a while ago, but it was still too quirky.
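For reference, the kind of knob I have in mind is a plain Spark property rather than anything PIO-specific; a sketch of what I'd try is below (the values are guesses for this cluster, not UR defaults, and whether they actually reach the training job depends on how the config is passed through to spark-submit):

```scala
import org.apache.spark.SparkConf

// Sketch only: both keys are standard Spark properties; values are placeholders.
val conf = new SparkConf()
  .set("spark.default.parallelism", "24")   // default partition count for shuffles
  .set("spark.executor.memory", "12g")      // leave headroom under the 16GB nodes
```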

mamoit commented 7 years ago

I noticed that my driver doesn't have as much memory as the nodes themselves... I'll try to increase it.
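In case it helps anyone else, a quick way to check what the driver and executors were actually given (both are standard Spark keys that fall back to 1g when unset), assuming an existing SparkContext such as the `sc` provided by spark-shell:

```scala
// Assumes an existing SparkContext named sc (e.g. in spark-shell).
println(sc.getConf.getOption("spark.driver.memory").getOrElse("1g (default)"))
println(sc.getConf.getOption("spark.executor.memory").getOrElse("1g (default)"))
```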

mamoit commented 7 years ago

I increased the driver memory to 14GB (which it doesn't fully use), and the plateau now lasts for 1h. Way shorter, but still quite long.

pferrel commented 7 years ago

This is not a supported template; please leave questions at https://groups.google.com/forum/#!forum/actionml-user

And use the template for PIO-0.10.0 here: https://github.com/actionml/universal-recommender

Why do you find the behavior wrong? I see nothing unusual about it without access to the GUI. Training often takes hours, but since the query and input servers continue to operate without interruption, this is normal behavior.