Open exalate-issue-sync[bot] opened 1 year ago
Former user commented: Partially fixed. I no longer pass the full DataInfo object, but all of GLRMParameters is still being serialized.
JIRA Issue Migration Info
Jira Issue: PUBDEV-1535
Assignee: Former user
Reporter: Former user
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
GLRM slows down when run on a multi-node cluster. The suspected culprit is the amount of data serialized in each MRTask, particularly the DataInfo and GLRMParameters objects, which must be sent between nodes. [~accountid:557058:e393304e-df0f-4e4f-a4bf-cb0cdf121b88] is currently testing this hypothesis.
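A rough illustration of why the serialized payload matters, written in Python rather than against the H2O code base. Everything here is hypothetical: `FullParams`, `TaskParams`, `slim`, and `payload_bytes` are made-up stand-ins, and JSON is used only as a convenient way to measure size, not as the serialization H2O actually uses. The sketch compares shipping a whole parameter object with shipping only the fields a task reads.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FullParams:
    """Hypothetical stand-in for a large parameter object (not H2O's GLRMParameters)."""
    k: int = 10
    gamma_x: float = 0.1
    gamma_y: float = 0.1
    # Bulky field that a per-chunk task would not need to read.
    user_points: list = field(default_factory=lambda: [0.0] * 100_000)


@dataclass
class TaskParams:
    """Only the fields the (hypothetical) task actually uses."""
    k: int
    gamma_x: float
    gamma_y: float


def slim(p: FullParams) -> TaskParams:
    # Copy just the needed fields so the bulky ones are never serialized.
    return TaskParams(k=p.k, gamma_x=p.gamma_x, gamma_y=p.gamma_y)


def payload_bytes(obj) -> int:
    # Size of the object once serialized; JSON is an assumption for illustration.
    return len(json.dumps(asdict(obj)).encode("utf-8"))


full = FullParams()
full_size = payload_bytes(full)
slim_size = payload_bytes(slim(full))
print(f"full object: {full_size} bytes, slim object: {slim_size} bytes")
```

If the per-task payload is sent to every node on every pass, the difference between the two sizes is paid repeatedly, which is the cost this issue suspects.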
On large data (e.g., BigCross), the network communication overhead is offset by the speedup from distributed computation, so GLRM still runs faster on multiple nodes than on a single node, as desired.
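The trade-off can be sketched with a toy cost model. This is an assumption for illustration, not a measurement: it supposes compute time divides evenly across nodes and that multi-node runs pay a fixed communication overhead. The function name `run_time` and all numbers are made up.

```python
def run_time(compute_seconds: float, nodes: int, comm_seconds: float) -> float:
    """Toy model: compute splits evenly; communication is a fixed cost on multi-node runs."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if nodes == 1:
        return compute_seconds  # no network traffic on a single node
    return compute_seconds / nodes + comm_seconds


# Illustrative numbers only (not taken from any benchmark).
COMM = 5.0   # assumed fixed communication overhead in seconds
NODES = 4

small_single = run_time(4.0, 1, COMM)
small_multi = run_time(4.0, NODES, COMM)
large_single = run_time(400.0, 1, COMM)
large_multi = run_time(400.0, NODES, COMM)

print("small data: multi-node slower?", small_multi > small_single)
print("large data: multi-node faster?", large_multi < large_single)
```

Under this model the overhead dominates when there is little compute to split, and becomes negligible once the compute term is large, which matches the behavior described for large datasets.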