vaaaaanquish closed 2 years ago
Sample that takes a long time to start.
import luigi
import gokart


class Zero(gokart.TaskOnKart):
    def run(self):
        self.dump(0)


class Add(gokart.TaskOnKart):
    x = gokart.TaskInstanceParameter()
    y = luigi.IntParameter()

    def run(self):
        self.dump(self.load() + self.y)


x = Zero()
for i in range(100):
    x = Add(x=x, y=i)
gokart.build(x)
DictParameter recursively JSON-serializes its parameters.
https://github.com/spotify/luigi/blob/master/luigi/parameter.py#L1003
The same goes for TaskInstanceParameter, which uses DictParameter.
https://github.com/m3dev/gokart/blob/master/gokart/parameter.py
These are what's causing this hell.
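To make the recursive serialization concrete, here is a minimal stdlib-only sketch. It does not use luigi or gokart. `serialize_task` and `serialize_chain` are hypothetical helpers that only mimic the behaviour described here, where each upstream task is embedded as an already-serialized JSON string and so gets escaped again at every level.

```python
import json


def serialize_task(task_type, params):
    """Hypothetical helper: dump a task as JSON whose parameter values are plain strings."""
    return json.dumps({"type": task_type, "params": params})


def serialize_chain(depth):
    """Serialize Zero wrapped in `depth` Add tasks, re-encoding the upstream string each time."""
    serialized = serialize_task("Zero", {})
    for i in range(depth):
        # The upstream task is passed in as a string, so json.dumps escapes
        # every quote and backslash it already contains.
        serialized = serialize_task("Add", {"x": serialized, "y": str(i)})
    return serialized


# Keep the depth small: the string grows very quickly with each level.
lengths = [len(serialize_chain(depth)) for depth in range(8)]
for depth, length in enumerate(lengths):
    print(depth, length)
```

Each extra level escapes the quotes and backslashes of the level below, so the serialized string grows much faster than the number of tasks in the chain.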
There are two solutions.
- Override luigi.Task.to_str_params in gokart.
- Give TaskInstanceParameter its own serialize.
TaskInstanceParameter.serialize is executed 25249 times in the above sample code.
The values inside serialize are as follows:
{'type': 'Zero', 'params': {}}
{'type': 'Add', 'params': {'x': '{"type": "Zero", "params": {}}', 'y': '0'}}
{'type': 'Zero', 'params': {}}
{'type': 'Zero', 'params': {}}
{'type': 'Add', 'params': {'x': '{"type": "Zero", "params": {}}', 'y': '0'}}
{'type': 'Add', 'params': {'x': '{"type": "Add", "params": {"x": "{\\"type\\": \\"Zero\\", \\"params\\": {}}", "y": "0"}}', 'y': '1'}}
{'type': 'Add', 'params': {'x': '{"type": "Add", "params": {"x": "{\\"type\\": \\"Add\\", \\"params\\": {\\"x\\": \\"{\\\\\\"type\\\\\\": \\\\\\"Zero\\\\\\", \\\\\\"params\\\\\\": {}}\\", \\"y\\": \\"0\\"}}", "y": "1"}}', 'y': '2'}}
...
Imagine this being repeated 25249 times :)
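As a rough illustration of why the call count explodes, the toy model below counts serialize calls when every task in a chain builds its own parameters once and each build re-serializes the whole upstream chain. `ToyTask` and `count_serialize_calls` are hypothetical names, not gokart API. The model returns nested dicts instead of strings to stay cheap, and it does not try to reproduce the exact figure of 25249.

```python
class ToyTask:
    """Hypothetical stand-in for a task with at most one upstream task parameter."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream


def count_serialize_calls(chain_length):
    """Count serialize calls when every task in the chain builds its parameters once."""
    calls = {"n": 0}

    def serialize(task):
        calls["n"] += 1
        # Serializing a task needs its parameters, which serializes its upstream again.
        return {"type": task.name, "params": to_str_params(task)}

    def to_str_params(task):
        if task.upstream is None:
            return {}
        return {"x": serialize(task.upstream)}

    task = ToyTask("Zero")
    tasks = [task]
    for _ in range(chain_length):
        task = ToyTask("Add", upstream=task)
        tasks.append(task)

    for t in tasks:
        to_str_params(t)
    return calls["n"]


counts = {n: count_serialize_calls(n) for n in (1, 10, 100)}
print(counts)
```

In this model a chain of n Add tasks triggers n(n+1)/2 serialize calls, so the work grows quadratically with the pipeline length even before the size of the serialized strings is taken into account.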
https://github.com/m3dev/gokart/pull/257 will solve the problem of bloated memory.
[future] Caching TaskInstanceParameter.serialize by its input can speed up the process.
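A minimal sketch of the caching idea, using only the standard library. `ToyTask` and `cached_serialize` are hypothetical and are not gokart's implementation. The cache is keyed on the task object itself (identity-based hashing), which assumes a task is not mutated after construction.

```python
from functools import lru_cache


class ToyTask:
    """Hypothetical stand-in for a task with at most one upstream task parameter."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream


calls = {"n": 0}  # counts how often the serialize body actually runs


@lru_cache(maxsize=None)
def cached_serialize(task):
    """Serialize a task once; repeated requests for the same object hit the cache."""
    calls["n"] += 1
    params = {}
    if task.upstream is not None:
        params["x"] = cached_serialize(task.upstream)
    return {"type": task.name, "params": params}


task = ToyTask("Zero")
tasks = [task]
for _ in range(100):
    task = ToyTask("Add", upstream=task)
    tasks.append(task)

# Each Add task asks for its upstream to be serialized, as to_str_params would.
for t in tasks[1:]:
    cached_serialize(t.upstream)

print(calls["n"])
```

With the cache, each upstream task is serialized once no matter how many downstream tasks ask for it, so the number of real serializations grows linearly with the chain length in this toy setting.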
This is recursion.
https://github.com/m3dev/gokart/blob/master/gokart/task.py#L285-L303
self.to_str_params(only_significant=True) appends the result of the JSON serialization of each parameter. As a result of repeated JSON serialization, we have the following in dependencies.
gokart uses a lot of memory when the pipeline is long, and the job starts very slowly.