m3dev / gokart

Gokart solves reproducibility, task dependencies, constraints of good code, and ease of use for Machine Learning Pipeline.
https://gokart.readthedocs.io/en/latest/
MIT License

Recursively large parameters #254

Closed vaaaaanquish closed 2 years ago

vaaaaanquish commented 3 years ago

The parameters are serialized recursively.

https://github.com/m3dev/gokart/blob/master/gokart/task.py#L285-L303

self.to_str_params(only_significant=True) appends the JSON-serialized form of each parameter. Because a task-valued parameter is serialized again at every level of nesting, the already-serialized string is escaped over and over, and dependencies ends up containing the following.

dependencies.append(self.to_str_params(only_significant=True))

\"params\": {\"target\": \"{\\\"type\\\": \\\"task.Aggregation\\\", \\\"params\\\": {\\\"train\\\": \\\"{\\\\\\\"type\\\\\\\": \\\\\\\"task.Sample\\\\\\\", \\\\\\\"params\\\\\\\": {\\\\\\\"target\\\\\\\": \\\\\\\"{\\\\\\\\\\\\\\\"type\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\"task.Query\\\\\\\\\\\\\\\", \\\\\\\\\\\\\\\"params\\\\\\\\\\\\\\\": {\\\\\\\\\\\\\\\"target\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\"{\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"type\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"task.Add\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\", \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"params\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": {\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"target\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"{\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"type\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"task.Drop\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\",

As a result, gokart uses a lot of memory when the pipeline is long, and the job starts very slowly.
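
To illustrate why the string blows up: each time an already-serialized JSON string is embedded in another JSON document, every quote and backslash inside it is escaped again. The following is a minimal sketch using only the standard json module. The nest helper is hypothetical and is not part of gokart or luigi; it only mimics the shape of the dump above.

```python
import json


def nest(depth):
    """Embed a serialized task into its parent `depth` times.

    Hypothetical helper: each level stores the child as a JSON *string*,
    which is what causes the repeated escaping.
    """
    payload = json.dumps({"type": "Zero", "params": {}})
    for i in range(depth):
        payload = json.dumps({"type": "Add", "params": {"x": payload, "y": str(i)}})
    return payload


# Print the serialized length for a few nesting depths.
for depth in range(0, 13, 2):
    print(depth, len(nest(depth)))
```

The innermost quotes gain roughly twice as many backslashes per level, so the length grows much faster than the number of tasks.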

vaaaaanquish commented 3 years ago

Sample that takes a long time to start.

import luigi

import gokart

class Zero(gokart.TaskOnKart):
    def run(self):
        self.dump(0)

class Add(gokart.TaskOnKart):
    x = gokart.TaskInstanceParameter()
    y = luigi.IntParameter()

    def run(self):
        self.dump(self.load() + self.y)

# Build a chain of 100 Add tasks, each taking the previous task as a parameter.
x = Zero()
for i in range(100):
    x = Add(x=x, y=i)

gokart.build(x)

vaaaaanquish commented 3 years ago

DictParameter recursively JSON-serializes the parameters. https://github.com/spotify/luigi/blob/master/luigi/parameter.py#L1003

The same goes for TaskInstanceParameter, which uses DictParameter. https://github.com/m3dev/gokart/blob/master/gokart/parameter.py

These are what's causing this hell.

vaaaaanquish commented 3 years ago

There are two solutions.

  1. Override luigi.Task.to_str_params in gokart.
  2. Give TaskInstanceParameter.serialize its own serialization logic.
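
As a rough sketch of the idea behind the second option, a serializer can keep nested tasks as plain dicts and call json.dumps only once at the top level, so nothing is escaped twice. MiniTask, to_nested_dict and serialize_once are hypothetical names for illustration only. This is not gokart's code and not necessarily what any fix in gokart does.

```python
import json


class MiniTask:
    """Hypothetical stand-in for a task; not gokart's TaskOnKart."""

    def __init__(self, name, child=None, y=None):
        self.name = name
        self.child = child
        self.y = y


def to_nested_dict(task):
    # Keep the child as a dict instead of a pre-serialized JSON string.
    params = {}
    if task.child is not None:
        params["x"] = to_nested_dict(task.child)
    if task.y is not None:
        params["y"] = str(task.y)
    return {"type": task.name, "params": params}


def serialize_once(task):
    # A single json.dumps call, so quotes are never re-escaped.
    return json.dumps(to_nested_dict(task))


task = MiniTask("Zero")
for i in range(100):
    task = MiniTask("Add", child=task, y=i)

text = serialize_once(task)
print(len(text))
```

With this shape the output grows linearly with the length of the chain and contains no backslashes at all.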

vaaaaanquish commented 3 years ago

TaskInstanceParameter.serialize is executed 25249 times in the above sample code.

The values passed to serialize look like this:

{'type': 'Zero', 'params': {}}
{'type': 'Add', 'params': {'x': '{"type": "Zero", "params": {}}', 'y': '0'}}
{'type': 'Zero', 'params': {}}
{'type': 'Zero', 'params': {}}
{'type': 'Add', 'params': {'x': '{"type": "Zero", "params": {}}', 'y': '0'}}
{'type': 'Add', 'params': {'x': '{"type": "Add", "params": {"x": "{\\"type\\": \\"Zero\\", \\"params\\": {}}", "y": "0"}}', 'y': '1'}}
{'type': 'Add', 'params': {'x': '{"type": "Add", "params": {"x": "{\\"type\\": \\"Add\\", \\"params\\": {\\"x\\": \\"{\\\\\\"type\\\\\\": \\\\\\"Zero\\\\\\", \\\\\\"params\\\\\\": {}}\\", \\"y\\": \\"0\\"}}", "y": "1"}}', 'y': '2'}}
...

Imagine this being repeated 25249 times :)
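
The repeated pattern in that output can be reproduced with a toy model. This is a sketch under assumptions: MiniTask, serialize and count_calls are hypothetical, and the model only assumes that every task in the chain has its parameters stringified once. It does not try to reproduce the exact count above, and it uses a short chain because the strings grow very quickly.

```python
import json

calls = 0  # number of times the toy serialize() runs


class MiniTask:
    """Hypothetical stand-in for a task with at most one task-valued parameter."""

    def __init__(self, name, x=None, y=None):
        self.name = name
        self.x = x
        self.y = y

    def to_str_params(self):
        params = {}
        if self.x is not None:
            # Serializing the child walks the whole chain below it again.
            params["x"] = serialize(self.x)
        if self.y is not None:
            params["y"] = str(self.y)
        return params


def serialize(task):
    global calls
    calls += 1
    return json.dumps({"type": task.name, "params": task.to_str_params()})


def count_calls(n):
    """Build a chain of n Add tasks, stringifying each task's params once."""
    global calls
    calls = 0
    task = MiniTask("Zero")
    for i in range(n):
        task = MiniTask("Add", x=task, y=i)
        task.to_str_params()
    return calls


print(count_calls(10))
```

In this model a chain of n tasks triggers n * (n + 1) / 2 serialize calls, because each new task re-serializes everything beneath it.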

vaaaaanquish commented 3 years ago

https://github.com/m3dev/gokart/pull/257 will solve the problem of bloated memory.

vaaaaanquish commented 3 years ago

[future] Caching the result of TaskInstanceParameter.serialize per input could speed up the process.
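
A minimal sketch of what such caching could look like, using functools.lru_cache on a toy serializer. MiniTask, serialize and build_chain are hypothetical names, and the sketch assumes a task's parameters never change after construction. It only reduces the number of serialize computations; it does not shrink the escaped strings themselves.

```python
import functools
import json


class MiniTask:
    """Hypothetical stand-in for a task; hashed by identity, as plain objects are."""

    def __init__(self, name, x=None, y=None):
        self.name = name
        self.x = x
        self.y = y

    def to_str_params(self):
        params = {}
        if self.x is not None:
            params["x"] = serialize(self.x)
        if self.y is not None:
            params["y"] = str(self.y)
        return params


@functools.lru_cache(maxsize=None)
def serialize(task):
    # Computed once per task object; later calls return the cached string.
    return json.dumps({"type": task.name, "params": task.to_str_params()})


def build_chain(n):
    """Build a chain of n Add tasks and report cache statistics."""
    serialize.cache_clear()
    task = MiniTask("Zero")
    for i in range(n):
        task = MiniTask("Add", x=task, y=i)
        task.to_str_params()
    return serialize.cache_info()


info = build_chain(10)
print(info.misses, info.hits)
```

Each task is serialized only once here, so the number of real computations grows linearly with the chain length instead of quadratically.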