mindsdb / lightwood

Lightwood is Legos for Machine Learning.
GNU General Public License v3.0

Replace dill with pickle #1194

Closed: StpMax closed this 5 months ago

StpMax commented 9 months ago

I noticed that pickle works much faster than dill when dumping the model. Results for a model dump (regular home_rentals):

# pickle dump
def x():
    with open(file_path, "wb") as fp:
        pickle.dump(self, fp, protocol=5)

timeit.timeit(x, number=100)  # total seconds for 100 dumps
0.31827622700075153

# dill dump
def y():
    with open(file_path, "wb") as fp:
        dill.dump(self, fp)

timeit.timeit(y, number=100)
2.572783964998962

Model load times are about equal:

# pickle load
def x():
    with open(state_file, 'rb') as fp:
        predictor = pickle.load(fp)

timeit.timeit(x, number=100)  # total seconds for 100 loads
0.26572044099884806

# dill load
def y():
    with open(state_file, 'rb') as fp:
        predictor = dill.load(fp)

timeit.timeit(y, number=100)
0.26366778900046484
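
For reference, the snippets above use self and file_path from inside the predictor, so here is a self-contained sketch of the same comparison with a hypothetical stand-in object and file name; absolute timings will differ from the numbers above:

import pickle
import timeit

import dill

obj = {"weights": list(range(100_000)), "name": "home_rentals"}  # hypothetical stand-in for the predictor
file_path = "model.bin"  # hypothetical file name

def dump_pickle():
    with open(file_path, "wb") as fp:
        pickle.dump(obj, fp, protocol=5)

def dump_dill():
    with open(file_path, "wb") as fp:
        dill.dump(obj, fp)

print("pickle dump:", timeit.timeit(dump_pickle, number=100))
print("dill dump:  ", timeit.timeit(dump_dill, number=100))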

What are the reasons to use dill? If there is no reason, then @paxcema can you please confirm the results on your PC, and if pickle is also faster for you, maybe we should change dill to pickle?

paxcema commented 9 months ago

Interesting benchmarks, @StpMax. I think this is expected in general (source).

I do remember there was a reason we moved from pickle to dill at some point a couple of years ago. I want to say it had to do with the confidence module, but I will double check. We now include this module within Lightwood, so there is a chance we can make pickle the new default. Alternatively, we can try the option recommended in the link above, byref=True, and benchmark it against pickle as a middle ground.
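
A rough sketch of that middle ground, assuming your dill version accepts byref=True as a keyword to dump (dill.settings['byref'] is the equivalent global switch); the object here is a hypothetical stand-in rather than the real predictor:

import pickle
import timeit

import dill

obj = {"weights": list(range(100_000))}  # hypothetical stand-in for the predictor
file_path = "model.bin"                  # hypothetical file name

def dump_dill_byref():
    with open(file_path, "wb") as fp:
        # byref=True makes dill pickle importable objects by reference
        # instead of serializing their definitions, which is usually faster
        dill.dump(obj, fp, byref=True)

def dump_pickle():
    with open(file_path, "wb") as fp:
        pickle.dump(obj, fp, protocol=5)

print("dill byref=True:", timeit.timeit(dump_dill_byref, number=100))
print("pickle:         ", timeit.timeit(dump_pickle, number=100))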

paxcema commented 5 months ago

@StpMax is this still relevant? If we really need the extra performance gains, we can look into merging this. Otherwise I would close it, as indeed some of the classes in our confidence module require dill (pickle breaks on them).
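
To illustrate the kind of failure meant here (a guess at the cause, not the exact class from the confidence module): pickle cannot serialize dynamically defined callables such as lambdas or locally defined functions, while dill serializes them by value.

import pickle

import dill

class ConformalWrapper:  # hypothetical stand-in, not the real confidence-module class
    def __init__(self):
        # a lambda attribute is one common thing pickle cannot handle
        self.transform = lambda x: x * 0.5

obj = ConformalWrapper()

try:
    pickle.dumps(obj)
except (pickle.PicklingError, AttributeError) as exc:
    print("pickle failed:", exc)

# dill serializes the lambda by value, so the round trip works
restored = dill.loads(dill.dumps(obj))
print(restored.transform(4))  # 2.0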

StpMax commented 5 months ago

@paxcema let's close it for now.