kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License

Pickling issues #71

Closed: TrigonaMinima closed this issue 4 years ago

TrigonaMinima commented 4 years ago

Hi, I have found a problem with pickling and unpickling the trees. After following the original streaming example, I tried pickling and unpickling a single tree, but the results are not the same.

import pickle
import rrcf

# forest is the list of RCTree objects built in the streaming example
t = forest[0]
with open("a.pkl", "wb") as f:
    pickle.dump(t.to_dict(), f)

t2 = rrcf.RCTree()
with open("a.pkl", "rb") as f:
    t2.load_dict(pickle.load(f))

len(t.leaves)
# 257

len(t2.leaves)
# 238

I first thought the issue was in the call to pickle.dump, since the tree dict is deeply nested, but the documentation says pickle raises RecursionError if such an object is encountered. So I think the issue is in the to_dict or load_dict functions. I tested with both pickle and dill.

mdbartos commented 4 years ago

Interesting. Did you try writing the dict to a json file without pickling? e.g.:

import json

with open('tree.json', 'w') as outfile:
    json.dump(obj, outfile)

TrigonaMinima commented 4 years ago

Hey @mdbartos, I get exactly the same problem with json.

import json
import rrcf

# forest is the list of RCTree objects built in the streaming example
t = forest[0]
with open("a.json", "w") as f:
    json.dump(t.to_dict(), f)

t2 = rrcf.RCTree()
with open("a.json", "r") as f:
    t2.load_dict(json.load(f))

len(t.leaves)
# 257

len(t2.leaves)
# 238

There really seems to be an issue with either the to_dict or the load_dict function. Tests should be implemented for this as well. I couldn't fully understand the code, otherwise I would have submitted a pull request. I will try to give it another go later in the week.
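The reasoning above, that pickle and json both give the same wrong leaf count and so the serialization layer is probably not at fault, can be checked in isolation. The sketch below round-trips a small nested dict through both formats and compares the result. The dict is a hypothetical stand-in for the output of t.to_dict() and is not rrcf's real layout; count_leaves is likewise a helper made up for this illustration.

```python
import json
import pickle


def count_leaves(node):
    """Count terminal (non-dict) values in a nested dict."""
    if not isinstance(node, dict):
        return 1
    return sum(count_leaves(child) for child in node.values())


# Hypothetical stand-in for a serialized tree; NOT the real rrcf layout.
nested = {
    "l": {"l": {"i": 0}, "r": {"i": 1}},
    "r": {"l": {"i": 2}, "r": {"l": {"i": 3}, "r": {"i": 4}}},
}

# Round-trip the same dict through both serializers.
via_pickle = pickle.loads(pickle.dumps(nested))
via_json = json.loads(json.dumps(nested))

print(via_pickle == nested, via_json == nested)
# → True True
print(count_leaves(nested), count_leaves(via_pickle), count_leaves(via_json))
# → 5 5 5
```

If a plain dict survives both round trips unchanged, a mismatch in len(t.leaves) has to come from how the dict is produced or consumed. One caveat: json turns non-string dict keys into strings, so the equality check only holds for dicts with string keys, as in this example.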

mdbartos commented 4 years ago

Greetings,

I haven't been able to replicate this yet. Is it possible to get a data sample? I wonder if it may have something to do with duplicates.

mdbartos commented 4 years ago

I've confirmed that this is an issue related to duplicate points. Moreover, it only seems to be a problem with the leaves dict--the tree structure itself appears to be ok.

The problem is on lines 816-817 of rrcf.py. to_dict may need to be rewritten to keep track of all the original indices for a given duplicated point.
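A toy sketch of this failure mode, under the assumption that duplicated points share a single leaf object while the leaves dict holds one entry per index. ToyLeaf, build_leaves, lossy_to_dict, tracking_to_dict, and load_leaves are hypothetical names for illustration only; this is not the code at lines 816-817 of rrcf.py and not necessarily how the fix was implemented.

```python
from dataclasses import dataclass


@dataclass
class ToyLeaf:
    point: tuple
    index: int      # first index that landed on this leaf
    count: int = 1  # number of duplicate points stored on this leaf


def build_leaves(points):
    """Map every index to a leaf; duplicate points share one leaf object."""
    by_point = {}
    leaves = {}
    for index, point in enumerate(points):
        leaf = by_point.get(point)
        if leaf is None:
            leaf = by_point[point] = ToyLeaf(point, index)
        else:
            leaf.count += 1
        leaves[index] = leaf
    return leaves


def lossy_to_dict(leaves):
    """One record per leaf, keeping only a single index (the assumed bug)."""
    unique = {id(leaf): leaf for leaf in leaves.values()}
    return [{"i": leaf.index, "x": list(leaf.point), "n": leaf.count}
            for leaf in unique.values()]


def tracking_to_dict(leaves):
    """One record per leaf, keeping every index that maps to it."""
    records = {}
    for index, leaf in leaves.items():
        rec = records.setdefault(
            id(leaf), {"ixs": [], "x": list(leaf.point), "n": leaf.count})
        rec["ixs"].append(index)
    return list(records.values())


def load_leaves(records):
    """Rebuild the index -> leaf mapping from serialized records."""
    leaves = {}
    for rec in records:
        indices = rec["ixs"] if "ixs" in rec else [rec["i"]]
        leaf = ToyLeaf(tuple(rec["x"]), indices[0], rec["n"])
        for index in indices:
            leaves[index] = leaf
    return leaves


# Five points, two of which repeat an earlier point.
points = [(0.0, 1.0), (2.0, 3.0), (0.0, 1.0), (4.0, 5.0), (2.0, 3.0)]
original = build_leaves(points)
lossy = load_leaves(lossy_to_dict(original))
full = load_leaves(tracking_to_dict(original))

print(len(original), len(lossy), len(full))
# → 5 3 5
```

In this toy model the restored mapping loses one entry per duplicate when only a single index is stored per leaf, which has the same shape as the 257 versus 238 mismatch reported above. Storing all indices for each leaf restores the full mapping.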

mdbartos commented 4 years ago

Greetings, this should now be addressed in #74