nucypher / zerodb

*This project is no longer actively maintained. If you'd like to become the maintainer, please let us know.* ZeroDB is an end-to-end encrypted database. Data can be stored and queried on untrusted database servers without ever exposing the encryption key. Clients can execute remote queries against the encrypted data without downloading all of it or suffering an excessive performance hit.
GNU Affero General Public License v3.0

Field(): BTree of TreeSets to be BTree of tuples if the number of objects per value is small #31

Closed: michwill closed this issue 8 years ago

michwill commented 8 years ago

Currently the Field() index is a BTree of TreeSets (for example, BTree(year -> TreeSet(uids))). However, this makes very little sense for rarely repeated values, especially floats. If we have, say, fewer than 10 objects with the same value of an attribute (the exact threshold to be determined by pickle sizes + encryption overhead), we should store a tuple of ids for that particular value. If there is only one object for a value, it makes sense to store the (integer) id itself.
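A minimal sketch of the proposed representation choice, not ZeroDB code. It assumes a hypothetical helper `compact_posting`, a hypothetical threshold `SMALL_LIMIT = 10` (the issue leaves the real value open), a plain `dict` standing in for the BTree, and a built-in `set` standing in for the TreeSet:

```python
from collections import defaultdict

SMALL_LIMIT = 10  # hypothetical threshold; the issue says it is still to be determined


def compact_posting(docids, small_limit=SMALL_LIMIT):
    """Pick the smallest container for the docids that share one field value.

    Returns a bare int for a single docid, a sorted tuple for fewer than
    `small_limit` docids, and a set (stand-in for a TreeSet) otherwise.
    An empty input yields an empty tuple.
    """
    ids = sorted(set(docids))
    if len(ids) == 1:
        return ids[0]
    if len(ids) < small_limit:
        return tuple(ids)
    return set(ids)


# Build a toy "year -> docids" index; dict is a stand-in for the BTree.
docs = [(1, 1999), (2, 2001), (3, 2001)]
grouped = defaultdict(list)
for docid, year in docs:
    grouped[year].append(docid)

index = {year: compact_posting(ids) for year, ids in grouped.items()}
```

Here 1999 maps to a bare docid and 2001 to a two-element tuple; only values shared by ten or more documents would keep a set-like container.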

Storing the id itself when there is only one, instead of a TreeSet, would shrink the index to ~5 bytes per 4-byte integer id, versus ~87 bytes for the same entry currently (sic!)
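One rough way to compare serialized sizes is to pickle each candidate container, as in the sketch below. It is only an approximation: the helper `pickled_size` is hypothetical, a built-in `set` stands in for the TreeSet, and neither the persistent-object record overhead nor the encryption overhead mentioned above is included, so the printed numbers will not match the figures quoted in this issue.

```python
import pickle


def pickled_size(obj, protocol=2):
    """Length in bytes of the pickle of `obj` (no encryption, no DB record header)."""
    return len(pickle.dumps(obj, protocol))


single_docid = 123456  # arbitrary example id that needs a 4-byte integer

as_int = pickled_size(single_docid)
as_tuple = pickled_size((single_docid,))
as_set = pickled_size({single_docid})  # set is only a stand-in for TreeSet

print("int:", as_int, "tuple:", as_tuple, "set:", as_set)
```

Running the same measurement on real TreeSet records, after encryption, would be the way to settle the threshold discussed above.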

xueyumusic commented 8 years ago

It seems this requires modifying the zope.index.field.index.FieldIndex.index_doc method. However, I have not found a convenient way to override this method through subclassing alone. I can think of two ways that might work. One is to change CatalogFieldIndex's MRO by inserting a new base class between CatalogIndex and FieldIndex. The other is to monkey patch the index_doc method directly... I am not sure which one is better, or whether there are other, better ways...
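The two options can be illustrated with stand-in classes. Everything below is an assumption made for illustration: the classes only mimic the inheritance shape described in this comment, their bodies are not the real zope.index or repoze.catalog code, and `CompactPostings` and `patched_index_doc` are hypothetical names.

```python
class FieldIndex:
    """Stand-in for zope.index.field.index.FieldIndex (real body not reproduced)."""

    def index_doc(self, docid, value):
        return "FieldIndex.index_doc"


class CatalogIndex:
    """Stand-in for the catalog wrapper; assumed to forward along the MRO."""

    def index_doc(self, docid, obj):
        return super().index_doc(docid, obj)


# Option 1: put a new base class between CatalogIndex and FieldIndex.
class CompactPostings(FieldIndex):
    def index_doc(self, docid, value):
        return "CompactPostings.index_doc"


class CatalogFieldIndex(CatalogIndex, CompactPostings):
    pass


mro_names = [cls.__name__ for cls in CatalogFieldIndex.__mro__]
via_mro = CatalogFieldIndex().index_doc(1, 2001)


# Option 2: monkey patch the method on a class directly.
class PatchedFieldIndex(FieldIndex):
    pass


def patched_index_doc(self, docid, value):
    return "patched_index_doc"


PatchedFieldIndex.index_doc = patched_index_doc
via_patch = PatchedFieldIndex().index_doc(1, 2001)
```

With the first option the change stays visible in the class hierarchy; the patch in the second option affects every user of the patched class, which is harder to trace.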

michwill commented 8 years ago

Wouldn't it work to just redefine the whole "index_doc" method in CatalogFieldIndex? I know, a lot of the original index_doc code would be copied across into it, but that's ok. Also, the update would not necessarily work as is: instead of doing "docids.add(...)" you'd do "docids += (docid,)", so you'd need to handle the TreeSet and tuple cases separately, as well as the case where you have a single integer docid instead of a tuple.
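The three cases could be handled roughly as follows. This is a sketch under stated assumptions, not the zope.index implementation: `CompactFieldIndex`, `postings`, `value_of`, and `SMALL_LIMIT` are hypothetical names, `dict` stands in for the BTrees, and `set` stands in for the TreeSet.

```python
SMALL_LIMIT = 10  # hypothetical threshold for switching from tuple to set


class CompactFieldIndex:
    """Toy field index storing int, tuple, or set per value (illustration only)."""

    def __init__(self):
        self.postings = {}  # value -> int | tuple | set (stand-in for a BTree)
        self.value_of = {}  # docid -> value, needed to unindex or reindex

    def index_doc(self, docid, value):
        if docid in self.value_of:
            if self.value_of[docid] == value:
                return  # already indexed under this value
            self.unindex_doc(docid)

        current = self.postings.get(value)
        if current is None:
            self.postings[value] = docid              # first docid: bare int
        elif isinstance(current, int):
            self.postings[value] = (current, docid)   # int -> tuple
        elif isinstance(current, tuple):
            grown = current + (docid,)                # "docids += (docid,)"
            if len(grown) < SMALL_LIMIT:
                self.postings[value] = grown
            else:
                self.postings[value] = set(grown)     # tuple -> set
        else:
            current.add(docid)                        # set case: "docids.add(...)"
        self.value_of[docid] = value

    def unindex_doc(self, docid):
        if docid not in self.value_of:
            return
        value = self.value_of.pop(docid)
        current = self.postings[value]
        if isinstance(current, int):
            del self.postings[value]
        elif isinstance(current, tuple):
            rest = tuple(d for d in current if d != docid)
            # A tuple holds at least two docids, so one always remains here.
            self.postings[value] = rest[0] if len(rest) == 1 else rest
        else:
            current.discard(docid)
            if not current:
                del self.postings[value]
```

In this sketch a set is never converted back to a tuple when it shrinks, which avoids repeated conversions around the threshold; whether that trade-off suits the real index is an open question.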

This could also require changing the query methods, which currently expect a TreeSet of docids to be assigned to each value.
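On the query side, one option is to normalize whatever is stored before the query logic touches it. The sketch below assumes hypothetical helpers `iter_docids`, `apply_eq`, and `apply_in_range`, with a `dict` standing in for the BTree and a `set` standing in for the TreeSet; it is not how the existing query methods are written.

```python
def iter_docids(posting):
    """Iterate docids regardless of whether `posting` is None, int, tuple, or set."""
    if posting is None:
        return iter(())
    if isinstance(posting, int):
        return iter((posting,))
    return iter(posting)


def apply_eq(postings, value):
    """Docids whose field equals `value`."""
    return set(iter_docids(postings.get(value)))


def apply_in_range(postings, low, high):
    """Docids whose field lies in [low, high].

    A plain dict has to be scanned in full; an ordered mapping such as a
    BTree could restrict the scan to the requested key range instead.
    """
    result = set()
    for value, posting in postings.items():
        if low <= value <= high:
            result.update(iter_docids(posting))
    return result


# Mixed representations in one index: bare int, tuple, and set.
postings = {1999: 1, 2001: (2, 3), 2005: {4, 5, 6}}
```

For example, `apply_eq(postings, 1999)` gives `{1}` and `apply_in_range(postings, 2000, 2010)` gives `{2, 3, 4, 5, 6}`, so callers never need to know which container a value uses.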

Also, it is not only in zope.index.field.index, it is also in repoze.catalog.indexes.field.CatalogFieldIndex, so you'd need to check whether those need rewriting too. If everything ends up rewritten, the dependency on repoze or zope.index could be dropped here (in that case, don't forget the "implements" decorator).

michwill commented 8 years ago

Testing with larger datasets than in py.test...