osirrc / ciff

Common Index File Format to support interoperability between open-source IR engines
http://ciff.osirrc.io/

Document gap compression #19

Open lintool opened 4 years ago

lintool commented 4 years ago

We need to explicitly document that docids are gap compressed, both in README and in the protobuf definition (i.e., in comments).
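For readers unfamiliar with the term, here is a minimal sketch of gap (delta) encoding for one postings list. It assumes docids are strictly increasing and that the first value is stored as an absolute docid, with every later value stored as the difference from its predecessor. The function name `to_gaps` is hypothetical and not part of CIFF.

```python
def to_gaps(docids):
    """Convert a strictly increasing list of docids into d-gaps.

    Assumption: the first entry is kept as the absolute docid
    (gap from an implicit 0); later entries are differences.
    """
    gaps = []
    prev = 0
    for docid in docids:
        gaps.append(docid - prev)  # distance from the previous docid
        prev = docid
    return gaps


print(to_gaps([3, 7, 8, 15]))  # [3, 4, 1, 7]
```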

cmacdonald commented 3 years ago

Indeed, this is not clear from the protobuf definition.

JMMackenzie commented 3 years ago

This is a slightly odd one, because the gap compression only arises due to the way the Lucene export is engineered. So I guess the question is: are we going to assume that any other system which may want to export a CIFF should also be doing delta compression? In that case, we should definitely document it with the CIFF/protobuf definition.

On the other hand, there's nothing inherently in the definition of the protobuf which makes it necessary to store deltas. Thoughts?

chriskamphuis commented 3 years ago

I think only the description should be updated. If systems are also allowed to export without storing deltas, a system has to know how the CIFF was constructed before reading it. It would be desirable to be consistent about how a CIFF should be constructed from a given index.

cmacdonald commented 3 years ago

Jimmy's implementation of the Lucene index export adds in the delta gap (this isn't related to the Lucene index itself). Assuming it's the de facto base, then readers and writers have to be aware of d-gaps. All of our impls now have d-gaps.

Arguably the name "docid" in the Posting object definition is what is wrong: if we were always going to use d-gaps, the name should have been different. As suggested in the OP, it's documentation changes that are needed.
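To illustrate what "aware of d-gaps" means on the reader side, here is a minimal sketch that recovers absolute docids with a running sum. It assumes the values in the "docid" field of one postings list have already been read into a plain Python list, and that the first value is an absolute docid. The function name `from_gaps` is hypothetical and not part of any CIFF reader.

```python
from itertools import accumulate


def from_gaps(gaps):
    """Recover absolute docids from d-gaps via a running sum.

    Assumption: the first value is the absolute docid and each
    later value is the gap to the previous docid.
    """
    return list(accumulate(gaps))


stored = [3, 4, 1, 7]      # values as they would sit in the "docid" field
print(from_gaps(stored))   # [3, 7, 8, 15]
```

A reader that took the same stored values as absolute docids would see `[3, 4, 1, 7]`, a different and unsorted postings list, which is why the convention has to be documented.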

JMMackenzie commented 3 years ago

I've started a branch to work on some improved documentation: https://github.com/osirrc/ciff/tree/documentation

Please feel free to contribute.

cmacdonald commented 3 years ago

Changes made.