matthewhorridge / binaryowl

A compact, fast to parse and render format for OWL 2
GNU Lesser General Public License v3.0

Binary OWL vs. compression: some analysis #6

Open sesuncedu opened 9 years ago

sesuncedu commented 9 years ago

Files encoded using Binary OWL show really poor compression ratios with gzip and xz.

For example, for the Gene Ontology, uncompressed FSS uses 58M; uncompressed binowl uses 28M.

But:

Gzipped FSS uses 5.3M; xz uses 3.1M. Gzipped binowl uses 8.8M; xz uses 4.4M.

Since the xz-compressed FSS shrinks to 2.7M if the file is lexically sorted, it seems possible that changing the order in which axioms are written by Binary OWL may have a salutary effect on compression ratios.

This seems to be the case:

  1. Sorting and renumbering the IRI table prior to output reduces the xz file to 4.3M.
  2. Sorting axioms within each axiom type grouping reduces the gz size to 4.8M, and xz size to 2.8M.

Better results may be possible if the previous approach of literal interning is revived; this would separate strings from the axiom structure, allowing for both to be optimized.

There may be better orderings than the standard owlapi sorts; e.g. axioms associated with sibling classes may have similar patterns.

The encoding format may benefit from using variable length integer encodings, and entity/literal references may benefit from delta-encoding. These changes can benefit uncompressed size and performance.
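
For illustration, here is a sketch of what such a variable-length integer writer could look like, using the common zigzag + LEB128-style scheme (as in Protocol Buffers). This is an assumption for illustration, not the Binary OWL wire format:

    import java.io.DataOutputStream;
    import java.io.IOException;

    final class VarInt {
        // ZigZag-map a signed value so small-magnitude deltas
        // (positive or negative) become small unsigned integers.
        static long zigzag(long v) {
            return (v << 1) ^ (v >> 63);
        }

        // LEB128-style: 7 payload bits per byte, high bit set on
        // every byte except the last.
        static void writeVarLong(DataOutputStream out, long v) throws IOException {
            long u = zigzag(v);
            while ((u & ~0x7FL) != 0) {
                out.writeByte((int) ((u & 0x7F) | 0x80));
                u >>>= 7;
            }
            out.writeByte((int) u);
        }
    }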

Instead of just using the most recent IRI or literal id, delta coding could be made relative to a small number of the most recent ids (e.g. the four most recent). This may create longer matches.

sesuncedu commented 9 years ago

Experimenting with a simple delta coding for IRI identifiers, with minimal changes to the output format, shows some interesting results.

The encoder maintains a table of previously used identifiers whose size is 2^{k}. When looking up a key, the table is scanned to identify the entry whose value has the smallest absolute difference from the key. If the difference is zero, that entry is shuffled to the start of the table; otherwise the table is shifted right, and the first entry is set to the key being looked up.

The return value is a long containing the delta left-shifted by k bits, with the index of the entry used as the base for the delta stored in the lower k bits.

To simplify testing, the index length/flag byte is used to indicate whether the coded value fits in a byte, a short, or an int, and the coded value is stored using the appropriate data output form. This simple approach wastes some space, as there is room available in the flag byte to store the delta-base code for small k.
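
As a rough sketch of the scheme just described (class and method names are mine, not from the Binary OWL codebase):

    // Sketch of the delta history table described above; illustrative only.
    final class DeltaHistoryTable {
        private final int k;
        private final long[] table;   // the 2^k most recently used ids

        DeltaHistoryTable(int k) {
            this.k = k;
            this.table = new long[1 << k];
        }

        // Returns the delta left-shifted by k bits, with the index of the
        // base entry in the lower k bits, as described above.
        long encode(long id) {
            int best = 0;
            long bestDelta = id - table[0];
            for (int i = 1; i < table.length; i++) {
                long d = id - table[i];
                if (Math.abs(d) < Math.abs(bestDelta)) {
                    best = i;
                    bestDelta = d;
                }
            }
            if (bestDelta == 0) {
                // Exact hit: shuffle the matching entry to the front.
                System.arraycopy(table, 0, table, 1, best);
            } else {
                // Miss: shift the table right; the new id goes in front.
                System.arraycopy(table, 0, table, 1, table.length - 1);
            }
            table[0] = id;
            return (bestDelta << k) | best;
        }
    }

The decoder reads the index from the lower k bits, recovers the id as table[index] + delta, and applies the same table update, so no search is needed on the decode side.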

For the GO example above:

| k | uncompressed size | xz size | writeIndex calls | byte-sized deltas | short deltas | int deltas |
|---|---|---|---|---|---|---|
| none | 27.6 MiB | 2,876.6 KiB | | | | |
| 0 | 26.5 MiB | 3,204.2 KiB | 1,135,194 | 258,805 | 681,388 | 185,738 |
| 1 | 25.7 MiB | 2,675.5 KiB | 1,135,194 | 888,857 | 140,792 | 96,282 |
| 2 | 25.4 MiB | 2,503.9 KiB | 1,135,194 | 1,029,732 | 76,807 | 19,392 |
| 3 | 25.4 MiB | 2,534.5 KiB | 1,135,194 | 1,063,861 | 47,088 | 14,982 |
| 4 | 25.4 MiB | 2,566.4 KiB | 1,135,194 | 1,082,483 | 28,477 | 14,971 |

This adds an additional ~8% saving in uncompressed size, and a ~13% saving in compressed size, relative to no delta coding for IRIs.

There is a tiny overhead for table maintenance (especially when decoding), and the table is likely to fit in a single cache line.

The relatively poor performance with only a single table entry (k=0) is to be expected. For example, a reference to the IRI id for xsd:string from a Literal used in an annotation assertion is likely to generate a large negative delta, followed by a large positive delta; the annotation property is also likely to generate a similar pair of deviations.

It is possible that separate history tables for properties, classes, and datatypes might be useful, as might different tables for LHS and RHS of subclass axioms, etc. However, all this is probably secondary to moving strings & literals into their own segment.

matthewhorridge commented 9 years ago

Simon, do you have any idea of how this affects parse time?

sesuncedu commented 9 years ago

Matthew - parse time is basically unaffected (it may be slightly reduced, but there is sufficient noise in the I/O code to make it hard to measure). The delta coding is especially cheap when decoding, as there's no need to search the history table (which is quite likely to be in a single cache line).

Output time is increased due to the cost of sorting; this is not hideously expensive.

For the possible big win for ontology load times (vs parse times), see my next issue.

sesuncedu commented 9 years ago

Moving literals to their own block, with indexes generated by walking the axioms in the same order as the final axiom output, using k = 4, gives a small win.

| | uncompressed size | xz size |
|---|---|---|
| before | 25.4 MiB | 2,503.9 KiB |
| k=4 | 23.3 MiB | 2,486.8 KiB |

It's possible that sorting instead of using axiom order will give sufficiently better results for the string table compression to make up for the more random deltas; it may also be the case that separating the textual data from the binary data (i.e. the lengths and datatypes) may be beneficial.

sesuncedu commented 9 years ago

Changing the way that deltas are stored (for literals only, so far) gives another win.

| k | delta base encoding | uncompressed size | xz size |
|---|---|---|---|
| 4 | use lower k bits of delta | 23.3 MiB | 2,486.8 KiB |
| 4 | use lower k bits of flag word | 23.3 MiB | 2,186.7 KiB |
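
To make the two variants concrete, a hypothetical layout (the bit positions are my assumption, not the actual format):

    // Hypothetical comparison of the two delta-base placements (k = 4).
    final class DeltaBasePlacement {
        // Variant 1: the base index occupies the lower k bits of the coded
        // delta, so the delta is shifted left and more often spills into a
        // larger size class (byte -> short -> int).
        static long baseInDelta(long delta, int baseIndex, int k) {
            return (delta << k) | baseIndex;
        }

        // Variant 2: the base index is tucked into spare bits of the flag
        // byte, leaving the delta unshifted.
        static int flagWithBase(int sizeCode, int baseIndex) {
            return (sizeCode << 6) | (baseIndex & 0x0F);   // assumed bit layout
        }
    }
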
sesuncedu commented 9 years ago

As above, but with the LiteralLookupTable serialized in one of two ways, with fields converted from AoS (array of structs) to SoA (struct of arrays), either (a) writing each literal as a DataOutputStream UTF8String:

   int literalCount;
   IRI[literalCount] datatypes;
   UTF8String[literalCount] lang;
   UTF8String[literalCount] literal;

or (b) writing each literal as null-terminated UTF-8 bytes:

   int literalCount;
   IRI[literalCount] datatypes;
   UTF8String[literalCount] lang;
   struct {
       byte[] utf8;          // literal.getBytes(Charsets.UTF_8)
       byte term = '\0';
   } [literalCount];

| table encoding | uncompressed size | xz size |
|---|---|---|
| before: writeRawLiteral (AoS) | 23.3 MiB | 2,186.7 KiB |
| (a) SoA with writeUTF | 24.2 MiB | 2,183.4 KiB |
| (b) SoA with null termination | 23.9 MiB | 1,984.9 KiB |

Null termination doesn't really add much to the cost of decoding, since there's a mandatory array allocation and copy when the String object is finally built.
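
For what it's worth, a sketch of the decode side of option (b) (names are mine):

    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    final class NullTerminatedUtf8 {
        // Read bytes up to the NUL terminator, then build the String;
        // the array copy happens anyway when the String is constructed.
        static String read(DataInputStream in) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream(32);
            int b;
            while ((b = in.readByte()) != 0) {
                buf.write(b);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }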

Note that the uncompressed size increases in both cases, since I haven't done any special-casing for the literals (which are, to a first approximation, all xsd:strings). Since the datatype array precedes the language tag array, language tags can only be emitted for langString / plainLiteral.

The long run of empty strings is eaten by the compression, but there's a little bit of parse time efficiency to be gained by not writing them in the first place.
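
A sketch of the special-casing meant here (my assumption about how it could be encoded, not the current format): only literals typed as rdf:langString / rdf:PlainLiteral carry a language tag slot, so all other datatypes skip it entirely.

    import java.io.DataOutputStream;
    import java.io.IOException;

    final class LangTagWriter {
        static final String RDF_LANG_STRING =
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString";
        static final String RDF_PLAIN_LITERAL =
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral";

        // The reader has already read the datatype, so it knows whether a
        // language tag slot follows at all.
        static void writeLang(DataOutputStream out, String datatypeIri,
                              String lang) throws IOException {
            if (RDF_LANG_STRING.equals(datatypeIri)
                    || RDF_PLAIN_LITERAL.equals(datatypeIri)) {
                out.writeUTF(lang == null ? "" : lang);
            }
        }
    }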

The datatype IRIs can be used to enable other datatype-specific handling.

(The long run of zero deltas for all the IRIs also compresses well. I haven't changed the delta base encoding for IRIs yet, but it wouldn't really save much in this case. Variable-length ints could save ~0.5MB for these IRIs, but there may be issues.)

sesuncedu commented 9 years ago

Encoding langString only for plain literals:

| table encoding | uncompressed size | xz size |
|---|---|---|
| before | 23.9 MiB | 1,984.9 KiB |
| lang for plain only | 23.3 MiB | 1,984.2 KiB |

2.5% saving for uncompressed, 0.035% saving for compressed...

sesuncedu commented 9 years ago

If IRIs are assigned indexes in the order in which they will be encountered when processing (with Declaration axioms handled last, as otherwise the axiom sort imposes an IRI sort), the compressed size increases to 2,120.2 KiB.

Omitting the declaration axioms entirely barely changes this (2,104.7 KiB).

(Improperly) writing the IRI table in sorted order, but using the in-order numbering changes the size to 2,092.6 KiB, indicating that the table compression is not the real issue.

Switching to the IRI base encoding scheme used for literals does not help in-order numbering much (2,116.9 KiB).

With sorted IRIs (and correct numbering), switching the encoding scheme gives sizes of 23.2 MiB uncompressed, and 1,970.2 KiB compressed.

sesuncedu commented 9 years ago

Using an SOA layout for the IRI table doesn't make any real difference, though there are some potential savings in the uncompressed size if IRIs are grouped by start/prefix.

OBO-derived ontologies typically have so little entropy in their IRIs that most compressors can handle them easily (though I've wondered about whether some kind of reverse sprintf matching could make some difference - especially with deltas).

sesuncedu commented 9 years ago

So basically we have a fairly easy path to saving about 16% off the uncompressed size, and 53% off the compressed size.

Since xz decompression time is almost linear in the compressed size, this gives a savings in decompression time of ~47% (0.440s vs. 0.242s).

Since decompression can be done in parallel with loading, and loading times are dominated by indexing times, this isn't really a huge performance win, though if network bandwidth is an issue, the savings may be enough to make a difference.

The rejiggered output format shows improvements for smaller block and dictionary sizes as well. Gzip sizes are about 61% better.

Reduced block sizes with xz also degrade more gracefully:

- 256K blocks: 2,554.2 KiB vs. 7,296.3 KiB (0.35 : 1)
- 1M blocks: 2,280.9 KiB vs. 6,328.2 KiB (0.36 : 1)
- 4M blocks: 2,106.4 KiB vs. 5,521.5 KiB (0.38 : 1)
- 16M blocks: 1,987.8 KiB vs. 4,847.8 KiB (0.41 : 1)
- 32M blocks: 1,973.7 KiB vs. 4,534.7 KiB (0.43 : 1)

Note: xz allows for random access by block, but only if the index at the end of the file is readable. Of course, delta tables need to be reset or saved at each block, and blocks need to start at an axiom boundary.

This only really matters for parallel parsing, or if on-demand loading is wanted. That requires persisted indexes, and may also require a different output order; e.g. one might want to group all annotation assertion axioms together in property-major order, so that rdfs:labels are grouped together.

This might work particularly well for WebProtégé.

sesuncedu commented 9 years ago

A few more notes:

Sorting SubClassOf axioms by superclass then subclass, and sorting annotation assertion axioms by literal id, then property, then subject, saves about 3% off the compressed size.

Writing coded IRI deltas as variable length integers, with delta base in the lower bits, and no explicit length word, reduces the uncompressed size by about 5%, but increases the compressed size slightly. This is probably due to alignment issues.

sesuncedu commented 9 years ago

Some more notes: for the Gene Ontology, there are about 16 MiB of literal strings.
When emitted in a roughly sorted order, the compressed size is 1,476.2 KiB. Again, this is just the literal strings. When emitted as encountered in the alternate order listed below, the compressed size is 1,779.5 KiB.

There are about 442.4 KiB worth of IRI fragments/remainders. When emitted in sorted order (within prefix), the compressed size is 15.7 KiB. When emitted in a non-lexicographically sorted order, the compressed size is 53.0 KiB.

The alternate ordering is described in the next comment.

sesuncedu commented 9 years ago

It seems possible that generating the IRI and literal tables, and then writing the axioms, using an ordering in which unseen entities used in the RHS of a subclass/equivalence axiom are processed before those used in the LHS, may give good results for ontologies that have a lot of precoordinated strings, where phrases that make up parts of the literals associated with a class may be associated with the entities that make up part of the superclass / defining class expression.

These strings may not be close lexicographically.

OBO ontologies seem to use this pattern a lot; e.g., in the example below, many of the strings associated with the first class are used with the second class, but with the prefix "negative regulation of".

How common this style of precoordination is in other large ontologies is the interesting question.

Declaration(Class(obo:GO_0097200))
Declaration(Class(obo:GO_2001271))
# Class: obo:GO_0097200 (cysteine-type endopeptidase activity involved in execution phase of apoptosis)
AnnotationAssertion(rdfs:label obo:GO_0097200 "cysteine-type endopeptidase activity involved in execution phase of apoptosis"^^xsd:string)
AnnotationAssertion(oboInOwl:hasNarrowSynonym obo:GO_0097200 "effector caspase activity"^^xsd:string)
AnnotationAssertion(oboInOwl:hasNarrowSynonym obo:GO_0097200 "executioner caspase activity"^^xsd:string)
AnnotationAssertion(oboInOwl:hasOBONamespace obo:GO_0097200 "molecular_function"^^xsd:string)
AnnotationAssertion(oboInOwl:id obo:GO_0097200 "GO:0097200"^^xsd:string)
EquivalentClasses(obo:GO_0097200 ObjectIntersectionOf(obo:GO_0004197 ObjectSomeValuesFrom(obo:BFO_0000050 obo:GO_0097194)))
SubClassOf(obo:GO_0097200 obo:GO_0097153)
SubClassOf(obo:GO_0097200 ObjectSomeValuesFrom(obo:BFO_0000050 obo:GO_0097194))
EquivalentClasses(obo:GO_2001270 ObjectIntersectionOf(obo:GO_0065007 ObjectSomeValuesFrom(obo:RO_0002211 obo:GO_0097200)))
SubClassOf(obo:GO_2001270 ObjectSomeValuesFrom(obo:RO_0002211 obo:GO_0097200))

# Class: obo:GO_2001271 (negative regulation of cysteine-type endopeptidase activity involved in execution phase of apoptosis)
AnnotationAssertion(Annotation(oboInOwl:hasDbXref "GOC:mtg_apoptosis"^^xsd:string) obo:IAO_0000115 obo:GO_2001271 "Any process that stops, prevents or reduces the frequency, rate or extent of cysteine-type endopeptidase activity involved in execution phase of apoptosis."^^xsd:string)
AnnotationAssertion(oboInOwl:created_by obo:GO_2001271 "pr"^^xsd:string)
AnnotationAssertion(oboInOwl:creation_date obo:GO_2001271 "2011-12-13T07:59:42Z"^^xsd:string)
AnnotationAssertion(Annotation(oboInOwl:hasDbXref "GOC:obol"^^xsd:string) oboInOwl:hasNarrowSynonym obo:GO_2001271 "negative regulation of effector caspase activity"^^xsd:string)
AnnotationAssertion(oboInOwl:hasOBONamespace obo:GO_2001271 "biological_process"^^xsd:string)
AnnotationAssertion(oboInOwl:id obo:GO_2001271 "GO:2001271"^^xsd:string)
AnnotationAssertion(rdfs:label obo:GO_2001271 "negative regulation of cysteine-type endopeptidase activity involved in execution phase of apoptosis"^^xsd:string)
EquivalentClasses(obo:GO_2001271 ObjectIntersectionOf(obo:GO_0065007 ObjectSomeValuesFrom(obo:RO_0002212 obo:GO_0097200)))
SubClassOf(obo:GO_2001271 obo:GO_0043154)
SubClassOf(obo:GO_2001271 obo:GO_2001270)
SubClassOf(obo:GO_2001271 ObjectSomeValuesFrom(obo:RO_0002212 obo:GO_0097200))
EquivalentClasses(obo:GO_2001272 ObjectIntersectionOf(obo:GO_0065007 ObjectSomeValuesFrom(obo:RO_0002213 obo:GO_0097200)))
SubClassOf(obo:GO_2001272 ObjectSomeValuesFrom(obo:RO_0002213 obo:GO_0097200))

sesuncedu commented 9 years ago

Using the quasi-hierarchical ordering has virtually no effect on the compressed size of the literal table, but the improved correlation of entity references makes a major difference to the overall compressed size.

The overall win in compressed sizes seems to be about 8-10%.

The results may be different for different orders of walking the tree.
Axioms are still output grouped by axiom type, rather than being grouped by entity. For Annotation Assertion heavy ontologies this doesn't make quite as much difference, since these are now being sorted in subject-major order.

Also, I am using two delta history tables; one for IRIs, and one for literals. This loses some context compared to using separate tables for different uses (e.g. a table for the primary entity, a table for annotation properties, a table for literals in OWLAnnotations, etc.).

This is slightly mitigated by the replacement policy in the delta history tables; if the delta for the closest match is != 0, but falls between configurable lower and upper bounds, the lookup is assumed to be part of a reference stream, and the closest entry is replaced.
Otherwise, the least recently used entry is replaced.

Allowing negative deltas to use replace-closest reduces compression; probably due to interference with filling the cache with frequently used entities. Separate delta history tables for properties may help address this, since negative deltas may signal a stream being reset (and the delta + stream id may be more predictable).
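
A sketch of that replacement policy (bounds and names are illustrative, not taken from the code):

    // Illustrative update step for a delta history table, following the
    // policy described above; LOWER/UPPER stand in for the configurable bounds.
    final class ReplacementPolicy {
        static final long LOWER = 1;
        static final long UPPER = 64;

        static void update(long[] table, int closest, long delta, long id) {
            if (delta == 0) {
                // Exact hit: move the entry to the front (most recently used).
                System.arraycopy(table, 0, table, 1, closest);
                table[0] = id;
            } else if (delta >= LOWER && delta <= UPPER) {
                // Small positive delta: assume a reference stream and
                // replace the closest entry in place.
                table[closest] = id;
            } else {
                // Otherwise evict the least recently used entry (last slot).
                System.arraycopy(table, 0, table, 1, table.length - 1);
                table[0] = id;
            }
        }
    }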

sesuncedu commented 9 years ago

But I think I need to stash my changes, and start applying them cleanly, with readers and writers matched, and in a way that is easier to control with automatic benchmarking.

I also need to create a separate version with just the default axiom sorting, which has no backwards compatibility issues, and gives a worthwhile win on its own (4.4M -> 2.8M).