Closed GoogleCodeExporter closed 8 years ago
Some pointers:
* http://www.w3.org/Submission/2011/SUBM-HDT-20110330/
* https://code.google.com/p/hdt-it/
=> see also [2]
Some papers:
* Compression of RDF Dictionaries [1]
* Binary RDF for scalable publishing, exchanging and consumption in the web of
data [2]
* TripleBit: a Fast and Compact System for Large Scale RDF Data, Sect. 3.1, [3]
[1] http://dataweb.infor.uva.es/wp-content/uploads/2011/10/sac2012.pdf
[2] http://dataweb.infor.uva.es/wp-content/uploads/2012/03/WWW12-copy.pdf
[3] http://www.vldb.org/pvldb/vol6/p517-yuan.pdf
Original comment by andreas.josef.wagner
on 19 Mar 2014 at 5:34
Hm ... [3] seems to use the same dictionary approach as "Scalable join
processing on very large RDF graphs" [4]
- Andreas
[4] http://www.dblab.ntua.gr/~gtsat/collection/RDF%20stores/ScalableJoinsRDF.pdf
Original comment by andreas.josef.wagner
on 19 Mar 2014 at 5:47
Hi guys,
I did some commits on top of issue #19. Other than some minor optimizations
about
- checkstyle
- cobertura (additional test cases)
- instance reusing
- ISharedConstants.INCLUDE_ALL_COMPOSITE_LOWER_BOUND /
INCLUDE_ALL_COMPOSITE_HIGHER_BOUND: previously for each query two new
Composite() instances were created. Since these are used only as
"readonly"
criteria for determining bounds, now we have just 2 instances reused
everywhere. Of course they cannot be used in case you need to chnage the
state of the composites (like for example adding components).
- SliceFilter: basically the same consideration; i removed the local
instances created each time a query is executed (TripleStore,
QuadStore and
CAndPCSlicesQueryIterator) in favour of shared instances so also here we
have a lot of gain in terms of heap (young generation)
The most important news about dictionary are
1) Renaming
According with Andreas,
- SimpleCassandraMapDictionary now is ReadOptimizedDictionary
- InMemoryDictionary is WriteOptimizedDictionary
2) Dictionary interface, additional methods
For some column family, row key is composite, that is, is composed by two
or three identifiers (byte[]). The rules for encoding and decoding these
composite identifiers were previously in different points in the code. So
for example some iterator had hard-coded something like
byte [] predicate = new byte[17];
Now, after introducing the second dictionary, those classes no longer
worked anymore because, for example, WriteOptimizedDictionary uses
variable-length identifiers (ReadOptimizedDictionary uses fixed length
identifiers). So, at the end, the Dictionary interface includes now all
methods for composing and decomposing those composite identifiers.
I thing now the interface is more consistent because it includes ALL what
CumulusRDF expects from a dictionary in term of services.
3) Class hierarchy
Base classes are in package *edu.kit.aifb.cumulus.store.dict*
- INodeDictionary: dictionary interfaces; here I moved all methods
needed from a client perspective. Previously some of those methods were in
NodeDictionaryBase so clients (e.g. Store classes) had to use
NodeDictionaryBase instead of INodeDictionary.
- NodeDictionaryBase: abstract superlayer type containing shared and
common behaviour between all concrete dictionaries*;*
- NodeDictionaryWithCaching: extends NodeDictionaryBase and adds ids /
values caching*;*
Concrete classes are in package *edu.kit.aifb.cumulus.store.dict.impl*
- ReadOptimizedDictionary: previous SimpleCassandraMapDictionary. This
class is still in progress because things discussed in this thread;
- WriteOprimizedDictionary: a (dummy) dictionary using string encoding
for generating identifiers. TODO: long literals handling as discussed in
this thread.
4) Configurability
Dictionary can be injected when the store instance is created. It defaults
to ReadptmizedDictionary
Best,
Andrea
2014-03-19 18:47 GMT+01:00 <cumulusrdf@googlecode.com>:
Original comment by a.gazzarini@gmail.com
on 23 Mar 2014 at 6:34
Pointer for the read-optimized dictionary:
(1) As Andrea (in some previous email) already suggested, I also think we
should use a layer of "approximate membership queries" [2] as a form of filter
over the storage (cassandra). This way, we should be able to avoid many
unnecessary cassandra probes. For the approximate membership queries we could
either use the traditional BloomFilters (e.g., the BloomFilters in the Google
Guave project [4]) or something more exotic like the Quotient filter [3]. For
the latter, I found an implementation in [5]. However, BloomFilters are
probably more "production-ready" than e.g. [5]. The only thing is, we have to
de/serialize the BloomFilters in case the store closes ... Otherwise we can't
reuse the BloomFilters properly. This is unfortunately a bit messy ...
(2) We should be a bit more careful with using putQuickBuffered-method in
edu.kit.aifb.cumulus.util.hector.Index. If the store shuts down (for whatever
reason), the memory-based caches will be lost. Therefore the data encoded with
these IDs is not usable any more. So, having a "large" caches comes with a
great risk.
Overall, I think we should rather spend some more memory on the "approximate
membership queries" (e.g., bloom filters), than on the caches. Maybe don't use
caches at all for the dictionaries (this would be the safest choice) ...
Kind regards
Andreas
[1] https://code.google.com/p/cumulusrdf/issues/detail?id=19
[2] http://en.wikipedia.org/wiki/File:Bloom_filter_speed.svg
[3] http://en.wikipedia.org/wiki/Quotient_filter
[4] http://code.google.com/p/guava-libraries/wiki/HashingExplained#BloomFilter
[5] https://github.com/Vyzen/trout
Original comment by andreas.josef.wagner
on 31 Mar 2014 at 10:15
Sorry I forgot: this issue has been fixed in 1.1.0
Original comment by a.gazzarini@gmail.com
on 30 Apr 2014 at 3:52
Original issue reported on code.google.com by
andreas.josef.wagner
on 22 Jan 2014 at 8:32