Closed by GoogleCodeExporter 9 years ago
@Sebastian: I see two possible solutions:
* implement the map based on the CQL map datatype [1] //maybe this is the
easiest way to go
* implement the map using the bidirectional DAO and make sure that it works
like this:
** use two columns
key: ID -> col: RDF value
** don't store RDF values as column name, but instead as actual column value
** use a secondary index on "RDF value" column ...
[1] http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_map_t.html
Original comment by andreas.josef.wagner
on 22 Apr 2014 at 12:07
On second thought, I see two TODOs:
@Andrea: could we use a hard size limit for using the PersistentValueDictionary
[2]? Given a Cassandra storage, this limit must be <= 64KB
@Sebastian: could you make sure that the Index class [1] works with RDF values
>= 64 KB? See the above points ...
I think the bugfix should be rather easy ... however, the bug is critical ;)
[1] https://code.google.com/p/cumulusrdf/source/browse/branches/1.1.0/cumulusrdf-core/src/main/java/edu/kit/aifb/cumulus/util/hector/Index.java
[2] https://code.google.com/p/cumulusrdf/source/browse/branches/1.1.0/cumulusrdf-core/src/main/java/edu/kit/aifb/cumulus/store/dict/impl/value/PersistentValueDictionary.java
Original comment by andreas.josef.wagner
on 22 Apr 2014 at 12:20
Hi Andreas,
I will read your post and answer more carefully later... what you say
about the persistent value dictionary is right... however, it depends on
the composition of the decorator chain in use: what I called the default
dictionary works exactly like that
Best
Andrea
Original comment by a.gazzarini@gmail.com
on 22 Apr 2014 at 6:06
We could add a client-side compression at the end of the chain, either as
an additional dictionary or by directly changing the
PersistentValueDictionary... however, I am not sure it would cover all
scenarios (thinking about full texts which are several MBs)
Original comment by a.gazzarini@gmail.com
on 22 Apr 2014 at 7:17
Hi Andrea,
we'd need to make sure that a very last/fixed decorator is always added
to a dictionary chain. This last decorator enforces that a persistent
value dictionary is used if the byte size exceeds a storage-specific
size (e.g., 64 KByte).
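Such an enforcing decorator might look roughly like this (a minimal sketch: the single-method interface and the class names are made up for illustration, they are not the actual CumulusRDF dictionary API, and the 64KB threshold is the Cassandra-specific limit discussed above):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical single-method view of a dictionary; the real CumulusRDF
// dictionary interface has more operations.
interface ValueDictionary {
    byte[] getID(String value);
}

// A final decorator that is always appended to the chain: values whose
// encoded byte size exceeds the storage-specific threshold (64KB for
// Cassandra) are rerouted to the persistent value dictionary; everything
// else follows the regular, user-composed decoratee chain.
class SizeThresholdDecorator implements ValueDictionary {

    private static final int THRESHOLD = 64 * 1024; // Cassandra column limit

    private final ValueDictionary decoratee;      // the user-composed chain
    private final ValueDictionary persistentDict; // stands in for PersistentValueDictionary

    SizeThresholdDecorator(final ValueDictionary decoratee, final ValueDictionary persistentDict) {
        this.decoratee = decoratee;
        this.persistentDict = persistentDict;
    }

    @Override
    public byte[] getID(final String value) {
        final int size = value.getBytes(StandardCharsets.UTF_8).length;
        return size > THRESHOLD ? persistentDict.getID(value) : decoratee.getID(value);
    }
}
```

Because the routing is hardcoded into a decorator that the chain builder always appends last, no user-composed chain can bypass it.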
What do you think?
Original comment by andreas.josef.wagner
on 22 Apr 2014 at 9:58
IMO we should think about how to enforce this behaviour without breaking
the dynamic inheritance offered by the current design. Just relying on
documentation hints could be too weak.
Maybe this could be a fixed rule set in a superclass (e.g. all values with
a size > 64k use the PersistentValueDictionary; in all other cases the
path depends on the decorator chain).
But this is just the first thing that comes to my mind... I need to read
something more about that limit on Cassandra columns... in addition, we
should also think with modular storage in mind... maybe this is a
Cassandra-specific limit, so the CumulusRDF code should take this into
account.
Best,
Andrea
Original comment by a.gazzarini@gmail.com
on 22 Apr 2014 at 2:46
Hi Andreas,
What about a dictionary that uses hashes, as suggested in [1], but only in
case the value acts as a key?
So for example if we have
"A long literals" with id [17][6][23]
The index class will be managing two maps
[17][6][23] --> "A long literals"
And
<HASH OF "A long literals"> --> [17][6][23]
Make sense?
[1] ... but some users with very large "natural" keys use their hashes
instead to cut down the size.
Original comment by a.gazzarini@gmail.com
on 23 Apr 2014 at 11:09
Hi Andrea,
yes :) I completely agree ... this is what I meant by: "This problem is
"a bit" solved by the fact that we use dictionary encoding. That is, we
don't store the actual RDF values in our SPO indexes - instead we stored
a dictionary encoding of the values."
The problem is that this: <HASH OF "A long literals"> --> [17][6][23]
won't work, because <HASH OF "A long literals"> is not a unique key.
I'm afraid we have to have a mapping like this: "A long literals" -->
[17][6][23]. For this, we have to make sure that the key ("A long
literals") can be stored even if it is >= 64KByte.
I discussed this with Sebastian - I think the way to go is:
(1) Create a map: [17][6][23] --> "A long literals" and store "A long
literals" as value in the column
(2) Have secondary index on the values, e.g., "A long literals". This
secondary index allows the look-up "A long literals" --> [17][6][23].
What do you think?
Kind regards :)
Andreas
Original comment by andreas.josef.wagner
on 24 Apr 2014 at 12:47
To expand this:
The util/hector/CassandraHectorMap has already implemented the feature we need.
Currently, two maps are used for the Index. But the map has a bidirectional
flag, which creates a secondary index on the value, allowing reverse lookups.
Using a bidirectional map to map id -> value, we wouldn't have any problem
with overly long literals.
Have a nice day
Sebastian
Original comment by Isib...@gmail.com
on 24 Apr 2014 at 1:51
Hi guys,
we have a problem here.
Using one table with a key -> value mapping and a secondary index on the
value, we can insert large values without problems. But we cannot do a
value -> key lookup, because the SELECT query on this secondary index
doesn't like large values. This [1] is the error we get when trying to do a
value -> key lookup.
What we could do now is hash the values ourselves and then implement a
persistent hash table with linear probing, or with Cassandra sets, to
handle collisions. This could be implemented as another dictionary layer.
What do you guys think?
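The hashing idea could be sketched like this (class and method names are hypothetical, and a plain in-memory HashMap stands in for what would really be a Cassandra table or set column; the point is only the lookup scheme, not the storage):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hash-based value -> id lookup: instead of a secondary index on the
// (possibly huge) value, we index a fixed-size digest of the value and
// keep a bucket per digest to resolve hash collisions.
class HashedValueIndex {

    // one entry in a collision bucket
    static final class Entry {
        final String value;
        final byte[] id;
        Entry(final String value, final byte[] id) { this.value = value; this.id = id; }
    }

    // digest -> bucket; in-memory stand-in for a Cassandra table
    private final Map<String, List<Entry>> buckets = new HashMap<>();

    static String digest(final String value) {
        try {
            final byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(d); // 44 chars, far below the 64KB index limit
        } catch (final NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void put(final String value, final byte[] id) {
        buckets.computeIfAbsent(digest(value), k -> new ArrayList<>()).add(new Entry(value, id));
    }

    byte[] getId(final String value) {
        final List<Entry> bucket = buckets.get(digest(value));
        if (bucket == null) {
            return null;
        }
        for (final Entry entry : bucket) {
            if (entry.value.equals(value)) { // collisions are resolved by comparing the full value
                return entry.id;
            }
        }
        return null;
    }
}
```

Only the fixed-size digest ever acts as a lookup key, so the 64KB restriction on index expression values no longer applies, however large the literal is.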
[1]:
com.datastax.driver.core.exceptions.InvalidQueryException: Index expression
values may not be larger than 64K
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172)
at com.datastax.driver.core.SessionManager.execute(SessionManager.java:91)
at edu.kit.aifb.cumulus.util.hector.CassandraHectorSimpleDao.getKey(CassandraHectorSimpleDao.java:125)
at edu.kit.aifb.cumulus.util.hector.CassandraHectorMap.getKeyQuick(CassandraHectorMap.java:212)
at edu.kit.aifb.cumulus.util.hector.CassandraHectorMap.containsValue(CassandraHectorMap.java:157)
at edu.kit.aifb.cumulus.CassandraHectorMapTest.largeValueTest(CassandraHectorMapTest.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Index
expression values may not be larger than 64K
at com.datastax.driver.core.Responses$Error.asException(Responses.java:96)
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:108)
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:228)
at com.datastax.driver.core.RequestHandler.onSet(RequestHandler.java:354)
at com.datastax.driver.core.Connection$Dispatcher.messageReceived(Connection.java:571)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Original comment by Isib...@gmail.com
on 30 Apr 2014 at 8:41
Hi Sebastian,
sorry in advance, maybe I didn't get the point: if the issue is strictly
related to the PersistentValueDictionary (which is using the map to
persist ids/values), then that double-hash solution (+1 from me) should be
hardcoded there. If, by "dictionary layer", you mean another decoratee,
then you could inject this dictionary into the PersistentValueDictionary,
but at the same time you, as a user, could avoid doing that, therefore
leaving the dictionary chains open to that bug.
Does that make sense?
Original comment by a.gazzarini@gmail.com
on 30 Apr 2014 at 9:03
Yes, that makes sense.
The bug is caused by our current CassandraMap implementation not supporting
values > 64KB; it has nothing to do with the dictionary. So I guess it
would be better to create a LargeValueCassandraMap or something like that,
which can then be used by the dictionary.
Original comment by Isib...@gmail.com
on 30 Apr 2014 at 10:15
Why don't we put that logic in the existing map? Is that possible?
Original comment by a.gazzarini@gmail.com
on 30 Apr 2014 at 10:17
Yes that would be better I guess ;)
But we could use the existing map to build the new map. The existing map
can store large values without problems if used as a unidirectional map.
So for the new map we could use one key -> value_id bidirectional map, and
one value_id -> value unidirectional map.
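That composition could be sketched like this (names are illustrative, and plain HashMaps stand in for the two Cassandra-backed maps; the key point is that value_id is a small fixed-size hash, so only it, never the large value, ends up on the secondary-index side):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// A "large value" map built from two simpler maps:
//  * key <-> value_id, bidirectional (the reverse direction plays the role
//    of the secondary index, but only on the small value_id)
//  * value_id -> value, unidirectional (may hold values far above 64KB)
// Hash collisions between distinct values are ignored here for brevity;
// the bucket approach discussed earlier in the thread would handle them.
class LargeValueMap {

    private final Map<String, String> keyToValueId = new HashMap<>();   // forward half of the bidirectional map
    private final Map<String, String> valueIdToKey = new HashMap<>();   // reverse half ("secondary index")
    private final Map<String, String> valueIdToValue = new HashMap<>(); // unidirectional large-value map

    static String valueId(final String value) {
        try {
            final byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(d); // small, fixed-size id
        } catch (final NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void put(final String key, final String value) {
        final String id = valueId(value);
        keyToValueId.put(key, id);
        valueIdToKey.put(id, key);
        valueIdToValue.put(id, value);
    }

    String get(final String key) { // key -> value
        final String id = keyToValueId.get(key);
        return id == null ? null : valueIdToValue.get(id);
    }

    String getKey(final String value) { // value -> key, via the small value_id
        return valueIdToKey.get(valueId(value));
    }
}
```
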
What do you think about that?
Original comment by Isib...@gmail.com
on 30 Apr 2014 at 10:25
I think I'm lost ;)
I think we should have one map, so my question is: can we change the
existing map in order to
- behave like now if values are < 64k
- behave differently in case of > 64k
In other words, I think the map interface should remain the same; the
implementation could change as we want. At the extreme we could also
change the interface, but as far as I understood, the map client should be
unaware of this kind of switch between behaviours... is that right?
Best,
Andrea
Original comment by a.gazzarini@gmail.com
on 30 Apr 2014 at 10:44
Sorry for being unclear :)
I mean:
* The map interface stays the same.
* The current map implementation stays the same.
* We create a new map implementation that supports bidirectionality with values
>64KB. We use two instances of old map implementation internally to make things
easier.
The map client will just have to change new OldCassandraHashMap() (or
however it was called ;) ) to new NewCassandraHashMap(). Is that
understandable?
Original comment by Isib...@gmail.com
on 30 Apr 2014 at 10:49
No no, it's my fault don't worry. I think we are saying the same thing.
What I'm trying to say is : ok, I got your point, but, you say:
> The map client will just have to change new OldCassandraHashMap() (or
however it was called ;) ) to new NewCassandraHashMap(). Is understandable?
Why do we have to retain the "Old" implementation? Is it not possible to
have only one CassandraMap?
Sorry for the confusion.
Best,
Andrea
Original comment by a.gazzarini@gmail.com
on 30 Apr 2014 at 10:57
Well, we could also just change the old map.
Original comment by Isib...@gmail.com
on 30 Apr 2014 at 11:08
Just change the old implementation. The distinction between "old" and
"new" map only makes things complicated ;)
Thanks :)
Andreas
Original comment by andreas.josef.wagner
on 30 Apr 2014 at 2:14
Fixed in r1220
Original comment by Isib...@gmail.com
on 8 May 2014 at 1:25
Thanks :) Could you add 1-2 dedicated JUnit tests, which check add/remove
of large literals > 64KByte?
Thanks for your cool work
Andreas
Original comment by andreas.josef.wagner
on 8 May 2014 at 1:41
I already added one for adding, but I forgot the one for removal. Thanks for
the reminder ;)
Original comment by Isib...@gmail.com
on 8 May 2014 at 1:42
Original comment by andreas.josef.wagner
on 8 May 2014 at 2:44
Original issue reported on code.google.com by
andreas.josef.wagner
on 21 Apr 2014 at 11:57