pulquero / Halyard

Halyard-star is an extremely horizontally scalable triplestore with support for named graphs, designed for integrating extremely large semantic data models and for storing and querying (SPARQL 1.1) snapshots of the whole Linked Data universe.
https://pulquero.github.io/Halyard
Apache License 2.0

Halyard #3

Closed · nguyenm100 closed this 2 years ago

nguyenm100 commented 2 years ago

Hi, apologies for coming from left field, but I noticed you may have been playing around with Halyard and wondered if you're still using it and, if not, why not. I'm evaluating it as well and it looks pretty decent so far, so I'm looking for anything others may have encountered that could be considered a show-stopper. Thanks!

pulquero commented 2 years ago

Hi there, I'm not using it currently, as I'm not working with large volumes of data at the moment. But it served me well when I was. I would recommend trying my fork out, as I fixed a few bugs in the original, did a big clean-up of the code, updated dependencies, and made some performance improvements.

I don't think there are any hard stoppers; I would have addressed them. If there is something to watch for, I would say performance. Not that it is slow or fast, but the performance characteristics may be different from what you are used to. If you are doing OLAP-type stuff you probably don't care; I actually used it for some real-time OLTP-type stuff. Are you familiar with HBase? That might be the other gotcha. How much experience you need with HBase to get the most out of it will depend on what you are using it for.

I'm happy to look at issues/accept PRs if further improvements need to be made. For starters, I ought to bump the version of rdf4j.

nguyenm100 commented 2 years ago

Oh, excellent regarding your performance improvements. I will take a look at your changes. I did upgrade his version to the latest hbase/hadoop 2.x libs (I actually missed that he had released 3.2 and started with 2.5 in master). Did you find this unusable for OLTP stuff? I'm looking for a scalable 3store and this looked to fit the bill.

pulquero commented 2 years ago

It was entirely usable. With my changes, plus a caching layer, I achieved the required throughput. I think it was something like 5-10 queries within a couple of seconds. So, I think you'll find it fine.

nguyenm100 commented 2 years ago

thanks so much!

nguyenm100 commented 2 years ago

btw- you all might be interested in this convo if you're not there already: https://phabricator.wikimedia.org/T206560

pulquero commented 2 years ago

Interesting, I wasn't aware. IMO, Halyard is the only feasible open-source triple store for big data. I had to wait a year for something like it to come along. I think the only alternative would be something like the now-defunct CumulusRDF on Cassandra. BTW, it would be entirely possible to refactor Halyard to support different storage backends that use ordered keyspaces.

I'm invested in keeping Halyard at least ticking over, ready for when I might need it again.

nguyenm100 commented 2 years ago

Thanks for the feedback, Mark. Maybe you could contribute to some of the discussions over there, since you seem to have good breadth on some of these products. Quick question: did you accumulate any tips on debugging slow queries on Halyard that you could share? Are you just walking through with the debugger? Did the halyard profile command help you much?

pulquero commented 2 years ago

Original or my fork? Check for hotspotting on the HBase nodes. I made changes to the key distribution to help balance it out; the original was quite bad at that. Check for thread deadlocking, the push strategy can be quite thread intensive. It shouldn't happen on my fork, there's explicit code to handle it. For debugging, I added toString to a lot of classes to make life easier.


nguyenm100 commented 2 years ago

I think it could be something in the query (StarJoin) optimizer. I say that b/c I'm comparing a query against the same dataset on the original and your fork: the original takes about a minute and your fork runs beyond 5 minutes (I kill it before it finishes). I'm using profile to compare the optimized queries and the statements are ordered differently. Will dig into it and let you know what I find. tx

pulquero commented 2 years ago

The star join is meant to efficiently evaluate patterns of the form `?s p ?x; q ?y; u ?z` (a star around `?s`). I suspect it has mopped up an ill-suited pattern. You don't by chance happen to have a predicate var?


nguyenm100 commented 2 years ago

Hi, I removed the StarJoin optimizer and the query then optimized the same way as the original implementation. Will need to dig in to see what's going on. Had a few questions:

  1. Should the sjo consider cardinality when optimizing? Or is there somewhere I can read to help me better understand the algo?
  2. I see a HalyardConstantOptimizer but couldn't see any difference from the vanilla ConstantOptimizer. Am I overlooking something, or is there an intent of future optimization here?

Thanks!

pulquero commented 2 years ago
  1. the stmt patterns in an sjo should be re-ordered here: https://github.com/pulquero/Halyard/blob/aa99f998c69c4964bc4d4a79361b37ffcebbc9bd/strategy/src/main/java/com/msd/gin/halyard/optimizers/HalyardQueryJoinOptimizer.java#L96

  2. Not entirely sure. I might have had something planned. It might come back to me.

nguyenm100 commented 2 years ago

should we be checking for null getContextVar() here before adding? https://github.com/pulquero/Halyard/blob/aa99f998c69c4964bc4d4a79361b37ffcebbc9bd/strategy/src/main/java/com/msd/gin/halyard/optimizers/StarJoinOptimizer.java#L68

Are starjoins only valid in the presence of a context?

pulquero commented 2 years ago

null is considered a context (the default graph). So all stmts in the default graph should be grouped together, and likewise all stmts in a named graph.

nguyenm100 commented 2 years ago

Hey Mark, I'm still debugging the JoinOptimizer. It looks like it could be in rdf4j at this point but I'm still not quite sure. It appears to be skipping a whole subgraph when optimizing certain queries. Will continue digging.

In the meantime, I may have found a small bug in HBaseSailConnection. https://github.com/pulquero/Halyard/blob/latest/sail/src/main/java/com/msd/gin/halyard/sail/HBaseSailConnection.java#L170 appears to have an extra NOT (!) in there, and there may be some else statements missing w.r.t. calling the optimizers. Can you confirm when you get a sec? tx

pulquero commented 2 years ago

Looks fine, those optimizers should only be applied when not a federated query (ServiceRoot). Can you share your query plan?

nguyenm100 commented 2 years ago

Thanks. I think I misread your comment. I think the bug is in rdf4j (https://github.com/eclipse/rdf4j/issues/3667).

nguyenm100 commented 2 years ago

Hi Mark, dropped you an em at your yahoo account. Please let me know if it was not received. tx

nguyenm100 commented 2 years ago

Heya Mark, wanted to ask a few questions on the impl:

  1. I see the keys are, at most, 24 bytes (sub-8, pred-4, obj-8, ctx-6). Is that just an optimization thing where you can hold/traverse more keys in memory? Any idea why those byte counts were used?
  2. I saw you may have converted Adam's full storage of qualifiers to hashed qualifiers (a 2nd hash in case of primary collision?) and made use of the value. Again, was that an optimization thing?
  3. What's the point of the bit rotation on (some of) the keys? I saw some ref to where Adam was saying it was for lessening hotspots but can't understand why.

Thanks!

pulquero commented 2 years ago

So, if I remember, Adam has a hash as the row key, and the actual triple value as the column qualifier (name). The column value is unused. I felt a bit dissatisfied by that; I felt there must be a better way to make full use of the HBase cell structure. Not to mention that storing each triple fully, three different ways, takes up a lot of disk space. Anyway, it turns out it is quite hard to come up with something different that doesn't have some significant shortcoming. After a few attempts, I settled on what I have now.

As part of the wikidata work, I'm considering adding the option to have an extra ID -> value index to save space by not having to write the column values at the cost of doing the lookup. There is already a limited in-memory version of this for "well-known" IRIs. My current work indicates that Wikidata needs several TBs of space for intermediate files alone to do the load, which might prove to be impractical. Wikidata will also be a good test of the collision resistance of SHA-1.

Right, your questions:

(1) I picked uneven slices based on the cardinality of each part. E.g. for spo index, the cardinality of o given sp is going to be reasonably low so 2 bytes (END_KEY_SIZE) is probably enough to have high confidence we can locate the key we want before applying the column qualifier filter. For osp index, we use 8 bytes (KEY_SIZE) for o as the variation in o is going to be quite high, being in the first position.
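For illustration, a minimal sketch of composing uneven row-key slices from per-component hashes. KEY_SIZE and END_KEY_SIZE are mentioned above; the class, helper names, and SHA-1 slicing here are assumptions, not Halyard's actual code:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class KeySketch {
    static final int KEY_SIZE = 8;     // leading components: high variation
    static final int END_KEY_SIZE = 2; // trailing component: low conditional cardinality

    // Take the first `len` bytes of the SHA-1 hash of a component.
    static byte[] slice(byte[] value, int len) throws NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("SHA-1").digest(value);
        byte[] out = new byte[len];
        System.arraycopy(hash, 0, out, 0, len);
        return out;
    }

    // spo index: 8-byte s, 4-byte p, 2-byte o. The 2-byte o slice only needs
    // to narrow the scan; the column qualifier filter disambiguates the rest.
    static byte[] spoKey(byte[] s, byte[] p, byte[] o) throws NoSuchAlgorithmException {
        return ByteBuffer.allocate(KEY_SIZE + 4 + END_KEY_SIZE)
            .put(slice(s, KEY_SIZE)).put(slice(p, 4)).put(slice(o, END_KEY_SIZE))
            .array();
    }
}
```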

(2) basically, the blurb above.

(3) Ah, yes, the rotations, an additional detail in how the bytes are organised in the row keys. Let's assume no rotations and we have s p o1, s p o2, ..., s p on in the spo index and o1 q t1, o2 q t2, ..., on q tn in the osp index. Now query for s p/q ?t. Start by scanning the spo index, locate s p * on region server A. Hit A, read off o1, etc.

Got o1, now scan the spo index for s=o1, p=q. s=o1 is on region server B, say. Return t1. Get o2 next, now scan the spo index for s=o2, p=q; guess where s=o2 is likely to be? Well, the keyspace is ordered, so if o2 came after o1, there is a good chance it might also be found on region server B. It might not be immediately next to o1, but it is sure to be closer than o3, o4, etc. are. So we end up hitting B again, and maybe for o3, ..., o10 too. Say o11 is on region server C. Now we hit C. Again, o12, ... are likely to be on C too, so we strafe across C, etc. And you can see this happening live in the HBase web UI: effectively sequential strafing of region servers. I remember this coming as quite a disappointment to me when I first saw it happen; so much for distributing load.

Now add rotation and replay: get o2 next, now scan the spo index for s=rotate(o2), p=q. Well, rotate(o2) is now not likely to be sequentially close to rotate(o1); it could be anywhere! We hit some other region server, load is distributed. In the web UI you can see multiple region servers being used simultaneously. Win.
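To make the rotation idea concrete, here is a hedged sketch of a bit rotation over a key slice. It is illustrative only; Halyard's actual rotation scheme may differ:

```java
final class KeyRotation {
    // Rotate the bits of a key slice left by `shift` bits. If two slices sort
    // adjacently, their rotations generally do not, so follow-up lookups
    // scatter across region servers instead of strafing one sequentially.
    static byte[] rotateBitsLeft(byte[] key, int shift) {
        int n = key.length * 8;
        shift = ((shift % n) + n) % n;
        byte[] out = new byte[key.length];
        for (int i = 0; i < n; i++) {
            int j = (i + shift) % n;                      // source bit index
            int bit = (key[j / 8] >> (7 - (j % 8))) & 1;  // read bit j
            out[i / 8] |= bit << (7 - (i % 8));           // set bit i
        }
        return out;
    }
}
```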

Future features/thoughts

HBase supports cell-level security. It should be possible to use this to implement statement-level security, and I think Halyard would be the only triple store to offer such a feature. I've already done something similar with cell-level timestamps: `(?s ?p ?o) halyard:timestamp ?t`. I guess I should look into adding an RDF-star equivalent at some point: `<<?s ?p ?o>> halyard:timestamp ?t`.

nguyenm100 commented 2 years ago

Thanks for this, Mark. Makes sense. Am actually looking at rocksdb as a less heavy-handed approach for smaller datasets. I say small, but it can probably still handle a few billion triples. rdb doesn't support columns but does support k/v & cf and scanning. cf might be handy for separating out spo, pos, osp, etc. instead of prefixing like what Halyard is doing (I wonder if that would be useful for hbase). I'm kicking around some thoughts for keys, since I don't have qualifiers in rdb, including:

  1. Using the entire triple or quad as the key and skipping the value (no chance of collision). Probably could just dictionary the URI prefix as an optimization... though now that I think about it, if you dictionary EVERYTHING but maybe literals, how bad could that be for a few billion triples/quads in practice?
  2. bytes[] = uuid(s) + uuid(p) + uuid(o) + uuid(c) = 64 bytes (see the sketch below). Very, very low chance of collision (should be better than SHA-256) and more than adequate for the expected max size of rocksdb.

I don't believe hotspotting would come into play with rdb, but am not sure, so might be able to skip rotation.
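A minimal sketch of option (2). Java's built-in name-based UUID is v3 (MD5); uuid5 would need an extra dependency, so v3 stands in here as an assumption, and the class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class QuadKey {
    // Concatenate a 16-byte name UUID of each component into a fixed 64-byte key.
    static byte[] key(byte[] s, byte[] p, byte[] o, byte[] c) {
        ByteBuffer buf = ByteBuffer.allocate(64); // 4 components x 16 bytes
        for (byte[] part : new byte[][] { s, p, o, c }) {
            UUID u = UUID.nameUUIDFromBytes(part);
            buf.putLong(u.getMostSignificantBits());
            buf.putLong(u.getLeastSignificantBits());
        }
        return buf.array();
    }
}
```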

pulquero commented 2 years ago

I'll look into your idea of using cf instead of prefixes, should greatly simplify things. As rdb is single-box, a dictionary should work well, and you could wrap it in an LRU cache too. I would check the documentation around key size and performance. I'm going to guess a small key will be most performant, so maybe avoid using the entire triple as-is. Dictionary vs hash: it's going to be a trade between size and speed, and read vs write. Though less size might mean more cached and therefore more speed. Looks like rdb has counters, so you could use it as a dictionary. But you are going to need reverse lookup too. Minimum effort? (2), so maybe I would tend towards (2). I mean, the only difference between (2) and (1) is how ID assignment is done, plus extra storage optimization, so you could write (2) in a way that makes (1) not much extra work if needed.
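For the LRU idea, a minimal sketch using LinkedHashMap's access-order mode; the class name is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, i.e. LRU eviction order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used entry
    }
}
```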

I did have an AbstractHalyardSail that was HBase-independent and assumed no more than an ordered kv-store, but I lost the branch.

pulquero commented 2 years ago

One issue with cf vs prefix is that cf means a shared keyspace across the indices. That means you can't have different region splits for each index, so you could have a region which is very sparse in one column family and dense in another. What you would have to do is have separate tables per index rather than column families, which is something I've considered before, but it's a bit untidy trying to handle multiple tables.

nguyenm100 commented 2 years ago

does uuid5.hash() with different namespaces for spo, pos, etc. work?

pulquero commented 2 years ago

uuid5 is truncated to 128 bits and the namespace is lost in the hashing, so there is no prefix to use in your scans.

nguyenm100 commented 2 years ago

it would work if you uuid-hash each of the s, p, o, c and concat the hashes into a 64-byte key. My (2) from above?

nguyenm100 commented 2 years ago

perhaps I'm misunderstanding the enumerated prefixes. I thought they were used in case a subject hash collided with a predicate hash? If so, dramatically decreasing the chance of collision (you can always sanity-check the returned value while iterating) should work, no?

pulquero commented 2 years ago

Ah, no. The prefixes are used to separate the indices from one another, e.g. 0spo, 1pos, 2osp. So, for any given statement pattern, you decide the best index to use and use the prefix to restrict scanning within the keyspace of that index.

pulquero commented 2 years ago

I mean you are right in a sense, but I wouldn't phrase it that way. Avoiding collisions is necessary but not sufficient. You do need to distinguish between spo and pos, but more than that, you need to also be able to scan all spo or all pos, which means the row keys need to be ordered in the MSB.

nguyenm100 commented 2 years ago

ah ok. I missed the case where you need to scan by permutation. btw- I see you're still updating your repo. I'm guessing (hoping) everything is backwards compat so far? :-)

pulquero commented 2 years ago

No, I'm afraid not, you will need to reload your data :( I've improved key distribution (breaking change), eliminated a lot of buffer copying, and reduced serialization size. Should be much leaner and meaner now :)

nguyenm100 commented 2 years ago

Heya Mark, as I'm going through some more of the code, I see the call to HalyardTableUtils.toKeyValues(). I understand we are creating keys for all permutations of the cspo, but why do we need to also permute the qualifiers & values? couldn't they always remain in correct order?

pulquero commented 2 years ago

I guess they could. I think the consistent ordering might make the code simpler, but apart from that I don't think there is any other benefit to be had.

nguyenm100 commented 2 years ago

Ok thanks. Just sanity checking. Also, how come we have a start and a stop key? Don't we always want to only match prefixes until it doesn't match anymore? Not sure I understand the purpose of a stop-key.

ex:

<xxxxx><yyyyy><aaaaa>
<xxxxx><yyyyy><bbbbb>
<xxxxx><zzzzz><aaaaa>
<yyyyy><zzzzz><aaaaa>

if we're looking for <xxxxx> ?p ?o we would get 3 recs. If we're looking for <xxxxx> <yyyyy> ?o we would get 2 recs, and so forth.

pulquero commented 2 years ago

Fundamentally, HBase scans between a start and a stop key. There is a PrefixFilter which takes a prefix xyz and effectively scans between xyz and the key just past the prefix range (the prefix with its last byte incremented). Let me emphasize that HBase keys are sorted lexicographically, but all Halyard keys are fixed length, which makes things simpler.
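As a hedged sketch of what such a bounded scan looks like with the HBase 2.x client API, assuming fixed-length keys under a one-byte index prefix (the helper and its parameters are illustrative, not Halyard's code):

```java
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;

final class IndexScan {
    static Scan forPrefix(byte indexPrefix, byte[] keyPrefix) {
        byte[] start = new byte[1 + keyPrefix.length];
        start[0] = indexPrefix; // e.g. 0 = spo, 1 = pos, 2 = osp
        System.arraycopy(keyPrefix, 0, start, 1, keyPrefix.length);

        // Stop row: the prefix with its last byte incremented, so the scan
        // covers exactly the keys starting with `start`. (The all-0xFF edge
        // case is ignored here for brevity.)
        byte[] stop = Arrays.copyOf(start, start.length);
        int i = stop.length - 1;
        while (i >= 0 && ++stop[i] == 0) { i--; } // carry on 0xFF overflow
        return new Scan().withStartRow(start).withStopRow(stop);
    }
}
```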

nguyenm100 commented 2 years ago

ok thanks. rocksdb just gives you an iterator. i suppose hbase could prefetch better if it knows the endpoints

nguyenm100 commented 2 years ago

Hey Mark, re: Identifier.value. Looks like value is a 20-byte SHA-1 hash, but value[1] is a type byte. Any reason why we wouldn't make value a 21-byte buffer with value[0]=type & value[1..21]=SHA-1, or just store the type byte in a separate var?

pulquero commented 2 years ago

The type ID only uses 4 bits of a full byte, so it is easier to reduce the hash by 4 bits rather than extend it. Though we aren't really reducing the hash, in the sense that if the type is distinct then the value is distinct. This is a bit like a UUID, which has a deterministic part (e.g. MAC address) and a random part. The type bits can be used to reduce the scan range when we know we are looking for a specific type, see scanLiterals(). If we used value[0], then the ordering would mean all the literals would clump together, which is good when you just want to scan literals, but at the expense of an overall bad key distribution. Using value[1] gives us a salt of 256 values, which should be enough to get a good key distribution, at the expense of having to scan 256 separate literal ranges: a reasonable middle ground. Soon, some of this will be more configurable.
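A sketch of the layout as described, packing the 4-bit type ID into value[1]; which nibble is used (high, here) is an assumption, as is the class name:

```java
import java.util.Arrays;

final class Ids {
    // Put a 4-bit type ID into the high nibble of byte 1, preserving byte 0
    // (a full salt byte for key distribution) and the low nibble of byte 1.
    static byte[] identifier(byte[] sha1, int typeBits) {
        byte[] id = Arrays.copyOf(sha1, sha1.length);
        id[1] = (byte) ((typeBits << 4) | (id[1] & 0x0F));
        return id;
    }
}
```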

pulquero commented 2 years ago

I now have experimental verification that a 64-bit hash is fine for Wikidata.

nguyenm100 commented 2 years ago

I noticed they lowered the priority on Halyard and are really just looking at the usual suspects (rdf4j, jena, virtuoso, qlever). Are you still chatting with them? I've got the key hashing and serializing/deserializing working for rocksdb (at least the tests are passing). Need to wire up the sail layer next before being able to test it standalone. Using the full 20-byte sha-1 so far, but not really looking at optimizing until everything else works. Not rotating the values for each index requires some juggling, but it should save some space if (when?) we start using string tables. Also simplifies some code (but requires an extra step to parse the identifier).

nguyenm100 commented 2 years ago

Hey Marc, is the HBaseSailReificationTest working for you by chance?

pulquero commented 2 years ago

Yes, CI is passing. https://github.com/pulquero/Halyard/runs/5978803130?check_suite_focus=true#step:5:4215

nguyenm100 commented 2 years ago

ok thanks. not sure what's going on with my stuff. it's the last unit test that's failing on my rocksdb impl. will keep at it.

btw- are you going to kgc this year? it's in nyc. if so, maybe we can catch up.

nguyenm100 commented 2 years ago

hey Marc, was debugging on windows and noticed that Hashes.hash16 was causing collisions for the well_known_langs dictionary. I have 750-odd languages on this machine for some reason. Anyways, I switched over to a crc16 algo instead of the Pearson one he's using and it seems to work (https://alvinalexander.com/java/jwarehouse/openjdk-8/jdk/src/share/classes/sun/misc/CRC16.java.shtml)

pulquero commented 2 years ago

Ok, I'll swap it to crc16 then. Crc16 has done quite well then! 1 byte will only hold 256 values, and conventional wisdom is you need ~2x that many bytes' worth of hash to avoid collisions. I also feel I should have a resource file with a fixed set of languages to avoid any platform dependence, but that's not urgent (so it is not forgotten: https://github.com/pulquero/Halyard/issues/9).

What is the particular error you have with HBaseSailReificationTest? You need to be using an evaluation strategy that supports tuple functions.

It would be nice to go to kgc, but I doubt there is the budget to send me.

Currently, it looks like wikidata will require <2TB of disk space, but I'm constrained by the box I'm testing on to ~1TB, so I'm loading what I can. I expect to move on to query performance shortly, then I'll probably publish results back to the wikidata thread.

pulquero commented 2 years ago

Upgraded to a fast lookup-table-based CRC16.
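For reference, a minimal table-driven CRC-16 sketch; the CCITT polynomial 0x1021 and the initial value are assumptions here, not necessarily what was adopted:

```java
final class Crc16 {
    private static final int[] TABLE = new int[256];
    static {
        // Precompute the CRC of every possible high byte (CCITT poly 0x1021).
        for (int i = 0; i < 256; i++) {
            int crc = i << 8;
            for (int j = 0; j < 8; j++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
            }
            TABLE[i] = crc & 0xFFFF;
        }
    }

    static int hash16(byte[] data) {
        int crc = 0xFFFF; // common CCITT initial value (assumption)
        for (byte b : data) {
            crc = ((crc << 8) & 0xFFFF) ^ TABLE[((crc >>> 8) ^ b) & 0xFF];
        }
        return crc;
    }
}
```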

nguyenm100 commented 2 years ago

cool. will take a look at the tabled crc16 version.

re: reif - not sure what's going on. the first 4 selects work and then it starts failing on this one: https://github.com/pulquero/Halyard/blob/latest/sail/src/test/java/com/msd/gin/halyard/sail/HBaseSailReificationTest.java#L75

I've noticed some things fail when not using the halyardevaluationstrategy (more bugs than features), so have patched a few things there.

re: data size. I think you'll save at least 2/3 of the space if you store the values in a string table, b/c things are indexed at least 3x. That said, you'll need to avoid rotating the values like he does with the keys.

re: code - i have things reading/writing to rocksdb so will spend some time integration testing and optimizing a little. I kinda feel like there are some good opportunities to do so after spending some time tracing rdf4j/halyard interaction.

pulquero commented 2 years ago

A string table would incur an extra lookup though. I was thinking about an rdf term table, but that would be three lookups per statement, and it would be unaffected by value rotation. However, I think if I'm clever I wouldn't even need to do the lookups, as most joins can be done using IDs alone. I would only need to materialize the values on demand, e.g. at output or for FILTER. Given other stores take ~1TB, I'm not too worried atm, but there is room to improve if needed.

nguyenm100 commented 2 years ago

Hey Marc, I'm trying to upgrade my stuff to RDF4J 4.x and hitting some test failures in: https://github.com/pulquero/Halyard/blob/latest/strategy/src/test/java/com/msd/gin/halyard/strategy/HalyardSPARQLQueryTest.java#L43

I noticed you may have added entries to the defaultIgnoreTest list (orig: https://github.com/Merck/Halyard/blob/master/strategy/src/test/java/com/msd/gin/halyard/strategy/HalyardSPARQLQueryTest.java#L31) and am curious what that is about. RDF4J 4.x is failing on an additional 2 tests and I'm trying to ascertain what the issue could be and/or whether it's possible to skip them. tx

pulquero commented 2 years ago

The added entries should just reflect what entries are in the rdf4j parent class. I think it's a private static method, hence the copy & paste.