spaziocodice / SolRDF

An RDF plugin for Solr
Apache License 2.0

Integration with SolrCloud (RDF Mode) #44

Closed agazzarini closed 9 years ago

agazzarini commented 9 years ago

Move the current (standalone) implementation to SolrCloud

agazzarini commented 9 years ago

Finally I got distributed (RDF) indexing working, but I had to fight (and lose, of course) against SOLR-3473 [1]. I never understood exactly what was happening, but in the end I had to remove the custom update chain with the SignatureUpdateProcessor. That "auto" id feature has been replaced with a custom function in SolRDFGraph.


[1] https://issues.apache.org/jira/browse/SOLR-3473
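To illustrate the idea of replacing the signature-based "auto" id with a function on the client side, here is a minimal sketch of a deterministic id derived from the components of a quad. The class name `TripleIds`, the method `idFor`, the choice of MD5 and the length-prefix scheme are all assumptions made for this example; this is not the actual code in SolRDFGraph.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, for illustration only (not SolRDF's real implementation). */
final class TripleIds {
    private TripleIds() {}

    /** Returns a 32-character hex id that depends only on the four given components. */
    static String idFor(String subject, String predicate, String object, String context) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String part : new String[] {subject, predicate, object, context}) {
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                // A length prefix keeps ("ab", "c") distinct from ("a", "bc").
                md5.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md5.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should never happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the id is computed before the document is sent, every node sees the same id for the same quad, so the update no longer depends on a signature processor running in the update chain.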

agazzarini commented 9 years ago

The code is still in the r_solrcloud branch because the "querying" part is not working yet. In addition, to maintain backward compatibility, I need to implement issue #61 first.

agazzarini commented 9 years ago

A great step ahead (at least I think so): I successfully ran SolRDF in cloud mode (i.e. SolrCloud) on a small cluster (5 nodes, 3 shards, 2 replicas).

It is still part of the r_solrcloud dev branch, so it is not yet merged into master.

Both indexing and querying are distributed according to SolrCloud behaviour. That means a free, simple and proven distributed RDF store that can be enhanced with Solr / Lucene full-text capabilities. Unfortunately, at the moment I have some problems with the "hybrid" mode because I need to better understand how distributed faceting works.

As I said, I need to investigate some aspects further, but I think this is a great step ahead.

I will release these changes as soon as possible.

agazzarini commented 9 years ago

The SolrCloud integration work has been merged because the previous (standalone) behaviour is not affected. Instead of having a DatasetGraphFactory as illustrated in the picture above, the component in charge of detecting the appropriate DatasetGraph implementation is the SPARQLSearchComponent, which "switches" between standalone and cloud mode depending on the kind of running instance (it checks whether the node is aware of ZooKeeper).
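The mode switch described above can be sketched roughly as follows. Every type here (`GraphSketch`, `LocalGraphSketch`, `CloudGraphSketch`, `GraphSelector`) is a hypothetical stand-in for this example and not a SolRDF or Jena class. In a real Solr component the flag would probably come from something like `CoreContainer.isZooKeeperAware()`, but that wiring is an assumption and is left out so the sketch stays self-contained.

```java
/** Hypothetical stand-in for a DatasetGraph implementation. */
interface GraphSketch {
    String mode();
}

/** Stand-in for the graph that indexes and queries the local core directly. */
final class LocalGraphSketch implements GraphSketch {
    @Override
    public String mode() {
        return "standalone";
    }
}

/** Stand-in for the graph that distributes requests across the cluster. */
final class CloudGraphSketch implements GraphSketch {
    @Override
    public String mode() {
        return "cloud";
    }
}

/** Picks the implementation once, based on whether the node knows about ZooKeeper. */
final class GraphSelector {
    private GraphSelector() {}

    static GraphSketch select(boolean zooKeeperAware) {
        return zooKeeperAware ? new CloudGraphSketch() : new LocalGraphSketch();
    }
}
```

Keeping the decision in one place means the rest of the query pipeline only talks to the common interface and does not need to know which mode is active.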

agazzarini commented 9 years ago

The graph hierarchy has been changed a bit in order to centralize all behaviour common to LocalGraph and ReadOnlyCloudGraph in a supertype layer (SolRDFGraph).

At the moment distributed indexing works, but a lot of queries from the integration suite (which work correctly in standalone mode) fail. For some of them the reason is the autocommit interval, which does not fire at the same moment across the cluster, but there are also some other issues that I'm going to investigate.

However, I think a big step ahead has been made.

agazzarini commented 9 years ago

The main problem with the failing tests has (fortunately) nothing to do with the autocommit. The problem is the different character encoding rules between the LocalGraph, which indexes and queries locally without any network (HTTP) call, and the ReadOnlyCloudGraph, which performs a lot of HTTP round-trips to distribute queries. I got that working after several tries, but I didn't commit because there is still some strange (and random) behaviour.
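As an illustration of the kind of mismatch involved, the sketch below shows a value that is safe to use as-is in a local call but must be encoded and decoded with the same charset when it travels as an HTTP parameter. The class name `QueryParamCodec` is hypothetical, and the example only uses the standard `java.net.URLEncoder` and `java.net.URLDecoder`; it does not claim to reproduce the actual fix.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical helper that always encodes and decodes with UTF-8. */
final class QueryParamCodec {
    private QueryParamCodec() {}

    /** Encodes a raw value for use in a URL query string or form body. */
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Reverses encode(); both sides must agree on the charset. */
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that a literal `+` becomes `%2B` while a space becomes `+`, so a value that is encoded twice, or decoded on only one side, silently changes its content. That is one plausible way for results to differ between a local graph and a graph that goes through HTTP.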

agazzarini commented 9 years ago

Partial results have been merged into master, as the SELECT, DESCRIBE, ASK and CONSTRUCT integration tests are all working in SolrCloud mode.

The r_rdf_mode_in_solrcloud branch is still in progress, as there are some failures related to SPARQL updates.

agazzarini commented 9 years ago

After fighting a bit (mostly against stupid things like URL encoding) I finally got SolRDF working in SolrCloud mode! The whole SPARQL 1.1 integration test suite (more or less 150 SELECT, ASK, CONSTRUCT and UPDATE examples) is working. There are just a few minor test cases that have been @Ignored (for each of them there's a dedicated issue).

Although I think something has to be improved on the query optimization side, I think this is a great step ahead towards a powerful, scalable and Solr-based RDF Store / SPARQL Endpoint.