spaziocodice / SolRDF

An RDF plugin for Solr
Apache License 2.0

Dynamically Bootstrap Named Analysed Fields for Searching and Boosting #69

Open ahagenbruch opened 9 years ago

ahagenbruch commented 9 years ago

Hi @agazzarini, the current schema in SolRDF is mostly focused on the SPARQL endpoint use case, i.e. its object literals are indexed into unanalysed string fields. To accommodate a more common use case, where we also want analysed field searching and per-field boosting, we could write object literals into named fields derived from the QNames. As Solr provides the mechanism of dynamic fields, we propose the following enhancement:

Transform the QName and optional datatype and language information into a field name of the following structure:

prefix_predicateName[_datatype][_lang]
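A minimal sketch of how such a field name could be derived, written in Python for brevity. The function name `field_name`, the fallback to `xsd_string` for untyped literals (suggested by the `*_xsd_string` patterns in this proposal) and the normalisation of language tags are assumptions made for illustration, not existing SolRDF code:

```python
from typing import Optional


def field_name(prefix: str, local_name: str,
               datatype: Optional[str] = None,
               lang: Optional[str] = None) -> str:
    """Build a name of the form prefix_predicateName[_datatype][_lang]."""
    # Assumption: a literal without an explicit datatype is treated as xsd:string.
    datatype = datatype or "xsd:string"
    parts = [prefix, local_name, datatype.replace(":", "_")]
    if lang:
        # Assumption: language tags are lowercased and "-" becomes "_".
        parts.append(lang.lower().replace("-", "_"))
    return "_".join(parts)


plain = field_name("skos", "notation")
german = field_name("rdfs", "label", lang="de")
typed = field_name("dcterms", "extent", datatype="xsd:integer")
```

With these assumptions, `plain` would match the `*_xsd_string` pattern, `german` the `*_xsd_string_de` pattern and `typed` the `*_xsd_integer` pattern.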

Use abstract heuristics to provide a basic search schema, which can then be adapted to the actual requirements of the dataset. We make the general assumption that all fields can have multiple values:

Map untyped and language-less literals to text_general: <dynamicField name="*_xsd_string" type="text_general" indexed="true" stored="true" multiValued="true"/>

Map literals with language information to corresponding language text fields: <dynamicField name="*_xsd_string_de" type="text_de" indexed="true" stored="true" multiValued="true"/> ...

Map typed literals to the corresponding fields:
xsd:integer => <dynamicField name="*_xsd_integer" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:nonPositiveInteger => <dynamicField name="*_xsd_nonPositiveInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:negativeInteger => <dynamicField name="*_xsd_negativeInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:long => <dynamicField name="*_xsd_long" type="tlong" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedLong => <dynamicField name="*_xsd_unsignedLong" type="tlong" indexed="true" stored="true" multiValued="true"/>
xsd:int => <dynamicField name="*_xsd_int" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedInt => <dynamicField name="*_xsd_unsignedInt" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:short => <dynamicField name="*_xsd_short" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedShort => <dynamicField name="*_xsd_unsignedShort" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:byte => <dynamicField name="*_xsd_byte" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedByte => <dynamicField name="*_xsd_unsignedByte" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:nonNegativeInteger => <dynamicField name="*_xsd_nonNegativeInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:positiveInteger => <dynamicField name="*_xsd_positiveInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:float => <dynamicField name="*_xsd_float" type="tfloat" indexed="true" stored="true" multiValued="true"/>
xsd:decimal => <dynamicField name="*_xsd_decimal" type="tfloat" indexed="true" stored="true" multiValued="true"/>
xsd:double => <dynamicField name="*_xsd_double" type="tdouble" indexed="true" stored="true" multiValued="true"/>
xsd:boolean => <dynamicField name="*_xsd_boolean" type="boolean" indexed="true" stored="true" multiValued="true"/>
xsd:string => <dynamicField name="*_xsd_string" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:hexBinary => <dynamicField name="*_xsd_hexBinary" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:base64Binary => <dynamicField name="*_xsd_base64Binary" type="binary" indexed="true" stored="true" multiValued="true"/>
xsd:anyURI => <dynamicField name="*_xsd_anyURI" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:QName => <dynamicField name="*_xsd_QName" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NOTATION => <dynamicField name="*_xsd_NOTATION" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:normalizedString => <dynamicField name="*_xsd_normalizedString" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:token => <dynamicField name="*_xsd_token" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:language => <dynamicField name="*_xsd_language" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:IDREFS => <dynamicField name="*_xsd_IDREFS" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:IDREF => <dynamicField name="*_xsd_IDREF" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ENTITIES => <dynamicField name="*_xsd_ENTITIES" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ENTITY => <dynamicField name="*_xsd_ENTITY" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NMTOKENS => <dynamicField name="*_xsd_NMTOKENS" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:Name => <dynamicField name="*_xsd_Name" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NCName => <dynamicField name="*_xsd_NCName" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ID => <dynamicField name="*_xsd_ID" type="string" indexed="true" stored="true" multiValued="true"/>

Map date and dateTime types to a date field and supplement the missing values (e.g. "2015" => "2015-01-01T00:00:00Z"): xsd:date => <dynamicField name="*_xsd_date" type="tdate" indexed="true" stored="true" multiValued="true"/>

Map duration to a string field: xsd:duration => <dynamicField name="*_xsd_duration" type="string" indexed="true" stored="true" multiValued="true"/>

Map Gregorian date fields to a string field:
xsd:gYearMonth => <dynamicField name="*_xsd_gYearMonth" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gYear => <dynamicField name="*_xsd_gYear" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gMonthDay => <dynamicField name="*_xsd_gMonthDay" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gDay => <dynamicField name="*_xsd_gDay" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gMonth => <dynamicField name="*_xsd_gMonth" type="string" indexed="true" stored="true" multiValued="true"/>
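The step of supplementing missing date values mentioned above ("2015" => "2015-01-01T00:00:00Z") could look roughly like the following sketch. The helper name `to_solr_date` is hypothetical, and the sketch only assumes plain `YYYY`, `YYYY-MM` and `YYYY-MM-DD` inputs; time zones and full dateTime values are deliberately left out:

```python
import re

# Accepts YYYY, YYYY-MM or YYYY-MM-DD (assumption: no time zone suffix).
_PARTIAL_DATE = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")


def to_solr_date(value: str) -> str:
    """Pad a partial date to a full UTC timestamp string."""
    match = _PARTIAL_DATE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported date literal: {value!r}")
    year = match.group(1)
    month = match.group(2) or "01"  # missing month defaults to January
    day = match.group(3) or "01"    # missing day defaults to the first
    return f"{year}-{month}-{day}T00:00:00Z"


year_only = to_solr_date("2015")
year_month = to_solr_date("2015-05")
```

Anything that does not match the three supported shapes raises a `ValueError` rather than being silently indexed with a guessed value.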

agazzarini commented 9 years ago

Hi @ahagenbruch, this sounds really interesting. Many thanks for such a detailed proposal. I introduced the "Hybrid" mode for mixing Solr and plain RDF features, so this could be something that goes in that direction. I strongly agree with you that StrFields are limited in terms of querying capabilities.

I have to read your proposal again and then investigate what impact it would have on the existing code. In the meantime, a question: let's suppose we changed the schema in this way. What kind of queries would you issue to SolRDF? I think that with plain SPARQL you won't get any benefit from such a schema. Do you want to use Solr's built-in query parsers and get the results back in SPARQL results format?

Thanks again


BTW: I created a user list on Google. Feel free to join us if you want. We could also discuss this with the other (few at the moment) users.

agazzarini commented 9 years ago

@ahagenbruch I'm moving the discussion back here as these are concrete implementation details. Two doubts:

Field name

You said, in your proposal:

prefix_predicateName[_datatype][_lang] 

What about the prefix? Your schema example uses skos:notation, and that is fine: skos is a widely used, standard namespace. But what about custom namespaces? It doesn't sound good to index something like:

pippo_mynote_xsd_string 

because "pippo" might be known only at index time. At query time you might not know which prefixes were used during indexing, or you might use the same namespace mapped to a different prefix (e.g. pluto:mynote at query time and pippo:mynote at index time, where pippo and pluto point to the same namespace URI).
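The mismatch described above can be illustrated with a small sketch. The namespace URI `http://example.org/ns#` and the helpers `expand` and `naive_field` are made up for the example; only the prefixes pippo and pluto come from the discussion:

```python
# Hypothetical prefix mappings: both prefixes point to the same namespace URI.
index_time_prefixes = {"pippo": "http://example.org/ns#"}
query_time_prefixes = {"pluto": "http://example.org/ns#"}


def expand(qname: str, prefixes: dict) -> str:
    """Resolve a QName such as pippo:mynote to its full predicate URI."""
    prefix, local_name = qname.split(":", 1)
    return prefixes[prefix] + local_name


def naive_field(qname: str) -> str:
    """Derive a field name directly from the prefix, as in the proposal."""
    return qname.replace(":", "_") + "_xsd_string"


# The two QNames denote the same predicate...
same_predicate = (expand("pippo:mynote", index_time_prefixes)
                  == expand("pluto:mynote", query_time_prefixes))
# ...but the prefix-based field names differ, so a query misses the field.
indexed_field = naive_field("pippo:mynote")
queried_field = naive_field("pluto:mynote")
```

Here `same_predicate` is true while `indexed_field` and `queried_field` are different strings, which is exactly the situation where a query-time prefix would not find the indexed data.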

Multivalued fields

You said

We make the general assumption that all fields can have multiple values

Why? Each triple (i.e. each document) will have exactly one value for the object field, regardless of the schema we use. Am I missing something about your proposal?

ahagenbruch commented 9 years ago

On 18.05.15 at 15:11, Andrea Gazzarini wrote:

Hi Andrea,

> You said, in your proposal:
>
> prefix_predicateName[_datatype][_lang]
>
> What about the prefix? Your schema example uses skos:notation, and that is fine: skos is a widely used, standard namespace. But what about custom namespaces? It doesn't sound good to index something like:
>
> pippo_mynote_xsd_string
>
> because "pippo" might be known only at index time. At query time you might not know which prefixes were used during indexing, or you might use the same namespace mapped to a different prefix (e.g. pluto:mynote at query time and pippo:mynote at index time, where pippo and pluto point to the same namespace URI).

I see your point, but I had these two use cases in mind when I wrote the proposal:

> Multivalued fields
>
> You said
>
> We make the general assumption that all fields can have multiple values
>
> Why? Each triple (i.e. each document) will have exactly one value for the object field, regardless of the schema we use. Am I missing something about your proposal?

By document I mean the subject URI as the document ID, the predicates as field names and the object literals as their values. As we can't know in advance which of our predicates might hold a list of objects*, the safe way seems to be to make all fields multivalued in the most general schema I proposed. If (as in my other two example schemas) you tailor the fields more closely to your dataset's needs, you probably don't want fields that you know to be single-valued declared as multivalued...

<thsys/72180> a skos:Concept, zbwext:Thsys ; rdfs:label "Statistics"@en, "Statistik"@de ; ...
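To make the subject-as-document model concrete, here is a rough sketch that groups triples by subject and collects the object literals into multivalued fields. The function `to_documents`, the tuple layout and the two `skos:altLabel` triples are assumptions added for illustration; only the two `rdfs:label` values come from the example above:

```python
def to_documents(triples):
    """Group (subject, predicate, literal, lang) tuples into one document per subject."""
    docs = {}
    for subject, predicate, literal, lang in triples:
        doc = docs.setdefault(subject, {"id": subject})
        # Assumption: untyped literals use the *_xsd_string[_lang] naming.
        field = predicate.replace(":", "_") + "_xsd_string"
        if lang:
            field += "_" + lang
        # Every field is a list, mirroring multiValued="true" in the schema.
        doc.setdefault(field, []).append(literal)
    return docs


triples = [
    ("thsys/72180", "rdfs:label", "Statistics", "en"),
    ("thsys/72180", "rdfs:label", "Statistik", "de"),
    # Hypothetical triples, not part of the original example.
    ("thsys/72180", "skos:altLabel", "Statistical methods", "en"),
    ("thsys/72180", "skos:altLabel", "Statistical science", "en"),
]
docs = to_documents(triples)
```

In this sketch the single document for `thsys/72180` ends up with one value in each `rdfs_label_*` field and two values in `skos_altLabel_xsd_string_en`, which is the case a single-valued field could not hold.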

Cheers,

Andre