lucidworks / spark-solr

Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Apache License 2.0
How to Disable Schema Update? #246

Open baliberdin opened 5 years ago

baliberdin commented 5 years ago

How can I prevent spark-solr from updating my schema? I understand that this feature is useful, but in my case it isn't. It would be nice to be able to disable it in some cases.

Spark = 2.2.0 Spark-Solr = 3.3.4 Solr = 7.5.0

Spark-Solr trying to add new field

19/01/28 16:03:13 INFO SolrRelation: adding new field: {name=category_multi_str, indexed=true, multiValued=true, docValues=true, stored=true, type=string}

Spark-Solr - Error on update schema

19/01/28 16:03:13 ERROR CloudSolrClient: Request to collection mycollection failed due to (400) org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://mysolrserver:8983/solr/mycollection: error processing commands, retry? 0
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://mysolrserver:8983/solr/mycollection: error processing commands

Solr error

org.apache.solr.api.ApiBag$ExceptionWithErrObject: error processing commands, errors: [{errorMessages=schema is not editable}]

ctargett commented 5 years ago

The Solr error indicates that the schema is set to be immutable, but Solr is also configured to add fields when a document comes in that has fields that have not yet been defined in the schema.

The mutability of the schema is set by the schemaFactory configuration, which by default is the managed schema. This is configured in the solrconfig.xml file. This section of the Reference Guide explains how to configure that to make it mutable=true, if you want: https://lucene.apache.org/solr/guide/7_5/schema-factory-definition-in-solrconfig.html#solr-uses-managed-schema-by-default

If you prefer to disable Solr adding new fields entirely, you can disable automatic field guessing by either editing out all the parts of solrconfig.xml that set up field guessing, or by restarting Solr and setting the property update.autoCreateFields to false. This section of the Reference Guide explains how to do that: https://lucene.apache.org/solr/guide/7_5/schemaless-mode.html#disabling-automatic-field-guessing
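A minimal sketch of the property-based option, assuming the Solr Config API's set-user-property command is used and that the host and collection are the ones from the log above. The helper name is made up for illustration, and the request is only built, not sent.

```python
import json
import urllib.request


def build_disable_autocreate_request(solr_base_url, collection):
    """Build (but do not send) a Config API request that sets the
    update.autoCreateFields user property to false for one collection."""
    url = f"{solr_base_url.rstrip('/')}/{collection}/config"
    payload = {"set-user-property": {"update.autoCreateFields": "false"}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Host and collection names come from the log above; adjust for your cluster.
request = build_disable_autocreate_request("http://mysolrserver:8983/solr", "mycollection")
# To apply the change, send it: urllib.request.urlopen(request)
```

Keeping the request construction separate from sending makes it easy to inspect the URL and body before touching a live cluster.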

kiranchitturi commented 5 years ago

Also, spark-solr only creates fields if they don't already exist. If you don't want spark-solr to create fields, pre-create all your fields before starting your index operation.

baliberdin commented 5 years ago

Thanks (@ctargett & @kiranchitturi) for the explanation, but I think I didn't give you enough information about the issue.

@ctargett: Exactly! My Solr schema is immutable, and I don't want spark-solr to try to create new fields. Why? Because I would like to manage them myself (just for control) and keep certain characteristics of my fields that spark-solr cannot set for me, such as stored=false, termVectors, fieldType... I would prefer that spark-solr throw an exception if a field does not exist rather than try to create it. Maybe it could be a spark-solr config =)

@kiranchitturi All the fields have already been created, but spark-solr tries to create them every time. If I make my schema mutable, all the errors disappear, but spark-solr still tries to create the fields every single time.

I think spark-solr expects fields to be configured in a specific way (e.g. stored=true), and if they are not, it tries to create or modify them over and over again.

I will try to run an example and post more information here. Thank you.

ctargett commented 5 years ago

Do you have field guessing (aka Schemaless) disabled? If so, Solr should be throwing an error saying that a field in a document to be inserted does not exist.

The other thing is that, if you want control, you should check that you have removed all the dynamic field rules. The field example you provided, category_multi_str, would match the default Solr dynamic field rule in the schema. It looks like this out of the box:

<dynamicField name="*_str" type="strings" stored="false" docValues="true" indexed="false" useDocValuesAsStored="false"/>

I think, though, that since the error is about the schema being immutable, the field guessing is still enabled.

Spark-solr really knows nothing about Solr's schema, and I'm not sure it should really care. It's Solr's behavior that is the problem here, so let's make sure that you have everything in Solr set up correctly for how you want it to work.
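To illustrate why category_multi_str falls under the *_str rule quoted above, here is a small sketch of dynamic field matching. It is a simplification written for this thread, not Solr's own code, and it only handles a single leading or trailing wildcard.

```python
def matches_dynamic_field(pattern, field_name):
    """Return True if field_name matches a Solr-style dynamic field pattern.

    Simplification: only a single leading or trailing '*' is handled,
    which is the form used by rules such as "*_str".
    """
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return field_name == pattern


print(matches_dynamic_field("*_str", "category_multi_str"))  # True
print(matches_dynamic_field("*_str", "category"))            # False
```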

baliberdin commented 5 years ago

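@ctargett Spark-Solr does know about the Solr schema, because it asks Solr about the fields.

Here spark-solr gets the schema information from Solr: SolrRelation.scala#L1002 https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L1002

Here, at SolrQuerySupport.scala#L271 https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/util/SolrQuerySupport.scala#L271, spark-solr checks various field configurations, filters the fields by some criteria, and puts the ones that pass into a list.

Here, at SolrRelation.scala#L1004 https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L1004, spark-solr iterates over the document fields and checks whether each one exists in that filtered list. If it doesn't, the field is added to a list of fields to be created.

What I'm trying to say is: if spark-solr stopped trying to update my schema (which is immutable) and just posted the documents to Solr, everything would be fine =). All the needed fields are there.

Maybe this config is the answer: SolrQuerySupport.scala#L266 https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/util/SolrQuerySupport.scala#L266 skipFieldCheck. If I find a way to set this config before indexing my documents, I guess it will work.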
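The flow described above can be sketched roughly as follows. This is an illustration of the described behavior, not spark-solr's actual code: the function name, the sample schema, and the stored=true filter are all assumptions (the filter is the guess made earlier in this thread).

```python
def fields_to_create(document_fields, schema_fields, keep):
    """Return document fields that are absent from the filtered schema list.

    schema_fields maps field name -> properties as reported by Solr.
    keep is the filter predicate; the real criteria are not spelled out
    in this thread, so the caller supplies a hypothetical one.
    """
    accepted = {name for name, props in schema_fields.items() if keep(props)}
    return [name for name in document_fields if name not in accepted]


# Hypothetical schema: the field exists, but with stored=false.
schema = {
    "id": {"stored": True},
    "category_multi_str": {"stored": False},
}
document = ["id", "category_multi_str"]

# Assumed criterion: keep only stored fields.
missing = fields_to_create(document, schema, keep=lambda props: props.get("stored", False))
print(missing)  # ['category_multi_str']
```

Under that assumption, a field that already exists but fails the filter is reported as missing on every run, which would match the repeated "adding new field" log lines.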

kiranchitturi commented 5 years ago

ah, yeah. There is no config option to disable the field additions right now. We should add one :)

kiranchitturi commented 5 years ago

Contributions are welcome ;)

baliberdin commented 5 years ago

@kiranchitturi I'm not a Scala expert, but I'll try to work on a patch for that. Thank you!

stefannesic commented 1 year ago

The Solr error indicates that the schema is set to be immutable, but Solr is also configured to add fields when a document comes in that has fields that have not yet been defined in the schema.

The mutability of the schema is set by the schemaFactory configuration, which by default is the managed schema. This is configured in the solrconfig.xml file. This section of the Reference Guide explains how to configure that to make it mutable=true, if you want: https://lucene.apache.org/solr/guide/7_5/schema-factory-definition-in-solrconfig.html#solr-uses-managed-schema-by-default

If you prefer to disable Solr adding new fields entirely, you can disable automatic field guessing by either editing out all the parts of solrconfig.xml that set up field guessing, or by restarting Solr and setting the property update.autoCreateFields to false. This section of the Reference Guide explains how to do that: https://lucene.apache.org/solr/guide/7_5/schemaless-mode.html#disabling-automatic-field-guessing

Is there a way to disable these automatic fields when creating a collection? Or is there a way to globally disable this on all collections while keeping Schemaless mode?

epugh commented 1 year ago

I think the PR that was added would do it from the Spark side... One thought, though: could you leverage https://solr.apache.org/guide/solr/latest/configuration-guide/update-request-processors.html and have an update request processor that drops any non-matching fields on the Solr side?
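One way the update request processor idea might look, as a sketch: Solr ships solr.IgnoreFieldUpdateProcessorFactory, which by default drops fields that do not exist in the schema, and the Config API has an add-updateprocessor command for registering a named processor. The processor name below is made up, and nothing in this thread confirms how spark-solr would pass it along on update requests.

```python
import json


def ignore_unknown_fields_command(processor_name):
    """Build the Config API command body that registers a named
    solr.IgnoreFieldUpdateProcessorFactory update processor."""
    return {
        "add-updateprocessor": {
            "name": processor_name,
            "class": "solr.IgnoreFieldUpdateProcessorFactory",
        }
    }


# "ignore-unknown" is a hypothetical name chosen for this example.
command = ignore_unknown_fields_command("ignore-unknown")
print(json.dumps(command, indent=2))
```

The resulting body would be POSTed to the collection's /config endpoint, and update requests would then need to reference the processor by name (for example with a processor=ignore-unknown request parameter).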