Open baliberdin opened 5 years ago
The Solr error indicates that the schema is set to be immutable, but Solr is also configured to add fields when a document comes in that has fields that have not yet been defined in the schema.
The mutability of the schema is set by the schemaFactory configuration, which by default is the managed schema. This is configured in the solrconfig.xml
file. This section of the Reference Guide explains how to configure it with mutable=true, if you want: https://lucene.apache.org/solr/guide/7_5/schema-factory-definition-in-solrconfig.html#solr-uses-managed-schema-by-default
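For reference, the schemaFactory declaration in solrconfig.xml looks roughly like this (a sketch following the Reference Guide; the resource name shown is the default):

```xml
<!-- Managed schema with mutability enabled. Setting mutable to false
     is what produces the "schema is not mutable" error. -->
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
```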
If you prefer to prevent Solr from adding new fields entirely, you can disable automatic field guessing, either by editing out all the parts of solrconfig.xml
that set up field guessing, or by restarting Solr with the property update.autoCreateFields
set to false. This section of the Reference Guide explains how to do that:
https://lucene.apache.org/solr/guide/7_5/schemaless-mode.html#disabling-automatic-field-guessing
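For example, the property can also be set through the Config API without editing solrconfig.xml by hand. A sketch, assuming Solr on the default local port and a collection named mycollection (adjust host and collection name to your setup):

```shell
# Set the user property that controls automatic field creation to false.
# Assumes Solr at localhost:8983 and a collection named "mycollection".
curl http://localhost:8983/solr/mycollection/config \
  -H 'Content-type:application/json' \
  -d '{"set-user-property": {"update.autoCreateFields": "false"}}'
```

Solr stores this in the config overlay and reloads the affected cores, so no manual restart of the whole node should be needed.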
Also, spark-solr only creates fields if they don't already exist. If you don't want spark-solr to create fields at all, pre-create all your fields before starting your index operation.
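Pre-creating a field can be done with the Schema API. A sketch, again assuming a local Solr and a collection named mycollection; the field name (taken from the example later in this thread) and attributes are illustrative:

```shell
# Pre-create a multi-valued string field so spark-solr finds it already defined.
# Assumes Solr at localhost:8983 and a collection named "mycollection".
curl -X POST http://localhost:8983/solr/mycollection/schema \
  -H 'Content-type:application/json' \
  -d '{"add-field": {"name": "category_multi_str", "type": "strings", "stored": false, "docValues": true}}'
```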
Thanks (@ctargett & @kiranchitturi) for the explanation. But I think I didn't give you enough information about the issue.
@ctargett: Exactly! My Solr schema is immutable, and I don't want spark-solr to try to create new fields. Why? Because I would like to manage them myself (just for control) and maintain certain characteristics of my fields that spark-solr cannot set for me, such as stored=false, termVectors, fieldType... I would prefer that spark-solr throw an exception if a field does not exist, rather than try to create it. Maybe it could be a spark-solr config option =)
@kiranchitturi All the fields have already been created, but spark-solr tries to create them every time. If I make my schema mutable, all the errors disappear, but spark-solr still tries to create the fields every single time.
I think spark-solr expects fields to be configured in a specific way (e.g. stored=true), and if they are not, it tries to create or modify those fields over and over again.
I will try to run an example and post more information here. Thank you.
Do you have field guessing (aka schemaless mode) disabled? If so, Solr should throw an error when a document to be inserted contains a field that does not exist in the schema.
The other thing is, if you want control, you should check that you have removed all the dynamic field rules. The field example you provided, category_multi_str,
would match the default Solr dynamic field rule in the schema. It looks like this out of the box:
<dynamicField name="*_str" type="strings" stored="false" docValues="true" indexed="false" useDocValuesAsStored="false"/>
I think, though, that since the error is about the schema being immutable, the field guessing is still enabled.
Spark-solr really knows nothing about Solr's schema, and I'm not sure it should really care. It's Solr's behavior that is the problem here, so let's make sure that you have everything in Solr set up correctly for how you want it to work.
@ctargett Spark-Solr does know about the Solr schema: it asks Solr about the fields.
Here spark-solr gets the schema information from Solr: SolrRelation.scala#L1002
Here, at SolrQuerySupport.scala#L271, spark-solr checks various field configurations, filters them by some criteria, and puts the fields that pass into a list.
Here, at SolrRelation.scala#L1004, spark-solr iterates over the document fields and checks whether they exist in that filtered list. If they don't, the field is added to a list of fields to be created.
What I'm trying to say is: if spark-solr stopped trying to update my schema (which is immutable) and just posted the documents to Solr, everything would be fine =). All the needed fields are there.
Maybe this config is the answer: SolrQuerySupport.scala#L266 skipFieldCheck. If I find a way to set this config before indexing my documents, I guess it will work.
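For illustration only, here is a sketch of what passing such a flag through the DataFrame writer options might look like. The option name skip_solr_field_check is hypothetical (no such writer option existed at the time of this thread, as the maintainer confirms below); only zkhost and collection are real spark-solr options:

```scala
// Sketch: writer options for spark-solr. The "skip_solr_field_check" key is
// a HYPOTHETICAL option name used for illustration -- check the spark-solr
// README for whatever option was actually merged.
val writeOptions = Map(
  "zkhost" -> "localhost:9983",        // ZooKeeper connect string (adjust)
  "collection" -> "mycollection",      // target collection (adjust)
  "skip_solr_field_check" -> "true"    // hypothetical: skip schema field checks
)
// With a DataFrame `df` in scope, the write would then be:
// df.write.format("solr").options(writeOptions).save()
println(writeOptions("skip_solr_field_check"))
```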
ah, yeah. There is no config option to disable the field additions right now. We should add one :)
Contributions are welcome ;)
@kiranchitturi I'm not a Scala expert, but I'll try to work on a patch for that. Thank you!
Is there a way to disable these automatic fields when creating a collection? Or is there a way to disable this globally on all collections while keeping schemaless mode?
I think the PR that was added would do it from the Spark side. One thought, though: could you leverage https://solr.apache.org/guide/solr/latest/configuration-guide/update-request-processors.html and add a processor that drops any non-matching fields on the Solr side?
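One way to do that on the Solr side is IgnoreFieldUpdateProcessorFactory, which, when given no configuration, removes any field from incoming documents that does not match an explicit or dynamic field in the schema. A sketch for solrconfig.xml (the chain name is illustrative, and the chain still has to be wired to the update handler or selected per request):

```xml
<!-- Sketch: drop unknown fields before they reach the schema.
     The chain name "ignore-unknown-fields" is illustrative. -->
<updateRequestProcessorChain name="ignore-unknown-fields">
  <processor class="solr.IgnoreFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```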
How can I prevent spark-solr from updating my schema? I understand that this feature is useful, but in my case it isn't. It would be nice to be able to disable this feature in some cases.
Spark = 2.2.0 Spark-Solr = 3.3.4 Solr = 7.5.0
Spark-Solr trying to add new field
Spark-Solr - Error on update schema
Solr error