lucidworks / spark-solr

Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Apache License 2.0

FUS-3017: Spark-Solr performance improvements #358

Closed: joel-bernstein closed this 1 year ago

joel-bernstein commented 1 year ago

There are two small changes in Spark-Solr that will have a big impact on performance and usability.

  1. Set the default `splits_per_shard` to 1. Currently, if `splits_per_shard` is not set, spark-solr uses a formula to compute it, which can create a huge number of splits and grind the cluster to a halt. Having the default behavior result in a trap is very detrimental to usability. It's difficult for a typical user to understand the effect of `splits_per_shard`, so the safest approach is to simply default to 1. Users can then experiment with increasing the value to see whether it improves performance or causes problems (see the first sketch after this list).

  2. Change the default export handler sort field from `id` to `_version_`. Spark does not rely on the sort order anyway, so spark-solr should simply sort on the most performant field. `_version_` is a long, which is much more efficient to sort on than `id`, which is a string (see the second sketch after this list).
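
For context on the first change, here is a minimal sketch of what a read looks like under the new default and how a user would opt back in to more parallelism. The `zkhost`, collection, and split values are placeholders; the option names (`splits`, `splits_per_shard`) follow the spark-solr DataSource options, but treat the exact settings as an assumption, not a recommendation:

```scala
import org.apache.spark.sql.SparkSession

object SplitsPerShardExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-solr-splits-example")
      .getOrCreate()

    // With this change, splits_per_shard defaults to 1, so a plain read
    // creates one Spark partition per Solr shard instead of a computed
    // (and potentially huge) number of splits.
    val df = spark.read.format("solr")
      .option("zkhost", "localhost:9983/solr") // placeholder ZK connect string
      .option("collection", "mycollection")    // placeholder collection name
      .load()

    // Users can still experiment with more parallelism and measure the effect:
    val dfMoreSplits = spark.read.format("solr")
      .option("zkhost", "localhost:9983/solr")
      .option("collection", "mycollection")
      .option("splits", "true")           // enable intra-shard splitting
      .option("splits_per_shard", "4")    // experiment upward from the default of 1
      .load()

    println(s"default partitions: ${df.rdd.getNumPartitions}, " +
      s"with splits_per_shard=4: ${dfMoreSplits.rdd.getNumPartitions}")
  }
}
```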
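
And for the second change, a sketch of a read routed through Solr's /export handler, where the new default sort applies. The `request_handler` and `fields` options are the documented spark-solr way to use /export; the collection and field names are placeholders, and the comments describe the behavioral change only:

```scala
import org.apache.spark.sql.SparkSession

object ExportHandlerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-solr-export-example")
      .getOrCreate()

    // Route the read through Solr's /export handler, which streams full
    // result sets but requires every requested field to be docValues-enabled.
    val df = spark.read.format("solr")
      .option("zkhost", "localhost:9983/solr") // placeholder ZK connect string
      .option("collection", "mycollection")    // placeholder collection name
      .option("request_handler", "/export")
      .option("fields", "id,price_f,ts_l")     // placeholder docValues fields
      .load()

    // With this change, the sort spark-solr sends to /export defaults to
    // "_version_ asc" (a numeric long sort) instead of "id asc" (a string
    // sort). Nothing changes for the caller: Spark ignores the incoming
    // order, so df is consumed exactly as before.
    df.show(10)
  }
}
```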