lucidworks / spark-solr

Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Apache License 2.0

Spark-Solr can't load non-stored multivalued fields with docValues=true and useDocValuesAsStored=true #307

Open uyilmaz opened 3 years ago

uyilmaz commented 3 years ago

Using Solr 8.4.0, Spark-Solr 3.6.1, Spark 2.11

When a field is configured in Solr with:

stored="false" docValues="true" useDocValuesAsStored="true"

you can retrieve it in query results even though it is not stored, because its docValues are returned instead. This also works in spark-solr, except for fields with multiValued="true".
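
For comparison, here is a minimal SolrJ sketch (hypothetical ZooKeeper address, collection and field names, matching the spark-solr options used below) that retrieves the docValues-only multivalued field directly:

import java.util.{Collections, Optional}
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient
import scala.collection.JavaConverters._

val client = new CloudSolrClient.Builder(Collections.singletonList("myZK"), Optional.empty[String]()).build()
val query = new SolrQuery("multivaluedField:[* TO *]")
query.setFields("id", "multivaluedField")

// The field comes back populated from docValues even though stored="false"
val results = client.query("myCollection", query).getResults.asScala
results.foreach(doc => println(doc.getFieldValues("multivaluedField")))
client.close()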

SolrJ and the regular Solr API can return such fields, but when we query the same field with spark-solr:

val s1 = Map(
      "zkHost" -> "myZK",
      "collection" -> "myCollection",
      "query" -> "multivaluedField:[* TO *]",
      "fields" -> "multivaluedField",
      "max_rows" -> "100000",
      "flatten_multivalued"-> "false"
    )

val data = spark.read.format("solr").options(s1).load

data.createOrReplaceTempView("myTable")

This results in: data: org.apache.spark.sql.DataFrame = [id: string]. Notice that multivaluedField is not resolved into the schema.
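
A quick way to see the missing column (hypothetical output, consistent with the DataFrame shown above):

// multivaluedField is absent even though it was explicitly requested in "fields"
data.printSchema()
// root
//  |-- id: string (nullable = true)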

This is a serious issue in my opinion, because it prevents you from using the streaming approach when you need multiValued fields in an RDD.

uyilmaz commented 3 years ago

In addition to the above, when you specify a streaming expression instead of a query, like:

val s1 = Map(
      "zkHost" -> "myZK",
      "collection" -> "myCollection",
      "expr" -> """search(myCollection, q="multivaluedField:[* TO *]", qt="/export", fl="multivaluedField,id", sort="id asc")""",
      "max_rows" -> "100000",
      "flatten_multivalued" -> "false"
    )

the "flatten_multivalued" parameter loses its effect, multivalued fields always get flattened.