Indexing files is not working

pbauer commented 4 years ago

When I use the configuration shipped with kitconcept.recipe.solr, indexing Files does not work for me.

Versions tested:

Plone 5.2
Python 2.7 and 3.7
Solr 7.7.2, 8.2.0 and 8.3.0

When I add a simple text file (or any other file), the default settings give me a Not Found:

2019-11-28 14:55:31,133 WARNING [collective.solr.indexer:164][waitress] Error HTTP code=404, reason=Not Found @ /Plone/foo.txt

After reading the Solr docs and the code in collective.solr.indexer.BinaryAdder, I added the following to my solrconfig.xml:

  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

[...]

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">SearchableText</str>
    </lst>
  </requestHandler>

The result now was a Bad Request:

2019-11-28 14:55:31,133 WARNING [collective.solr.indexer:164][waitress] Error HTTP code=400, reason=Bad Request @ /Plone/foo.txt

Solr does not log any error (I run solr with ./bin/solr-foreground).
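
Since neither Plone nor Solr logs a reason for the 400, one way to narrow it down is to post the file to the extract handler directly and read the response body. The following is a minimal Python 3 sketch, not what collective.solr does internally. The base URL (host, port 8983, core name `plone`) and the helper names `extract_url` and `post_file` are assumptions for illustration. `literal.id`, `extractOnly`, `commit` and `wt` are standard parameters of Solr's extracting request handler.

```python
import urllib.error
import urllib.parse
import urllib.request


def extract_url(base_url, doc_id, extract_only=True):
    """Build the URL for the /update/extract handler of a Solr core."""
    params = {"literal.id": doc_id, "wt": "json"}
    if extract_only:
        # extractOnly=true returns the extracted text without indexing it.
        params["extractOnly"] = "true"
    else:
        params["commit"] = "true"
    query = urllib.parse.urlencode(params)
    return base_url.rstrip("/") + "/update/extract?" + query


def post_file(url, path, content_type="text/plain"):
    """POST the raw file to Solr and return (status code, response text)."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # For 4xx answers the body often carries the actual error message.
        return error.code, error.read().decode("utf-8", "replace")
```

Calling `post_file(extract_url("http://localhost:8983/solr/plone", "/Plone/foo.txt"), "foo.txt")` against a running Solr returns the status code together with whatever message Solr sends back, which may say more than the one-line warning in the Plone log.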

tisto commented 4 years ago

@pbauer I just tried this with Plone 5.2, Python 3, Solr 7.7.2:

1) Create new Plone site
2) Enable Solr and reindex
3) Add file content type via Plone and upload a file (PDF)

-> File is properly indexed, can be searched and shows up in the Solr index

Is this roughly what you did as well?

tisto commented 4 years ago

@pbauer ok, now I see the warning in the Plone logs, which I missed earlier:

2019-11-28 15:53:09,138 WARNING [collective.solr.indexer:164][waitress] Error HTTP code=404, reason=Not Found @ /Plone/de/FILE.txt

This comes from the BinaryAdder, which uses Tika, a different way to index binary data in Solr. It was added to collective.solr by Tom Gross. It could be that we broke that functionality in one of our many upgrades.

I never relied on Tika, though, but on the standard text extraction from Plone. Therefore I think the binary file indexing functionality is in place and working. Can you confirm this?

It would be nice to fix the Tika functionality. Though, if nobody steps up to look into this we might as well disable this for now.

tisto commented 4 years ago

Note: This has nothing to do with the recipe. This is a collective.solr issue...

pbauer commented 4 years ago

I did exactly the same as you and the SearchableText for Files is always empty in solr. In portal_catalog (e.g. http://localhost:8080/Plone2/portal_catalog/manage_objectInformation?rid=219333246) it contains the expected text. Other content types also work well.

Are you using the solr-config from your recipe?

Also: I really like the approach of using Tika, but what is failing there eludes me.

pbauer commented 4 years ago

Ok, found it. It was a terrible combination of two problems:

For default dexterity Files the adder is configured like this:
```
<adapter
    factory="collective.solr.indexer.DXFileBinaryAdder"
    for="plone.dexterity.interfaces.IDexterityContent"
    name="File"
    />
```
Since the POST to solr/plone/update/extract fails (I still don't know why), a SolrConnectionException is thrown and the SearchableText is blanked (data["SearchableText"] = "") before the DefaultAdder is called, so the fallback indexes an empty text.
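
The control flow described above can be modelled in a few lines of Python. This is a simplified sketch to illustrate the failure mode, not the actual collective.solr code: `binary_adder`, `default_adder` and `failing_post` are hypothetical stand-ins, and the exception class here is only a local placeholder with the same name.

```python
class SolrConnectionException(Exception):
    """Local placeholder for the exception raised when the POST fails."""


def default_adder(data):
    # Hypothetical default adder: sends the data to Solr unchanged.
    return dict(data)


def binary_adder(data, post_to_extract):
    # Hypothetical binary adder: try /update/extract first and fall back
    # to the default adder if Solr rejects the request.
    try:
        return post_to_extract(data)
    except SolrConnectionException:
        # The text already extracted by Plone is thrown away here,
        # so the fallback indexes the file with an empty SearchableText.
        data["SearchableText"] = ""
        return default_adder(data)


def failing_post(data):
    # Simulates the 404 / 400 answers seen in the log.
    raise SolrConnectionException("HTTP code=400, reason=Bad Request")


doc = {"id": "/Plone/foo.txt", "SearchableText": "text extracted by Plone"}
via_binary = binary_adder(dict(doc), failing_post)
via_default = default_adder(dict(doc))
print(repr(via_binary["SearchableText"]))
print(repr(via_default["SearchableText"]))
```

In this model the document that goes through the failing binary adder ends up with an empty SearchableText, while the same document sent straight through the default adder keeps the text Plone extracted.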

I fixed that by always using the DefaultAdder in an overrides.zcml:
```
<adapter
    factory="collective.solr.indexer.DefaultAdder"
    for="plone.dexterity.interfaces.IDexterityContent"
    name="File"
    />

<adapter
    factory="collective.solr.indexer.DefaultAdder"
    for="plone.dexterity.interfaces.IDexterityContent"
    name="Image"
    />
```
Now, in a default Plone, the file content was indexed in Solr. Nice! But there was another problem in my site.
I use collective.dexteritytextindexer and had it enabled for Files because I have custom fields that I want indexed in SearchableText as well. But collective.dexteritytextindexer prevents the file itself from being indexed; only my additional fields ended up in the index. I'll find a way to work around that tomorrow.

The main point is that yes, there is a real bug in collective.solr. Until the indexer that relies on Tika is fixed, we'll have to disable BinaryAdder (for AT), DXFileBinaryAdder and DXImageBinaryAdder. Once that is fixed, the default config needs to be extended to support /update/extract.

Can we move this ticket to collective.solr?

tisto commented 4 years ago

@pbauer yeah. I'd move that ticket to collective.solr.

kitconcept / kitconcept.recipe.solr

Indexing files is not working #13