Closed pbauer closed 2 years ago
@pbauer I just tried this with Plone 5.2, Python 3, Solr 7.7.2:
1) Create new Plone site
2) Enable Solr and reindex
3) Add file content type via Plone and upload a file (PDF)
-> File is properly indexed, can be searched and shows up in the Solr index
Is this roughly what you did as well?
@pbauer ok, now I see the warning in the Plone logs, which I missed earlier:
019-11-28 15:53:09,138 WARNING [collective.solr.indexer:164][waitress] Error HTTP code=404, reason=Not Found @ /Plone/de/FILE.txt
This is the BinaryAdder from Tika, which is a different way to index binary data in Solr. This was added by Tom Gross to collective.solr. It could be that we broke that functionality in one of our many upgrades.
Though, I never relied on Tika but on the standard text extraction from Plone. Therefore I think the binary file indexing functionality is in place and working. Can you confirm this?
It would be nice to fix the Tika functionality. Though, if nobody steps up to look into this we might as well disable this for now.
Note: This has nothing to do with the recipe. This is a collective.solr issue...
I did exactly the same as you and the SearchableText for Files is always empty in solr. In portal_catalog (e.g. http://localhost:8080/Plone2/portal_catalog/manage_objectInformation?rid=219333246) it contains the expected text. Other content types also work well.
Are you using the solr-config from your recipe?
Also: I really like the approach to use Tika but what is failing there eludes me.
Ok, found it. It was a terrible combination of two problems:
For default dexterity Files the adder is configured like this:
<adapter
    factory="collective.solr.indexer.DXFileBinaryAdder"
    for="plone.dexterity.interfaces.IDexterityContent"
    name="File"
    />
Since the post to `solr/plone/update/extract` fails (still don't know why) a `SolrConnectionException` is thrown and the SearchableText is removed (`data["SearchableText"] = ""`) before the `DefaultAdder` is called.
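To make the failure mode concrete, here is a minimal sketch of the flow as described in this comment. It is a simplified model, not the actual collective.solr code: `binary_adder_sketch`, `default_adder_sketch` and the local `SolrConnectionException` class are hypothetical stand-ins, and the exact ordering inside the real adder may differ.

```python
class SolrConnectionException(Exception):
    """Local stand-in for the exception raised when a Solr request fails."""


def default_adder_sketch(data):
    # Hypothetical stand-in for DefaultAdder: index the data exactly as given.
    return dict(data)


def binary_adder_sketch(data, post_to_extract):
    # Hypothetical stand-in for DXFileBinaryAdder.
    data = dict(data)
    # The text Plone extracted is dropped, on the assumption that the
    # extract handler (Tika) will supply the text instead.
    data["SearchableText"] = ""
    try:
        post_to_extract(data)
    except SolrConnectionException as exc:
        # The failure is only reported as a warning; indexing continues.
        print("WARNING Error %s" % exc)
    # The default adder runs afterwards, but the text is already gone.
    return default_adder_sketch(data)


def failing_post(data):
    # Simulates the 404 seen in the Plone log.
    raise SolrConnectionException("HTTP code=404, reason=Not Found")


indexed = binary_adder_sketch(
    {"id": "FILE.txt", "SearchableText": "text extracted by Plone"},
    failing_post,
)
```

In this model `indexed["SearchableText"]` ends up empty even though Plone had already extracted the text, which matches the empty SearchableText seen in Solr.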
I fixed that by always using the `DefaultAdder` in an `overrides.zcml`:
<adapter
    factory="collective.solr.indexer.DefaultAdder"
    for="plone.dexterity.interfaces.IDexterityContent"
    name="File"
    />
<adapter
    factory="collective.solr.indexer.DefaultAdder"
    for="plone.dexterity.interfaces.IDexterityContent"
    name="Image"
    />
Now in default-Plone the file-content was indexed in solr. Nice! But there was another problem in my site.
I use `collective.dexteritytextindexer` and had it enabled for Files because I have custom fields that I want indexed in SearchableText as well. But collective.dexteritytextindexer prevents the file itself from being indexed; only my additional fields ended up in the index. I'll find a way to work around that tomorrow.
The main point is that yes, there is a real bug in collective.solr. Until the indexer that relies on Tika is fixed we'll have to disable `BinaryAdder` (for AT), `DXFileBinaryAdder` and `DXImageBinaryAdder`. Once that is fixed the default-config needs to be extended to support `/update/extract`.
Can we move this ticket to collective.solr?
@pbauer yeah. I'd move that ticket to collective.solr.
When I use the configuration shipped with kitconcept.recipe.solr indexing Files does not work for me.
Versions tested:
When I add a simple text-file (or any other file) the default settings give me a `Not Found`.
After reading the solr-docs and the code in `collective.solr.indexer.BinaryAdder` I added the following to my `solrconfig.xml`
The result now was a `Bad Request`.
Solr does not log any error (I run solr with `./bin/solr-foreground`).
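The two responses reported here (`Not Found` with the default settings, `Bad Request` after editing the config) point at different stages of the request. The sketch below, using only the standard library, builds an extract URL and gives a rough reading of the status code. The host, port and the `literal.id`/`commit` parameters are assumptions for illustration and not necessarily what collective.solr sends; `build_extract_url` and `interpret_status` are hypothetical helpers.

```python
from urllib.parse import urlencode


def build_extract_url(core_url, doc_id, commit=True):
    # core_url is assumed to look like http://localhost:8983/solr/plone
    params = {
        "literal.id": doc_id,  # attach the document id as a literal field
        "commit": "true" if commit else "false",
    }
    return core_url.rstrip("/") + "/update/extract?" + urlencode(params)


def interpret_status(code):
    # Rough reading of the response code, not an exhaustive diagnosis.
    if code == 404:
        return "no handler answered at this path"
    if code == 400:
        return "a handler answered but rejected the request"
    if 200 <= code < 300:
        return "request accepted"
    return "unexpected status"


url = build_extract_url("http://localhost:8983/solr/plone", "/Plone/de/FILE.txt")
print(url)
print(interpret_status(404))
print(interpret_status(400))
```

Read this way, the change from `Not Found` to `Bad Request` would suggest the added configuration made the handler reachable, while something in the request itself is still being rejected.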