mediawiki4intranet / TikaMW

TikaMW extension
http://wiki.4intra.net/TikaMW

compatibility heads-up for MediaWiki 1.22 #2

Open pjhinton opened 10 years ago

pjhinton commented 10 years ago

This issue is just being submitted for archival purposes. There is no expectation for resolution.

A commit from Aug. 2013 (https://github.com/wikimedia/mediawiki-core/commit/2bda9a37fe4a17c9c2c800d8fe052c583e4d7e2d) that went live in the MediaWiki 1.22 release will break TikaMW's index updating. Support is being dropped for the "SearchUpdate" hook, and there's no clear replacement for it.

vitalif commented 10 years ago

Why no clear replacement? As I understand it, the search update has just been moved from the hook into a method of the search backend class in some way.

vitalif commented 10 years ago

Oops, sorry, I thought you were talking about SphinxSearchEngine... :)

vitalif commented 10 years ago

The replacement will just need a patch then :) One more proof OOP is non-extensible :)

vitalif commented 10 years ago

Sent a message to wikitech-l.

AmazingTrans commented 8 years ago

I did a PDF search on mediawiki4intranet, and it seems that TikaMW works on 1.26. Is that true? Or does it only work because the PDF index was created when it was v1.22?

I see that the bundle is meant for the Windows version. I currently have a Linux MW that has been running for years. We would like to implement PDF search. Would your extension work?

I will need a JVM. Would the Tika jar file that you provide work on Linux as long as I install the Java JRE for Linux?

I tried running the jar file on my server:

java -jar tika-app-1.2-fix-TIKA709-TIKA964.jar -p 127.0.0.1:8072 -t -eutf-8

I got:

Exception in thread "main" java.lang.NumberFormatException: For input string: "eutf-8"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:492)
        at java.lang.Integer.parseInt(Integer.java:527)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)

Hope to hear from you.

vitalif commented 8 years ago

The Mediawiki4Intranet bundle has a patch which brings the hook back :-) About your Tika error, try "-e utf8" instead of "-eutf8"...

vitalif commented 8 years ago

Also, the bundle is primarily for Linux; the Windows bundle is just a simple way for Windows users to try it.

AmazingTrans commented 8 years ago

I followed the instructions below. I started the Tika server manually first, but I still get an error. Hope to get this running; it seems your Tika solution works best for an intranet.

root@linux:/home/bitnami/tmp# java -jar tika-app-1.2-fix-TIKA709-TIKA964.jar -p 127.0.0.1:8072 -t -e utf8
Exception in thread "main" java.lang.NumberFormatException: For input string: "utf8"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:492)
        at java.lang.Integer.parseInt(Integer.java:527)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)

So far, I have manually installed JVM and downloaded Tika from your site.

MANUAL INSTALLATION:
1) Install Java Virtual Machine (JVM) on the server 
sudo apt-get install default-jre
2) Download a fixed copy of Apache Tika application, to /home/tmp/
http://code.google.com/p/mediawiki4intranet/downloads/detail?name=tika-app-1.2-fix-TIKA709-TIKA964.jar
3) Started tika manually but with error
java -jar tika-app-1.2-fix-TIKA709-TIKA964.jar -p 127.0.0.1:8072 -t -e utf8
4) Put following lines into your LocalSettings.php:
require_once "$IP/extensions/TikaMW/TikaMW.php";
// Server address, should be same as one on the tika-app.jar command line
$egTikaServer = '127.0.0.1:8072';
// If your Tika is newer and supports more formats than 1.2,
// you can override supported mime types with $egTikaMimeTypes (see below).
5) If you install it on a MediaWiki that already has uploaded files, you should
   rebuild the fulltext index - use maintenance/rebuildtextindex.php on stock
   mediawiki, extensions/SphinxSearchEngine/rebuild-sphinx.php if you use
   SphinxSearchEngine or maybe other script for some other engine.

AmazingTrans commented 8 years ago

I believe I got it started manually. Please refer to the picture here: http://snag.gy/3WPEQ.jpg

I restarted the Apache server, accessed the wiki, saw the extension listed, and uploaded a file. I tried searching for the content of the file, but it could not find anything. What should I see when I upload? Do I need the modified Sphinx extension from your site too, or can I keep the wiki's current Elasticsearch?

When does Tika know to process the file and store the text in the DB?

vitalif commented 8 years ago

No, you don't need the SphinxSearchEngine extension, but you need to patch your MW core. Add the following line:

wfRunHooks( 'SearchUpdate', array( $this->id, $this->title, &$text, $this->content ) );

after

$text = $search->getTextFromContent( $this->title, $this->content );

in includes/deferred/SearchUpdate.php

This is the hook whose removal is discussed in this issue.

About the tika jar options, '-eutf8' is the correct syntax and '-e utf8' is not.
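
As a rough sketch of that difference: the encoding has to be glued to -e as a single argument, since a separate "utf8" argument is what produced the NumberFormatException earlier in this thread. The helper name tika_cmd is made up for illustration; the jar name and address are the ones already used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of TikaMW): prints the launch command with the
# encoding attached directly to -e, i.e. -eutf8 rather than "-e utf8".
tika_cmd() {
    jar="$1"   # path to the Tika jar
    addr="$2"  # host:port, should match $egTikaServer in LocalSettings.php
    enc="$3"   # output encoding
    printf 'java -jar %s -p %s -t -e%s\n' "$jar" "$addr" "$enc"
}

tika_cmd tika-app-1.2-fix-TIKA709-TIKA964.jar 127.0.0.1:8072 utf8
```

The printed line can then be run by hand from the directory that holds the jar.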

AmazingTrans commented 8 years ago

Hmm, this time I tried rebuilding the index, and this is the error I got:

root@linux:/opt/bitnami/apps/mediawiki/htdocs/maintenance# php rebuildtextindex.php
Clearing searchindex table...Done
Rebuilding index fields for 165 pages...
[f6af7f57] [no req]   MWException from line 220 of /opt/bitnami/apps/mediawiki/htdocs/includes/Hooks.php: Detected bug in an extension! Hook efTikaSearchUpdate has invalid call signature; Parameter 4 to efTikaSearchUpdate() expected to be a reference, value given
Backtrace:
#0 /opt/bitnami/apps/mediawiki/htdocs/includes/GlobalFunctions.php(4022): Hooks::run(string, array, NULL)
#1 /opt/bitnami/apps/mediawiki/htdocs/includes/deferred/SearchUpdate.php(102): wfRunHooks(string, array)
#2 /opt/bitnami/apps/mediawiki/htdocs/maintenance/rebuildtextindex.php(119): SearchUpdate->doUpdate()
#3 /opt/bitnami/apps/mediawiki/htdocs/maintenance/rebuildtextindex.php(74): RebuildTextIndex->populateSearchIndex()
#4 /opt/bitnami/apps/mediawiki/htdocs/maintenance/doMaintenance.php(103): RebuildTextIndex->execute()
#5 /opt/bitnami/apps/mediawiki/htdocs/maintenance/rebuildtextindex.php(163): require_once(string)
#6 {main}

And on the Java end, it reports this:
INFO - unsupported/disabled operation: EI

vitalif commented 8 years ago

Oops, I had a mistake in my patch. Change the line in SearchUpdate.php to:

wfRunHooks( 'SearchUpdate', array( $this->id, $this->title->getNamespace(), $this->title, &$text, $this->content ) );

This should work...
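
Before rebuilding the index, it may be worth confirming that the hook call really is in the patched file. This is only a sketch: has_search_update_hook is a made-up helper name, and the path assumes the current directory is the MediaWiki root.

```shell
#!/bin/sh
# Hypothetical helper (not part of MediaWiki or TikaMW): succeeds only when the
# given file contains a wfRunHooks call for the 'SearchUpdate' hook.
has_search_update_hook() {
    grep -q "wfRunHooks( *'SearchUpdate'" "$1"
}

# Assumed location, relative to the MediaWiki root directory.
file="includes/deferred/SearchUpdate.php"
if [ -f "$file" ] && has_search_update_hook "$file"; then
    echo "SearchUpdate hook call found in $file"
else
    echo "SearchUpdate hook call not found in $file"
fi
```

The check only looks for the call itself; it does not verify that the argument list matches what efTikaSearchUpdate() expects.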

AmazingTrans commented 8 years ago

I was able to rebuild the index successfully, but the java -jar process prints the following. Do you know what causes this?

INFO - unsupported/disabled operation: EI

Also, after I uploaded the file, I tried searching for text in the associated file, but TikaMW doesn't seem to have indexed it yet. On the other hand, if I run "maintenance/rebuildtextindex.php", then the PDF becomes searchable. Do I have to run a cron job to run this script after every upload?

Last but not least: after the search, I noticed it only shows the file. Would it be possible to list 2 or 3 lines showing the sentence where the match occurs, like Google does? (Just a suggestion.) :)

Really hope I can make this work, since this seems to be the only solution out there. :)

vitalif commented 8 years ago

> INFO - unsupported/disabled operation: EI

Don't know, and I think it doesn't affect anything.

> Do I have to run a cron job to run this script after every upload?

The moment of indexing depends entirely on the search engine used. So if your search extension updates wiki pages in real time, it should also index files in real time.

> Last but not least: after the search, I noticed it only shows the file. Would it be possible to list 2 or 3 lines showing the sentence where the match occurs, like Google does? (Just a suggestion.) :)

This can't be implemented by TikaMW alone... snippets are search-engine-specific. Check whether your extension (CirrusSearch?) can store the indexed text at indexing time and use it to generate snippets.

AmazingTrans commented 8 years ago

vitalif, I currently only have the standard MediaWiki search engine, and it doesn't seem to index the files that Tika handles in real time. On the other hand, the search engine does index everything else in real time, such as page creation, image uploads, etc. So I'm not sure why the PDF is not indexed in real time. Ideas?

Should I use the SphinxSearchEngine / CirrusSearch that you have? Would that be much better and faster than the standard one?

vitalif commented 8 years ago

Yes, Elasticsearch and Sphinx are much faster than MySQL fulltext search. SphinxSearchEngine does not store indexed texts in the index, though, and so does not provide PDF snippets. In fact, it's the standard MySQL fulltext search that 100% stores these texts and can in theory generate snippets from PDF contents :)

Are your PDFs indexed non-realtime or not indexed at all? And what does "non-realtime" mean: how much time do they take to index? Also try running maintenance/runJobs.php and check whether it makes them index.

AmazingTrans commented 8 years ago

Hmm, so I guess Elasticsearch it is. For the past 4 hours, the uploaded file has not been indexed at all. But I can search for the file name using the search bar.

I tried running maintenance/runJobs.php, but it is still not indexed.

The only way for me to index it is to run the following script: "php maintenance/rebuildtextindex.php"

vitalif commented 8 years ago

Interesting, I've checked my own wiki and discovered the same bug: files were indexed after editing the file page, but not during the initial upload. The problem was in the File objects cache, which was returning a cached "file does not exist" state to TikaMW during upload. Fixed that in master by explicitly forcing the latest revision via a flag... so check the new version, it should work.

Does Elasticsearch mean you're using the CirrusSearch extension?

AmazingTrans commented 8 years ago

Hmmm, could you show me where in master I should make the fix to set the flag? Yes, CirrusSearch. It seems CirrusSearch uses Elasticsearch.

vitalif commented 8 years ago

Just update TikaMW from git. Master = default branch

AmazingTrans commented 8 years ago

Gotcha, pardon my lack of knowledge. :) Looks like everything is working!

  1. I have signed up for wiki.4intra.net, but that site doesn't send me a confirmation email. Is there any other place where we can write to you other than GitHub?
  2. In one of the notes, you mention newer Tika versions:
     // If your Tika is newer and supports more formats than 1.2,
     // you can override supported mime types with $egTikaMimeTypes (see below).
     Can we use the normal jar file provided by https://tika.apache.org/download.html (version 11)? Or is the jar file you provide modified in some way? I tried the same command that your file suggests, but I think the -p and -eutf8 options didn't go through.
  3. Also, is it possible for TikaMW to index a specific directory instead of going through file upload? We have a directory that contains a lot of PDF files, and we would prefer not to upload them to MW for indexing, but still use MW to search those files.

vitalif commented 8 years ago

> I have signed up for wiki.4intra.net, but that site doesn't send me a confirmation email. Is there any other place where we can write to you other than GitHub?

Did you check your spam folders? I have no problem with receiving emails from that server.

> Can we use the normal jar file provided by https://tika.apache.org/download.html (version 11)?

I don't know exactly. I've heard newer Tika versions use REST, and TikaMW is currently only compatible with the plain socket service. Old Tika versions with the plain socket interface, however, had some bugs, so it seems TikaMW may be compatible only with the supplied version.

> Also, is it possible for TikaMW to index a specific directory instead of going through file upload? We have a directory that contains a lot of PDF files, and we would prefer not to upload them to MW for indexing, but still use MW to search those files.

This is not possible because MW can't search files it knows nothing about...

vitalif commented 8 years ago

Sorry for the late answer ))