toldani / sm-transition

Scripts and SQL queries for transitioning the sciencemadness XMB database to phpBB
1 stars 0 forks source link

Search #5

Open streety opened 6 years ago

streety commented 6 years ago

Currently search seems to be incomplete. Does the search index need to be rebuilt?

Should we switch to postgresql full text?

toldani commented 6 years ago

I'm actually using MySQL, since I've wanted to get more experience with it. So I set it up to create a full-text search index, which might take a while but it's doing it. For future reference, this part should probably be done last, right before the board goes live.

streety commented 6 years ago

The search seems to have the complete index now but it is slow. Actually, for a simple/common phrase it timeouts entirely. If you search for 'ethanol distillation' it will load for >30 secs and then return a blank page.

toldani commented 6 years ago

The MySQL fulltext search worked, but left a lot to be desired. Currently trying to get sphinxsearch working:

http://sphinxsearch.com/docs/current.html

If we can get Sphinx powering the fulltext search, that'd blow away all the other search options that are available. I should have reopened this as soon as I started working on it.

streety commented 6 years ago

It doesn't seem to get the attention of Lucene, Solr, etc but I was very happy with it when I used it back in 2010. I'm sure it will work well here.

toldani commented 6 years ago

Any chance you could take a look at the issue with Sphinx? It's set up to emulate MySQL on port 9306 and has its regular interface on port 9312. You can query it by running mysql -h0 -P9306 then entering something like:

SELECT * FROM index_phpbb_sm_main WHERE MATCH('propylene carbonate');

Config file is at /etc/sphinxsearch/sphinx.conf. Relevant phpBB source files are at: /phpbb/search/fulltext_sphinx.php /phpbb/search/sphinx/* /includes/sphinxapi.php

I haven't set up refreshing with crontab yet, but it's done the first initial scan of the database easily enough. Then I can start work on a new parsing system, which I'm sure I'd be able to make much faster progress on.

For future reference though, we probably should configure the search last. Reason being, in-memory databases like Redis or Sphinx are obviously memory hogs, and so is running a virtual machine. So it probably makes the most sense to not set up that type of search until after we're done pulling data from the VM.

streety commented 6 years ago

Okay, I'll take a look at this. Might not have time today but definitely later in the week.

On Mon, Oct 1, 2018, 9:30 AM Tom Oldani notifications@github.com wrote:

Any chance you could take a look at the issue with Sphinx? It's set up to emulate MySQL on port 9306 and has its regular interface on port 9312. You can query it by running mysql -h0 -P9306 then entering something like:

SELECT * FROM index_phpbb_sm_main WHERE MATCH('propylene carbonate');

Config file is at /etc/sphinxsearch/sphinx.conf. Relevant phpBB source files are at: /phpbb/search/fulltext_sphinx.php /phpbb/search/sphinx/* /includes/sphinxapi.php

I haven't set up refreshing with crontab yet, but it's done the first initial scan of the database easily enough. Then I can start work on a new parsing system, which I'm sure I'd be able to make much faster progress on.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/toldani/sm-transition/issues/5#issuecomment-425908137, or mute the thread https://github.com/notifications/unsubscribe-auth/AAs75XRqt2DBBclH4rnJC4SrYpFDcvP0ks5ughj7gaJpZM4XAgQL .

streety commented 6 years ago

I started the sphinx service and ran a few queries. Looking at the logs at /var/log/sphinxsearch/query.log the issue doesn't seem to be with sphinx. The queries are running quickly, in a few hundredths of a second.

I couldn't connect to sphinx via the mysql command. I haven't looked into that further. Doesn't seem to be vital.

searchd currently seems to be using 1.8% memory. The VM is using 22.1% and 17.17, mysql is using 19.1% for comparison.

I have not had any queries timeout but some remain slow.

I'll have to take a break for a while. Next step is probably profiling the PHP side. Will hopefully return late afternoon/early evening.