Facet calculation seems tied to search query rather than result set

minusdavid commented 2 years ago

I have a Zebra database with over 1 million records that contain the word "the". If I do a complex query like this:

@attrset Bib-1 @not @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "Yogi the bear" @attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "Yogi the bear" @attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "Yogi the bear" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "Yogi the bear" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "yogi? the? bear? " @attr 4=6 @attr 9=14 @attr 2=102 "Yogi the bear" @attr 1=9011 @attr 14=1 1

It takes about 30 seconds to return with a hit count of 3325. Getting a facet response takes at least 60 seconds using yaz-client. (Unable to get the Perl ZOOM libraries to return a facet response even with connection timeouts above 60 seconds.)

If I do a very similar query without the "the":

@attrset Bib-1 @not @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "Yogi bear" @attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "Yogi bear" @attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "Yogi bear" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "Yogi bear" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "Yogi? bear? " @attr 4=6 @attr 9=14 @attr 2=102 "Yogi bear" @attr 1=9011 @attr 14=1 1

It returns instantly with a hit count of 3325. Getting a facet response takes about 2 seconds using yaz-client. (Perl ZOOM libraries cope easily.)

--

Since the result set should be the same for both queries, it seems that the facet calculation cannot be based on the result set alone, and must involve the records that contribute to the creation of the result set.

I don't know enough about Zebra's internals to troubleshoot this one too much further.

minusdavid commented 2 years ago

Actually, my counts were slightly off. The "the" query returned 3323 results, while the query without "the" returned 3325 results. Comparing the Zebra facet responses, the difference in term counts is much greater than 2, although I'm guessing the term occurrence counts are based on indexed values rather than records...

--

The Zebra configuration is using "facetNumRecs:1000" so in theory that should limit it further?

It collects 20 terms as per "int no_collect_terms = 20" in index/retrieve.c...

At a glance, the "term_collect_freq" function looks like it should use the result set. But beyond that it starts getting a bit obscure for me.

Do you know what might be causing this large difference in facet calculation times?

minusdavid commented 2 years ago

I cloned idzebra, added some additional logging, statically compiled, and then ran on a 1,000,000+ records Zebra database.

For 20 facet terms, it seems to be taking about 3 seconds per term, which then aggregates up to that 60+ seconds.

But I must not be logging the right thing as the output looks the same for the 60 second facet generation as the 2 second facet generation... aside from the first one being much slower...

minusdavid commented 2 years ago

The slowdown appears to be in index/zsets.c in the zebra_count_set function.

In the query without "the", the while loop with rset_read executes quickly and with a small number of iterations.

However, for the query with "the", the while loop takes a long time. The 1st iteration of rset_read can sometimes take up to 2 seconds, and the loop iterates many more times (while ultimately ending up with the same occurrence count).
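
A rough way to picture that observation (same count, many more iterations) is the toy loop below. It is not the code from index/zsets.c: it only assumes that the counting loop reads one key at a time and bumps the count whenever the record id changes. The function name, the key layout and the sample data are all made up for illustration.

```python
def count_records(keys):
    """Count distinct record ids in a stream of (record_id, position) keys.

    Assumes the keys arrive sorted by record id, so a new record is
    detected whenever the id differs from the previous one.
    """
    hits = 0
    iterations = 0
    previous = None
    for record_id, _position in keys:
        iterations += 1
        if record_id != previous:
            hits += 1
            previous = record_id
    return hits, iterations


# Hypothetical data: three records, one key each.
few_keys = [(1, 0), (2, 0), (3, 0)]
# The same three records, but with 1000 keys per record.
many_keys = [(rec, pos) for rec in (1, 2, 3) for pos in range(1000)]

print(count_records(few_keys))   # same hit count ...
print(count_records(many_keys))  # ... but far more iterations
```

In this model the work is proportional to the number of keys read, not to the number of records finally counted, which would be consistent with the two queries reporting the same occurrence count at very different speeds.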

minusdavid commented 2 years ago

The "zebra_count_set" is called from "freq_term" in index/retrieve.c via the following:

zebra_count_set(zh, rset, &hits, zh->approx_limit);

The issue must be with the result set in rset then...

Over the past 40 minutes, I've managed to create 4 temporary result files that are 1.3GB in size, as per https://github.com/indexdata/idzebra/issues/33

minusdavid commented 2 years ago

So going back to the original result set... it takes nearly 30 seconds to do that lookup for "the", which probably makes sense as there are over 1,000,000 records that contain "the", although historically I thought Zebra was supposed to be able to process that many records quickly...

14:56:47-24/05 zebrasrv(1) [log] dict_lookup_grep: (\x01\x01\x03)(th(e|\xC3\xA9|\xC3\xA8|\xC3\xAA|\xE1\xBA\xBD|\xC4\x95|\xC4\x99|\xC4\x97|\xC4\x9B|\xC8\x85|\xC8\x87).*)

minusdavid commented 2 years ago

Ah not only that... but now that I look at that regex.. it's also trying all kinds of variations of "e" with accents. That must be the ICU coming into play. And then there's truncation there as well. So it's certainly doing a lot there...
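
To see how wide that pattern is, here is a small sketch that runs the pattern from the log line against a few made-up dictionary terms. Python's `re` module is only a stand-in here: Zebra uses its own matcher, so this assumes the pattern means the same thing in both. The three leading bytes are copied from the log and treated as an opaque prefix, and the sample terms are hypothetical.

```python
import re

# Pattern copied from the dict_lookup_grep log line, as a bytes regex.
PATTERN = re.compile(
    rb"(\x01\x01\x03)(th(e|\xC3\xA9|\xC3\xA8|\xC3\xAA|\xE1\xBA\xBD|\xC4\x95"
    rb"|\xC4\x99|\xC4\x97|\xC4\x9B|\xC8\x85|\xC8\x87).*)",
    re.DOTALL,
)

# Leading bytes from the log line, kept as-is (their meaning is not assumed).
PREFIX = b"\x01\x01\x03"


def matching_terms(terms):
    """Return the terms whose UTF-8 bytes (after the prefix) match the pattern."""
    return [t for t in terms if PATTERN.fullmatch(PREFIX + t.encode("utf-8"))]


# Hypothetical dictionary entries, not taken from a real Zebra index.
sample_terms = ["the", "then", "theatre", "thé", "thèse", "this", "bear", "yogi"]
matched = matching_terms(sample_terms)
print(matched)
```

Because of the trailing `.*`, every term that starts with "th" plus any of the listed "e" variants matches, so "then", "theatre" and "thèse" are pulled in alongside "the" itself.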

minusdavid commented 2 years ago

When I look at "freq_term" in index/retrieve.c, I can get the correct hit count from "reset_set". But then I don't understand what's happening with the "rset" RSET struct that's passed to "zebra_count_set".

It looks like an empty rset is created with the original result set as its child...

minusdavid commented 2 years ago

I've run out of time but it's kind of looking like "the result set" also contains all the result sets that went into creating it?

If that's true, that might explain why two result sets with the same hit count can have vastly different faceting times?
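
The hypothesis can be illustrated with a deliberately naive model in which a result set is a lazy tree over its term lists, and every facet count re-reads that tree. None of these classes exist in Zebra; `Leaf`, `And`, `facet_counts` and the data are invented for the sketch, and real Zebra result sets may be able to skip forward rather than read every key.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Leaf:
    """Posting list for one term; counts every key read from it."""
    ids: List[int]
    reads: int = 0

    def __iter__(self):
        for record_id in self.ids:
            self.reads += 1
            yield record_id


@dataclass
class And:
    """Lazy intersection: children are re-read on every pass, nothing is cached."""
    children: list

    def __iter__(self):
        common = None
        for child in self.children:
            ids = set(child)
            common = ids if common is None else common & ids
        yield from sorted(common or ())


def count_hits(rset) -> int:
    return sum(1 for _ in rset)


def facet_counts(result_set, facet_lists):
    # One counting pass per facet term, each wrapping the original result set.
    return [count_hits(And([result_set, Leaf(ids)])) for ids in facet_lists]


# Hypothetical data: "the" occurs in every record, the other terms are rare.
the = Leaf(list(range(100_000)))
yogi_a, bear_a = Leaf([5, 50, 500]), Leaf([5, 50, 500, 900])
yogi_b, bear_b = Leaf([5, 50, 500]), Leaf([5, 50, 500, 900])

with_the = And([yogi_a, the, bear_a])
without_the = And([yogi_b, bear_b])
facets = [[5, 500]] * 20  # 20 facet terms, mirroring no_collect_terms

counts_with = facet_counts(with_the, facets)
counts_without = facet_counts(without_the, facets)
reads_with = yogi_a.reads + the.reads + bear_a.reads
reads_without = yogi_b.reads + bear_b.reads
print(counts_with == counts_without, reads_with, reads_without)
```

Both trees give identical facet counts, but the one containing "the" reads its large posting list again for each of the 20 facet terms. If Zebra's result set behaves anything like this, the faceting cost would follow the size of the contributing term lists rather than the final hit count.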

It was interesting hacking on Zebra, but the code gets a bit obscure for me.

MikeTaylor commented 2 years ago

Hi, @minusdavid, and thanks for this and other well-documented issues. Sorry for the radio silence. @adamdickmeiss, who is the principal Zebra wizard, is out of the office this week. I imagine he will get back to you early next week. Sorry for the delay, and thank you for the investigations so far.

minusdavid commented 2 years ago

No worries @MikeTaylor . My apologies for all the comments! Hopefully they're helpful.

I probably don't have heaps of time to work on this particular issue, but if @adamdickmeiss can give me some guidance I think I'm in a good place to do more troubleshooting.

It's too bad I didn't set up my little Zebra dev environment sooner. I could've probably sent a PR for that memMax issue heh.

minusdavid commented 2 years ago

Is there any more word on this one?

minusdavid commented 1 year ago

Still noticing very slow facet calculation. Considerably slower than the actual search even.

mrenvoize commented 1 year ago

It would be great to see this one move forward. It might help stave off or slow the move to Elasticsearch we're seeing. I still have a soft spot for Zebra personally.

sebhammer commented 1 year ago

Hi guys,

I wish I had more exciting input, but realistically it is unlikely that we at Index Data will be able to push development resources toward enhancing this functionality in Zebra. The move towards Solr and Elasticsearch happened a decade ago, and even our own projects (like FOLIO and ReShare) tend to use those tools these days. Facets in Zebra were one of those things that were added when the Elastic writing was already on the wall, and they have not been super widely used.

I certainly also have a soft spot for Zebra, and I would love to see others pick up on some weak spots. If someone is interested and we can assist, even if just with moral support, we're more than game.

--Sebastian

On Mon, Oct 9, 2023 at 4:39 AM Martin Renvoize @.***> wrote:

It would be great to see this one move forward.. might help stave off/slow the move to elasticsearch we're seeing. I have a soft spot for Zebra still personally.

minusdavid commented 1 year ago

Thanks for getting back to us, Sebastian.

It was about a decade ago that the Koha community started using the facets in Zebra. I think we still have it on as the default option for new installations, although larger libraries have to turn it off because it's too slow on a large result set. I think that the larger databases will need to switch over to Elastic at some point. As you say, the writing is on the wall.

It's too bad though as Zebra is such a great lightweight tool. Years ago, I actually learned to read C just so that I could read Zebra source code! Zebra was the scariest/least understood part of the Koha stack, so naturally I wanted to learn everything about it.

As with yourselves, though, it's tough for me to find time to hack on Zebra/YAZ. I'll keep that moral support in mind.

indexdata / idzebra

Facet calculation seems tied to search query rather than result set #35