rstudio / bookdown

Authoring Books and Technical Documents with R Markdown
https://pkgs.rstudio.com/bookdown/
GNU General Public License v3.0
3.71k stars 1.27k forks source link

Search in bs4_book often missing results #1431

Closed jtbayly closed 1 year ago

jtbayly commented 1 year ago

Until this change is committed, search is simply broken in bs4book.

Take this book as an example (it's not running the latest version of gitbook, but this hasn't changed): https://bylaws.evangelpresbytery.com

Do a search for "budget". Search will indicate that there is only one place in the book where the word is found: Appendix 1.

In point of fact, the word "budget" is found in several places in Section 4 and Section 6.

Now change ignoreFieldNorm: true as in the patch linked above, and try your search again. Now you get all three locations as search results.

I'm sure this is biting people when they search for stuff in books and don't realize they aren't seeing all the results. But they don't know what they don't know, so people aren't reporting it.

jtbayly commented 1 year ago

For the record, this particular search failure bit me in a meeting last night. I literally trusted the search results for this word and stated that the only place where "budget" was mentioned was in Appendix 1. Yeah, I looked like an idiot. Yeah, I thought this got fixed a year ago.

cderv commented 1 year ago

Sorry for the trouble @jtbayly

We did not spend much time on bookdown last year will all the development on Quarto. I understand this is a problem, especially when using the search for exact matching and all result match.

Current opiniated design of bs4_book() was to filter out the result. So not all result will be returned. Fuse.js a scoring system (https://fusejs.io/concepts/scoring-theory.html) and 0 is the best score. We keep only result that are below 0 and 0.75.

In your example your two other match have 0.83 score. So not shown. It seems Field norm as a high impact.

I don't believe #1319 the right fix for a new default. I understand it fixes your use case, but I believe it could have side effect for other books.

I'll enquire with more knowledgable fuse.js people to understand better the scoring.

However, I think best approach would be to allow customisation for bs4_book() of the fuse.js config by user who desire a specific behavior.

jtbayly commented 1 year ago

Allowing customization would be great, but I'm struggling to see why a book using Fuse.js would not want to ignore field length.

Chapters can vary in their length dramatically. Leaving this as-is means the longer a chapter is the lower every single search result will score from that chapter. If you have a long chapter, you are always going to have trouble finding things in it, even if they are exact matches and should obviously be in the search results. It seems to me that this situation is precisely why Fuse.js offers the ability to ignore field length.

If somebody can show me an example of setting "ignoreFieldNorm: true" harming search in a book, or even explain theoretically why they don't want to ignore field length, I'd be grateful. Because if this becomes customizable, I'm going to to change it for our books, and I don't want to do it without understanding why bookdown defaults the other way. It feels like I must be missing a really big downside if people are willing to put up with exact matches of a single word not being found by "search" in a book.

cderv commented 1 year ago

I have discussed today with other, and also considering your good arguments, I believe we can make this change safely.

I merged #1319 now.

Thank a lot @jtbayly for your patience, perseverance and explanation. Sorry again for the delay on this one!

Just a reminder: the default search in bs4_book() is still not outputing all exact match because we are filtering on score. This is to avoid to much results in fuzzy searching - when searching from several terms and not just one, it makes sense.

Happy to tweak again if needed. I am still planning to allow changing configuration but that is for another time.

jtbayly commented 1 year ago

Glad to hear it. And I do think scoring results makes sense, and wherever the cutoff is it's always going to be a tradeoff.

github-actions[bot] commented 7 months ago

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.