rails / sdoc

Standalone sdoc generator
http://api.rubyonrails.org/

Rewrite search #298

Closed jonathanhefner closed 1 year ago

jonathanhefner commented 1 year ago

Prior to this commit, SDoc's search algorithm was implemented by searcher.js, which builds a regular expression for each token in the query. For example, the query "foo bar" generates the regular expressions /([f])([^f]*?)([o])([^o]*?)([o])([^o]*?)/i and /([b])([^b]*?)([a])([^a]*?)([r])([^r]*?)/i. These regular expressions fuzzy match queries with missing letters, but fail for any other kind of typo, such as added or swapped letters. They can also produce surprising results. For example, the query "ActiveRecord::Base" returns ActiveRecord::AttributeAssignment as the top result, because the pattern matches "ActiveRecord::" followed by the letters "b", "a", "s", "e" scattered through "AttributeAssignment" (Attri[b]ute[A]s[s]ignm[e]nt), and there are six(!) other results that appear before ActiveRecord::Base.
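To make the failure modes concrete, here is a minimal sketch that rebuilds the per-token pattern shown above. The helper name `buildFuzzyRegExp` is hypothetical and this is not the actual searcher.js code; it only reproduces the pattern shape quoted in this description.

```javascript
// Hypothetical helper (not the real searcher.js implementation): builds the
// per-token pattern described above, one "([c])([^c]*?)" pair per character.
function buildFuzzyRegExp(token) {
  const source = Array.from(token, (ch) => {
    // Escape the characters that are special inside a character class.
    const escaped = ch.replace(/[\\\]^-]/g, "\\$&");
    return `([${escaped}])([^${escaped}]*?)`;
  }).join("");
  return new RegExp(source, "i");
}

// Same pattern as quoted for the token "foo".
console.log(buildFuzzyRegExp("foo").source);
// ([f])([^f]*?)([o])([^o]*?)([o])([^o]*?)

// A missing letter still matches, but swapped letters do not.
console.log(buildFuzzyRegExp("bse").test("Base"));  // true
console.log(buildFuzzyRegExp("baes").test("Base")); // false

// The surprising match: the letters of "base" are found scattered
// through "AttributeAssignment".
console.log(
  buildFuzzyRegExp("ActiveRecord::Base").test("ActiveRecord::AttributeAssignment")
); // true
```

Because each gap group is lazy and unbounded, any entry that contains the query's letters in order counts as a match, which is why unrelated long names can match a short query.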

This commit implements a new search algorithm based on character-level bigrams. For example, the query "foo bar" will look for results that match "fo", "oo", "o ", " b", "ba", and "ar". Shorthand bigrams for CamelCase names are also included in the search index. For example, entries containing "ActiveRecord" are also associated with the bigram "ar". Bigrams are weighted such that some contribute more to the match score, and results are ordered by match score.
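The bigram approach can be sketched roughly as follows. This is an illustration only, not the code in this commit: the function names, the rule for deriving CamelCase shorthand bigrams from capital letters, and the uniform default weight are all assumptions made for the example.

```javascript
// Character-level bigrams of a string, lowercased.
// "foo bar" -> "fo", "oo", "o ", " b", "ba", "ar"
function bigrams(text) {
  const lower = text.toLowerCase();
  const result = [];
  for (let i = 0; i + 1 < lower.length; i++) {
    result.push(lower.slice(i, i + 2));
  }
  return result;
}

// Assumed shorthand rule: pair up the capital letters of a CamelCase name,
// so "ActiveRecord" also yields "ar".
function camelCaseBigrams(name) {
  const capitals = name.match(/[A-Z]/g) || [];
  return bigrams(capitals.join(""));
}

// All bigrams associated with an index entry.
function indexBigrams(name) {
  return new Set([...bigrams(name), ...camelCaseBigrams(name)]);
}

// Match score: sum of the weights of the query bigrams found in the entry.
// weightOf is a hypothetical hook; the real weighting scheme is not shown here.
function score(query, name, weightOf = () => 1) {
  const entry = indexBigrams(name);
  let total = 0;
  for (const bigram of new Set(bigrams(query))) {
    if (entry.has(bigram)) total += weightOf(bigram);
  }
  return total;
}

const query = "ActiveRecord::Base";
console.log(score(query, "ActiveRecord::Base"));                // 17
console.log(score(query, "ActiveRecord::AttributeAssignment")); // 14
```

Even with uniform weights, the exact name scores higher than ActiveRecord::AttributeAssignment, because the bigrams ":b", "ba", and "se" only occur in the former.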

Here are some example queries and their top results with rails/rails@7c65a4b83b583f4f27f3f20a9fb078b35823d2fe both before and after this commit:

This commit also redesigns the presentation of search results. Prior to this commit, result names were cut off at ~43 characters, and result descriptions were cut off at ~53 characters. Result descriptions also included headings, further reducing the relevant visible text. For example, the visible description for ActionCable::Connection::Base, which has the heading "Action Cable Connection Base", was "Action Cable Connection Base For every WebSocket". In addition, result descriptions included code blocks, which were then mangled by Searchdoc.Panel's stripHTML function. For example, the description for ActiveModel::API::new was

  <p>Initializes a new model with the given <code>params</code>.

  <pre><code>class Person
    include ActiveModel::API
    attr_accessor ...
  </code></pre>

which was transformed to

  Initializes a new model with the given params.

  <codeclass Person
    include ActiveModel::API
    attr_accessor ...
  </pre

With this commit, search results always display the full name. Result descriptions are also displayed in full, including non-link HTML, and now consist of (up to) the first 130 characters of the leading paragraph of the RDoc comment. For example, the description of ActiveModel::API::new becomes "Initializes a new model with the given `params`."


If you find any queries that give unexpected results, please share, and I will see if they can be improved.