philomena-dev / philomena

Next-generation imageboard
GNU Affero General Public License v3.0

Implement searching text (descriptions, comments, posts) by length and by word count #157

Closed PubliqPhirm closed 3 months ago

PubliqPhirm commented 2 years ago

Is your feature request related to a problem? Please describe.

Implementing these searches would help with implementing some badges on Derpibooru.

Describe the solution you'd like

  1. Implement searching by raw_length. This is the number of characters in the textbox, including Markdown commands and URLs.
  2. Allow searching on word_count.
    • Does not need to be exact on what counts as a word, but should not count most ASCII art or Markdown formatting directives
    • Does need to count non-English words (don't ask me how to count "words" in Chinese, but Cyrillic words are every bit as space-delimited as Spanish or Latin and should be properly counted)
    • Probably shouldn't count standalone emojis and numbers as "words", but this is less important.
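
A minimal sketch of what raw_length could mean, since "number of characters" can be read two ways. `RawLengthSketch` is a hypothetical module, not part of Philomena, and the choice of graphemes over bytes is an assumption of this sketch.

```elixir
defmodule RawLengthSketch do
  # Hypothetical helper, not Philomena code.
  # String.length/1 counts user-perceived characters (graphemes), so
  # Markdown markers and URLs are counted like any other text.
  def raw_length(text) when is_binary(text), do: String.length(text)
end

text = "# Привет"
IO.inspect(RawLengthSketch.raw_length(text))  # 8 graphemes
IO.inspect(byte_size(text))                   # 14 bytes in UTF-8
```

The two numbers diverge for any non-ASCII text, so whichever definition is chosen should be applied consistently at index time and in the search documentation.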

Describe alternatives you've considered

N/A?

Additional context

Searching by word_count is not meant to be used with exact numbers, as in word_count:420. Instead, both these search methods are intended to be used more or less exclusively with the inequality matches, as in raw_length.gte:6969.

I've provided a first swing at defining a word counting function below.

  def word_count(phrase) do
    phrase
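    # split on runs of whitespace (spaces, tabs, newlines)
    |> String.split()
    # keep only tokens containing a word character; this drops most ASCII art,
    # bare Markdown markers, and standalone emojis, but still counts numbers.
    # The `u` modifier makes \w match non-ASCII letters such as Cyrillic.
    |> Enum.count(&String.match?(&1, ~r/\w/u))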
  end

It does a good enough job of finding the words as well as ignoring leading # in headers. Most other syntax (_italic_, **bold**, [a](link), etc…) doesn't actually need to be filtered because it is glued to its parent word as far as this function is concerned.
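
To make the claim concrete, here is a small sketch that runs the function above on a few inputs. `WordCountSketch` is a hypothetical wrapper module, and the regex uses the `u` modifier so that `\w` also matches non-ASCII letters; without that modifier, Cyrillic-only tokens would not be counted.

```elixir
defmodule WordCountSketch do
  # Hypothetical wrapper module so the function can be called directly.
  def word_count(phrase) do
    phrase
    |> String.split()
    # count tokens with at least one Unicode-aware word character
    |> Enum.count(&String.match?(&1, ~r/\w/u))
  end
end

# The leading "#" is its own token and is dropped; "**bold**" stays glued to its word.
IO.inspect(WordCountSketch.word_count("# A **bold** claim"))  # 3
# Cyrillic words are counted; the "---" rule is not.
IO.inspect(WordCountSketch.word_count("Привет, мир! ---"))     # 2
# A Markdown link is a single token.
IO.inspect(WordCountSketch.word_count("[a](link) here"))       # 2
```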

If you wanted to be extra-nerdy, perhaps a search by unique word count could also be added via something similar to this snippet I found while trying to remember how regex worked in Elixir.
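
One possible shape for a unique word count is sketched below. This is not the linked snippet; `UniqueWordCountSketch` and its tokenizing regex are assumptions made for illustration. It extracts runs of Unicode letters and digits, so surrounding Markdown punctuation is stripped before comparing words.

```elixir
defmodule UniqueWordCountSketch do
  # Hypothetical helper, not the snippet referenced above.
  def unique_word_count(phrase) do
    # Runs of Unicode letters/digits; lowercased first so "Cat" and "cat" match.
    ~r/[\p{L}\p{N}]+/u
    |> Regex.scan(String.downcase(phrase))
    |> List.flatten()
    |> Enum.uniq()
    |> length()
  end
end

IO.inspect(UniqueWordCountSketch.unique_word_count("The cat and the **Cat**"))  # 3
```

Note that this tokenizer splits on apostrophes, so "don't" contributes "don" and "t"; that is probably acceptable for a count that is only used with inequality searches.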

Meow commented 3 months ago

As mentioned in the other issue, this adds too much overhead for something considered a niche feature. It would totally make sense if we were a fiction site or a site focused on text communication, but for an imageboard it is a bit redundant.