signalapp / Signal-Desktop

A private messenger for Windows, macOS, and Linux.
https://signal.org/download
GNU Affero General Public License v3.0
Unable to search for punctuation characters, namely `~!@#$%^&*()_+{}|:"<>?-=[]\;',./` #5964

Open kkysen opened 2 years ago

kkysen commented 2 years ago

Bug Description

I can't search for any punctuation characters, namely ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./, in my messages. Using $ as an example, I have messages that contain things like $2, but when I try to search for $ or $2, either in all of my messages, or in a specific conversation, it just says No results for "$" or No results for "$" in ${conversation_name}, respectively. I'm not sure why it refuses to textually search for punctuation characters. When I search for a string containing a $ character, such as $2, it just skips the $ and searches only for the 2. Also, it does search for these punctuation characters in contact names, but not within messages.

I also observed this on Signal Android, though I only tested $ there. I assume it works the same for all the other punctuation.

Steps to Reproduce

  1. Send a message containing the string ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./.
  2. For each character in ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./, search for that character.

Actual Result:

It says No results for "${character}" for each character in ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./.

Expected Result:

Show all messages that contain these characters, which should include at least the message containing the string ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./.

Screenshots

Platform Info

Signal Version: 5.45.0 production

Operating System: Windows 11

Linked Device Version: 5.39.3 Android

Link to Debug Log

Dyras commented 2 years ago

Possibly unrelated, but most emojis can't be searched for. Whatever the reason, you can search for 🤔 but not 😏😉😳

penryu commented 2 years ago

Given the following code from ts/util/cleanSearchTerm.ts:

  const withoutSpecialCharacters = lowercase.replace(
    /([-!"#$%&'()*+,./\\:;<=>?@[\]^_`{|}~])/g,
    ' '
  );

this appears to be by design. However, there have been updates to the regex in the past, so this might not be immutable. Can the dev team confirm if AND how these characters would be retained in the search query?
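To make the effect concrete, here is a minimal sketch of what that replacement does to a query. The regex is copied from the snippet above; the function name `stripSpecialCharacters` and the final collapse-and-trim step are my own assumptions for illustration, not necessarily what the app does.

```typescript
// Regex copied from the quoted snippet in ts/util/cleanSearchTerm.ts.
const SPECIAL_CHARACTERS = /([-!"#$%&'()*+,./\\:;<=>?@[\]^_`{|}~])/g;

// Hypothetical helper name; only the replace() call mirrors the quoted code.
function stripSpecialCharacters(query: string): string {
  const lowercase = query.toLowerCase();
  // Every listed punctuation character becomes a single space.
  const withoutSpecialCharacters = lowercase.replace(SPECIAL_CHARACTERS, ' ');
  // Assumption: collapse leftover whitespace so the result is easy to read.
  return withoutSpecialCharacters.replace(/\s+/g, ' ').trim();
}

console.log(JSON.stringify(stripSpecialCharacters('$2'))); // prints "2"
console.log(JSON.stringify(stripSpecialCharacters('$')));  // prints ""
```

This matches the reported behavior: a query of `$2` is reduced to `2`, and a query of `$` alone is reduced to nothing searchable.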

@Dyras As for emoji, I can confirm that all of those emoji do survive the query sanitizing method (which uses the regex above), but for whatever reason (unicode codepoint support?) only the thinking-face emoji successfully matches in the SQL backend.

kkysen commented 2 years ago

@penryu, thank you for finding the root cause!

Also, can the dev team explain why these characters need to be removed from the search query for messages, especially when they are not removed from the search query for contact names?

kkysen commented 2 years ago

@scottnonnenberg-signal, it appears this punctuation sanitization was added by you in this commit: https://github.com/signalapp/Signal-Desktop/blob/b3ac1373fa64117fe2a9ccfddf3712f1826c06d9/ts/util/cleanSearchTerm.ts#L1-L24

If you could explain why this is needed, and if there's a way to get around it, that would be great. Thank you!

From that cleanSearchTerm function, especially the token filtering part, it seems like the search query might be parsed and interpreted somewhere. If that's the case, is there a way to turn that off and do a pure literal search, including non-collapsed whitespace and those tokens, and, or, not, near, besides just the punctuation? It's highly non-intuitive to not be able to search for simple words like and and not, especially when there's no error message explaining what happened, nothing listed about this in Signal documentation anywhere easily searchable, or even comments explaining it in the code that does this. If keeping such "smart" query functionality available is important, I'd appreciate at least a way to turn it off and do a raw, literal search, or at the very least, something explaining why that can't be done.
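For anyone following along, this is roughly what I mean by token filtering. It is only a sketch of the idea: the token list is the one mentioned above, while the name `dropReservedTokens` and the whitespace splitting are hypothetical and may not match the real function.

```typescript
// Tokens named in this thread; treated here as reserved query operators.
const RESERVED_TOKENS = new Set(['and', 'or', 'not', 'near']);

// Hypothetical helper: splits a query on whitespace and drops reserved words.
function dropReservedTokens(query: string): string[] {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter(token => token.length > 0 && !RESERVED_TOKENS.has(token));
}

console.log(dropReservedTokens('cats and dogs')); // prints [ 'cats', 'dogs' ]
console.log(dropReservedTokens('not'));           // prints []
```

Under this kind of filtering, a query consisting only of `not` has no tokens left to search for, which would explain the empty results.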

penryu commented 2 years ago

The parsing, tokenizing, and sanitizing seem to be related to the use of full-text search (MATCH) in the message database, which pre-indexes the fields by plain words, ignoring special characters.

It's still possible to query these types of fields with LIKE, which does support querying for special characters, BUT will have significantly worse performance for large conversations and messages.
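One detail if anyone experiments with LIKE: `%` and `_` are wildcards in SQL LIKE patterns, so a literal search term needs them escaped. Below is a minimal sketch of building such a pattern. The helper name `toLikePattern`, the choice of backslash as the escape character, and the table and column names in the comment are assumptions for illustration, not taken from the app.

```typescript
// Hypothetical helper: turns a literal search term into a "contains" pattern
// for SQL LIKE, escaping the wildcards % and _ and the escape character itself.
function toLikePattern(term: string): string {
  const escaped = term.replace(/[\\%_]/g, match => `\\${match}`);
  return `%${escaped}%`;
}

// Intended use with a bound parameter (table/column names are made up):
//   SELECT id FROM messages WHERE body LIKE ? ESCAPE '\'
console.log(toLikePattern('$2'));   // prints %$2%
console.log(toLikePattern('100%')); // prints %100\%%
```

Characters such as `$` pass through untouched, which is why a LIKE query can find them while the sanitized MATCH query cannot.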

This might make a good idea for a separate ("Advanced"?) search feature.

kkysen commented 2 years ago

Thanks for figuring that out! Do you know if it's possible to run a LIKE search in the app? Or is that just internal functionality right now?

kkysen commented 2 years ago

Usually a user's text messages will not take up much space, though. I can't imagine much more than a few GBs at most, but even that can still be searched with a linear search quite fast. I still think an exact linear search would be a better default, given that the amount of text being searched is rarely enormous. I just am not seeing why pre-indexing is such a benefit here, other than for search combinators like and, or, not, and near.
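To be clear about what I mean by an exact linear search, here is a toy sketch over an in-memory list. The `Message` shape and the `literalSearch` name are made up for illustration; the real messages live in the app's database, so this only shows the matching semantics, not how it would be wired in.

```typescript
// Hypothetical message shape, for illustration only.
interface Message {
  id: number;
  body: string;
}

// Scans every message and keeps those whose body contains the query verbatim.
// No tokenizing, no punctuation stripping, no reserved words.
function literalSearch(messages: readonly Message[], query: string): Message[] {
  if (query.length === 0) {
    return [];
  }
  return messages.filter(message => message.body.includes(query));
}

const sample: Message[] = [
  { id: 1, body: 'That costs $2' },
  { id: 2, body: 'I have 2 cats and not 3' },
];

console.log(literalSearch(sample, '$').map(m => m.id));   // prints [ 1 ]
console.log(literalSearch(sample, 'not').map(m => m.id)); // prints [ 2 ]
```

The cost is proportional to the total amount of text scanned, which is the trade-off against a pre-built index.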

penryu commented 2 years ago

@kkysen std::disclaimer: My answers are not authoritative. :) I just poked around to see what I could figure out.

The MATCH vs LIKE behavior seems to be purely an implementation detail of the app internals. I've only managed to confirm the above by modifying the SQL code in a dev build.

IIUC, if the server doesn't store any personal data, the underlying datastore should exist on your local filesystem. However, it appears to be encrypted on my machine. I'm not aware of how the encryption works. I'm sure it's possible to decrypt with some determination, but I think it's beyond the scope of this forum.