Open kkysen opened 2 years ago
Possibly unrelated, but most emojis can't be searched for. Whatever the reason, you can search for 🤔 but not 😏😉😳
Given the following code from ts/util/cleanSearchTerm.ts:
const withoutSpecialCharacters = lowercase.replace(
/([-!"#$%&'()*+,./\\:;<=>?@[\]^_`{|}~])/g,
' '
);
this appears to be by design. However, there have been updates to the regex in the past, so this might not be immutable. Can the dev team confirm if AND how these characters would be retained in the search query?
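For anyone following along, here is a minimal sketch of what that replace step does to a query such as $2. It reuses only the regex quoted above; the function name stripPunctuation is made up for illustration, and the real cleanSearchTerm does more than this one step.

```typescript
// Sketch only: the regex is the one quoted from ts/util/cleanSearchTerm.ts.
// `stripPunctuation` is a hypothetical name, not a function in the app.
const PUNCTUATION = /([-!"#$%&'()*+,./\\:;<=>?@[\]^_`{|}~])/g;

function stripPunctuation(query: string): string {
  // Lowercase first, then replace every listed punctuation character
  // with a single space.
  return query.toLowerCase().replace(PUNCTUATION, ' ');
}

// stripPunctuation('$2')  -> ' 2'   (the $ is gone before the search runs)
// stripPunctuation('a.b') -> 'a b'
// stripPunctuation('😏')  -> '😏'   (emoji are not in the character class)
```

So a query consisting only of $ is reduced to whitespace, which would explain the "No results" message for it.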
@Dyras As for emoji, I can confirm that all of those emoji do survive the query sanitizing method (which uses the regex above), but for whatever reason (unicode codepoint support?) only the thinking-face emoji successfully matches in the SQL backend.
@penryu, thank you for finding the root cause!
Also, can the dev team explain why these characters need to be removed from the search query for messages, especially when they are not removed from the search query for contact names?
@scottnonnenberg-signal, it appears this punctuation sanitization was added by you in this commit: https://github.com/signalapp/Signal-Desktop/blob/b3ac1373fa64117fe2a9ccfddf3712f1826c06d9/ts/util/cleanSearchTerm.ts#L1-L24
If you could explain why this is needed, and if there's a way to get around it, that would be great. Thank you!
From that cleanSearchTerm function, especially the token filtering part, it seems like the search query might be parsed and interpreted somewhere. If that's the case, is there a way to turn that off and do a pure literal search, including non-collapsed whitespace and the tokens "and", "or", "not", and "near", besides just the punctuation? It's highly non-intuitive to not be able to search for simple words like "and" and "not", especially when there's no error message explaining what happened, nothing about this in any easily searchable Signal documentation, and not even comments explaining it in the code that does this. If keeping such "smart" query functionality available is important, I'd appreciate at least a way to turn it off and do a raw, literal search, or at the very least, something explaining why that can't be done.
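To make the complaint concrete, here is a rough sketch of the kind of token filtering being described. The token list and the name dropOperatorTokens are assumptions for illustration; the actual cleanSearchTerm may filter a different set.

```typescript
// Hypothetical sketch of the token filtering discussed above.
// The list below is an assumption taken from this thread, not from the app.
const OPERATOR_TOKENS = ['and', 'or', 'not', 'near'];

function dropOperatorTokens(query: string): string {
  return query
    .toLowerCase()
    .split(/\s+/) // splitting on whitespace runs also collapses them
    .filter(token => token.length > 0 && OPERATOR_TOKENS.indexOf(token) === -1)
    .join(' ');
}

// dropOperatorTokens('cats and   dogs') -> 'cats dogs'
// dropOperatorTokens('and')             -> ''
```

With a filter like this, a search for just "and" or "not" becomes an empty query, which matches the behavior I'm seeing.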
It looks like the parsing, tokenizing, and sanitizing have to do with the use of full-text search (MATCH) in the message database, which pre-indexes the fields by plain words, ignoring special characters.
It's still possible to query these types of fields with LIKE, which does support querying for special characters, but it will have significantly worse performance for large conversations and messages.
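As a rough idea of what a LIKE-based literal search would need, here is a sketch that turns a raw search term into a LIKE pattern. The wildcards % and _ have to be escaped so they match literally. The function name toLikePattern and the table and column names in the query string are hypothetical, not taken from the app.

```typescript
// Sketch only: `toLikePattern`, `messages`, and `body` are hypothetical names.
function toLikePattern(term: string): string {
  // Escape the escape character itself plus the LIKE wildcards % and _.
  const escaped = term.replace(/[\\%_]/g, ch => '\\' + ch);
  // Surround with % so the term matches anywhere in the text.
  return '%' + escaped + '%';
}

// The pattern should be bound as a parameter, not spliced into the SQL text.
const likeQuery = "SELECT id FROM messages WHERE body LIKE ? ESCAPE '\\'";

// toLikePattern('$2')   -> '%$2%'
// toLikePattern('100%') -> '%100\\%%'  (i.e. %100\%% once the string is built)
```

Because a pattern starting with % cannot use an ordinary index, this kind of query scans every row, which is where the performance cost comes from.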
This might make a good idea for a separate ("Advanced"?) search feature.
Thanks for figuring that out! Do you know if it's possible to run a LIKE search in the app? Or is that just internal functionality right now?
Usually a user's text messages will not take up much space, though. I can't imagine much more than a few GBs at most, and even that can still be searched quite fast with a linear search. I still think an exact linear search would be a better default, given that the amount of text being searched is rarely enormous. I'm just not seeing why pre-indexing is such a benefit here, other than for search combinators like "and", "or", "not", and "near".
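What I have in mind is nothing fancier than the following sketch: an exact, case-sensitive substring scan over message bodies. The StoredMessage shape and the function name literalSearch are invented for illustration and say nothing about how the app actually stores messages.

```typescript
// Hypothetical message shape, for illustration only.
interface StoredMessage {
  id: number;
  body: string;
}

// Exact substring scan: no tokenizing, no punctuation stripping.
function literalSearch(messages: StoredMessage[], term: string): StoredMessage[] {
  if (term.length === 0) {
    return [];
  }
  return messages.filter(message => message.body.indexOf(term) !== -1);
}

const sample: StoredMessage[] = [
  { id: 1, body: 'that costs $2' },
  { id: 2, body: 'cats and dogs' },
  { id: 3, body: '2 apples' },
];

// literalSearch(sample, '$2')  matches only message 1
// literalSearch(sample, 'and') matches only message 2
```

The cost is linear in the total amount of text, which is the tradeoff against the pre-built index.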
@kkysen std::disclaimer: My answers are not authoritative. :) I just poked around to see what I could figure out.
The MATCH vs LIKE behavior seems to be purely an implementation detail of the app internals. I've only managed to confirm the above by modifying the SQL code in a dev build.
IIUC, if the server doesn't store any personal data, the underlying datastore should exist on your local filesystem. However, it appears to be encrypted on my machine. I'm not aware of how the encryption works. I'm sure it's possible to decrypt with some determination, but I think it's beyond the scope of this forum.
Bug Description
I can't search for any punctuation characters, namely ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./, in my messages. Using $ as an example, I have messages that contain things like $2, but when I try to search for $ or $2, either in all of my messages or in a specific conversation, it just says No results for "$" or No results for "$" in ${conversation_name}, respectively. I'm not sure why it refuses to textually search for punctuation characters. When I search for a string containing a $ character, such as $2, it just skips the $ and searches only for the 2. Also, it does search for these punctuation characters in contact names, but not within messages.
I also observed this on Signal Android, though I only tested $ there. I assume it works the same for all the other punctuation.
Steps to Reproduce
1. Send a message containing the string ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./.
2. For each character in ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./, search for character.
Actual Result:
It says No results for "${character}" for each character in ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./.
Expected Result:
Show all the messages that do contain these characters, which should include at the least the message containing the string ~`!@#$%^&*()_+{}|:"<>?-=[]\;',./.
Screenshots
Platform Info
Signal Version: 5.45.0 production
Operating System: Windows 11
Linked Device Version: 5.39.3 Android
Link to Debug Log