standardebooks / web

The source code for the Standard Ebooks website.
https://standardebooks.org
Creative Commons Zero v1.0 Universal
234 stars 61 forks source link

Replace PHP filtering with pure SQL #323

Closed colagrosso closed 7 months ago

colagrosso commented 8 months ago

This is coming along well and is ready for a look.

I also tried matching using a regex, but I changed it back to like because the regex wasn't buying anything. Worse: I couldn't get the regex matching/collating to match artwork with Unicode characters. like handles that just fine.

I can change it back to using a regex so you can see what I tried. It looked roughly like this:

// If query is "helen troy", match "Helen of Troy"                                                                                                                                                                                                                  
$wildcardQuery = preg_replace('| |', '.*', $query);  

// If query is "male", don't match "female"                                                                                                                                                                                                                         
$wholeTermQuery = '^' . $query . '$';       

$params[] = $wildcardQuery; // art.Name                                                                                                                                                                                                                                 
$params[] = $wildcardQuery; // art.EbookWwwFilesystemPath                                                                                                                                                                                                               
$params[] = $wildcardQuery; // a.Name                                                                                                                                                                                                                                   
$params[] = $wildcardQuery; // aan.Name                                                                                                                                                                                                                                 
$params[] = $wholeTermQuery; // t.Name    

[...]
                      and (art.Name regexp ?                                                                                                                                                                                                                                          
                            or art.EbookWwwFilesystemPath regexp ?                                                                                                                                                                                                                          
                            or a.Name regexp ?                                                                                                                                                                                                                                              
                            or aan.Name regexp ?                                                                                                                                                                                                                                            
                            or t.Name regexp ?)     
acabal commented 7 months ago

We need to use regex because we should be matching on word boundaries, \b, not start/end of the whole string. I also don't think we need to guess what the user intended like with helen of troy, that's going down a rabbit hole that can't be solved well in our simple use case. helen troy is a bad query.

Maybe the first step here is thinking about how we expect particular queries to match. If we query renaissance male, how does that search break down in the back end, especially when searching tags? Presumably an artwork would be tagged with renaissance or male but not renaissance male.

On the other hand we may have things tagged with a single tag african american so the space becomes a point of difficulty in whether we're searching tags or fulltext.

I think the approach is to break down a query by spaces, and search across the contents of each tag by each word with word boundary. So for the query renaissance male we would do something like tag.Name regexp "\b(renaissance|male)\b". This would matchmalebut notfemaleand it would also match the theoreticalrenaissance male` tag if there was one.

colagrosso commented 7 months ago

Thanks a bunch, that makes sense. I tried the query for the tags that way, and it works. I think the same logic works well for the other fields, too, so I applied it there. Sorry for the bad query example above, here's a different, better one that now works: Monet Lilies. That query doesn't work today, but it would work after this PR because the now individual words can match different fields. It returns other matches, too, but I think that's ok.

acabal commented 7 months ago

Great, looks very good, thanks!

acabal commented 7 months ago

Although now that I look again there's a small problem, when we replace non-alphanumeric characters in the query with a space. What happens when I search for the painting dressmaker's daughter? Any painting with s in it is returned!

colagrosso commented 7 months ago

Crap, sorry about that. I'll send you a fix today.