Open brierjon opened 1 year ago
There's a reason full regular expressions aren't supported by search engines such as Solr (the one used by OpenLibrary) - they are very computationally expensive, basically requiring a full scan of the data being queried, eliminating the possibility of using any indexes.
The most you're likely to get with the current tech stack is support for exposing the Solr query language https://solr.apache.org/guide/solr/latest/query-guide/edismax-query-parser.html
As an aside, your example isn't looking for em dashes, en dashes, or any other type of dash. It's just looking for the phrase with the words "1900" and "2000". If you look at the results, you'll see that it returns titles with the phrase "1900/2000" as well as those with "1900-2000". This query is the same: https://openlibrary.org/search?q=title%3A+%221900+2000%22
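A rough illustration of why the two queries behave the same: a word-based tokenizer drops the punctuation between the numbers, so several spellings reduce to the same pair of tokens. The `tokenize` helper below is a simplified stand-in written for this comment, not Solr's actual analyzer chain, which depends on how the field is configured.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens, discarding punctuation.

    Simplified stand-in for a search analyzer; not Solr's real tokenizer.
    """
    return re.findall(r"\w+", text.lower())

# Hyphen, slash, em dash (U+2014), and space all act as separators here.
titles = ["1900-2000", "1900/2000", "1900\u20142000", "1900 2000"]
token_lists = [tokenize(t) for t in titles]

for title, tokens in zip(titles, token_lists):
    print(f"{title!r} -> {tokens}")
```

Under this assumption every variant becomes the tokens `1900` and `2000`, which is why a phrase query cannot tell the dash types apart.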
Solr does actually support regular expressions out of the box, but they're currently disabled because they were interfering with queries like key:/works/OL1W, since Solr treats anything that starts with / as a regex.
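For illustration, one common way to keep a literal slash from being read as a regex delimiter in Lucene-style query syntax is to backslash-escape it. The `escape_slashes` helper below is hypothetical and only a sketch; it is not necessarily how Open Library handles these queries, and it does not try to detect slashes that are already escaped.

```python
def escape_slashes(query: str) -> str:
    """Backslash-escape forward slashes so they are read literally.

    Hypothetical helper: assumes the input contains no pre-escaped slashes.
    """
    return query.replace("/", "\\/")

escaped = escape_slashes("key:/works/OL1W")
print(escaped)  # key:\/works\/OL1W
```

Quoting the value, as in key:"/works/OL1W", is another option for the same problem.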
Currently, as far as I can tell, search uses string matching and limited patterns but doesn't support regular expression (regex) search in the advanced search or the API. If this is supported, it is not documented at https://openlibrary.org/dev/docs/api/search. The limited https://openlibrary.org/search/howto shows some boolean operations and any-match searches, but I'm not seeing a full regex-type search.
Looking for titles with date ranges can be done with an explicit search for the range "1900-2000". Based on the string match, it returns 748 results, which include matches with both en and em dashes, e.g. "1900—2000". The majority of the results do not have a timeline matching this range, and some just have "21st century". This identifies a possible set of works where a review would improve the timeline field; see: https://openlibrary.org/search?mode=everything&page=5&q=title%3A+%221900-2000%22
There is limited pattern matching, but it doesn't seem to support full regex, which would allow searching for any pattern, e.g. all year ranges made up of a 4-digit year, an em or en dash, and another 4-digit year, to understand the full scope of the issue.
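As a sketch of the kind of pattern meant here, the following Python regex matches two 4-digit years joined by a hyphen, en dash, or em dash. The names `YEAR_RANGE` and `find_year_ranges` and the sample titles are made up for illustration; this runs locally and is not something the search API accepts today.

```python
import re

# Two 4-digit years joined by a hyphen, en dash (U+2013), or em dash (U+2014).
# The lookarounds stop the pattern from matching inside longer digit runs.
YEAR_RANGE = re.compile(r"(?<!\d)(\d{4})[-\u2013\u2014](\d{4})(?!\d)")

def find_year_ranges(title: str) -> list[tuple[int, int]]:
    """Return every (start_year, end_year) pair found in the title."""
    return [(int(start), int(end)) for start, end in YEAR_RANGE.findall(title)]

samples = [
    "Art in Europe, 1900-2000",
    "Poetry 1900\u20142000: an anthology",
    "Letters 1914\u20131918",
    "The 21st century",
]
results = {s: find_year_ranges(s) for s in samples}

for title, ranges in results.items():
    print(f"{title!r}: {ranges}")
```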
Describe the problem that you'd like solved
Regex support in the API and advanced search would help in various search tasks, such as looking for data to clean up.
Working with the data dumps is a workaround for now, but this static data limits collaboration in various ways.
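The data dump workaround might look roughly like the sketch below. It assumes each dump line is tab-separated with the JSON record in the fifth column (type, key, revision, last modified, JSON); check the dump documentation before relying on that layout. The function name and the sample lines are invented for illustration.

```python
import json
import re
from typing import Iterable, Iterator

YEAR_RANGE = re.compile(r"(?<!\d)\d{4}[-\u2013\u2014]\d{4}(?!\d)")

def works_with_year_range(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (key, title) for records whose title contains a year range.

    Assumes tab-separated lines with the JSON record in the fifth column.
    """
    for line in lines:
        record = json.loads(line.rstrip("\n").split("\t", 4)[-1])
        title = record.get("title") or ""
        if YEAR_RANGE.search(title):
            yield record.get("key", ""), title

# Tiny in-memory stand-in for a dump file (made-up records).
sample_lines = [
    '/type/work\t/works/OL1W\t1\t2020-01-01T00:00:00\t{"key": "/works/OL1W", "title": "Art 1900-2000"}\n',
    '/type/work\t/works/OL2W\t1\t2020-01-01T00:00:00\t{"key": "/works/OL2W", "title": "Gardening basics"}\n',
    '/type/work\t/works/OL3W\t1\t2020-01-01T00:00:00\t{"key": "/works/OL3W"}\n',
]
matches = list(works_with_year_range(sample_lines))
print(matches)
```

With a real dump, the same function can be fed a file object opened with `gzip.open(path, "rt", encoding="utf-8")`, so the file is streamed line by line rather than loaded into memory.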
Proposal & Constraints
In the search API, add a parameter to indicate that the search field uses a regex rather than a plain string.
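A sketch of what such a request could look like from a client. The `regex` parameter is purely hypothetical and does not exist in the current API; `build_search_url` is an illustrative helper, and the exact parameter name and regex syntax would be up to the maintainers.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://openlibrary.org/search.json"

def build_search_url(query: str, regex: bool = False) -> str:
    """Build a search URL, optionally flagging the query as a regex."""
    params = {"q": query}
    if regex:
        # Hypothetical parameter proposed in this issue; not in the current API.
        params["regex"] = "true"
    return f"{SEARCH_URL}?{urlencode(params)}"

plain_url = build_search_url('title:"1900-2000"')
regex_url = build_search_url(r"title:\d{4}-\d{4}", regex=True)
print(plain_url)
print(regex_url)
```

An explicit flag keeps existing string queries unchanged, so a query containing regex-like characters is only interpreted as a pattern when the caller asks for it.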
In the advanced search UI, add a checkbox to the right of the search field with the text "search by regex" and a ? icon with hover text that explains the format in which a regular expression can be written and links to a help page explaining the search. The hover text could also include a link to a regex generator for testing the patterns the regex should match.
Inspiration for the generator's features might come from https://regex101.com/ and the many other interactive regex generators, though these general tools may be too complex and broad in scope to use directly. Useful features include interactively building and testing a regex against the expected strings, an explanation of the pattern notation, and an explanation of each match.
Additional context
The aim is to distinguish text and limited pattern search from full regex search. If all regex patterns are supported, that isn't documented in the API or search pages.
It would be useful to combine regex with boolean expressions for a query such as "titles with a date range that do not have a timeline matching the same date range". An example search might look like: title:regex(\s\d{4}-\d{4}\s|\s\d{4}\—\d{4}\s) AND NOT regex(\s\d{4}-\d{4}\s|\s\d{4}\—\d{4}\s|)
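The same "title matches AND NOT timeline matches" logic can be sketched locally in Python. This assumes the timeline lives in a list field named `subject_times`, which is an assumption about the record layout; `needs_timeline_review` and the sample records are invented for illustration.

```python
import re

# Captures both years so ranges compare equal regardless of dash type.
YEAR_RANGE = re.compile(r"(?<!\d)(\d{4})[-\u2013\u2014](\d{4})(?!\d)")

def year_ranges(text: str) -> set[tuple[str, str]]:
    """Return the set of (start, end) year pairs found in the text."""
    return set(YEAR_RANGE.findall(text))

def needs_timeline_review(record: dict) -> bool:
    """True when the title has a year range that no timeline entry repeats."""
    title_ranges = year_ranges(record.get("title") or "")
    timeline_ranges: set[tuple[str, str]] = set()
    for entry in record.get("subject_times") or []:  # assumed field name
        timeline_ranges |= year_ranges(entry)
    return bool(title_ranges - timeline_ranges)

records = [
    {"key": "/works/OL1W", "title": "Art 1900-2000", "subject_times": ["21st century"]},
    {"key": "/works/OL2W", "title": "Art 1900\u20142000", "subject_times": ["1900-2000"]},
    {"key": "/works/OL3W", "title": "Gardening basics"},
]
flagged = [r["key"] for r in records if needs_timeline_review(r)]
print(flagged)
```

Comparing the captured years rather than the raw matched text means a title using an em dash still counts as covered by a timeline entry that uses a hyphen.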
Ideally, search would offer general support for a regex function. If server-side index caching requires coordination, even a limited, reviewed regex submission process would help: a list of approved regex formats would make certain repetitive searches more functional. Approved regexes could even be exposed as a regex facet in advanced search for reuse by others who are not familiar with regex but could benefit from the search. This could function somewhat like the Wikidata Query examples, with a "request query" form where users describe in plain text the search they need a regex written for, e.g. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
Running regex against the live database rather than a data dump would allow for collaborative editing. A saved, linkable regex search could be shared, reported, and tracked to highlight the issues it finds. Regex search would also support sharing potential review tasks for flagging and continued monitoring, as suggested in #7627.
Stakeholders