kanerv / TuukkaSaanaKanerva

Course project for building NLP applications

Wildcard searches #25

Closed kanerv closed 3 years ago

kanerv commented 3 years ago

d. Wildcard searches: Let the users search on incomplete terms, such as hous* (easiest) or *ing (similar to previous case) or h*ing (hardest). Read Chapter 3 of the book to learn more about this topic.
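As a rough illustration of the three wildcard shapes in the task (trailing, leading, and middle `*`), here is a minimal sketch that translates a wildcard pattern into a regular expression and scans a vocabulary with it. The function names and the toy vocabulary are made up for this example and are not taken from the project; it is a plain linear scan, not an index structure.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a wildcard pattern such as 'h*ing' into a compiled regex."""
    # Escape each literal piece so that only '*' acts as a wildcard.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))

def expand_wildcard(pattern, vocabulary):
    """Return the vocabulary terms matching the whole pattern, sorted."""
    regex = wildcard_to_regex(pattern)
    return sorted(term for term in vocabulary if regex.fullmatch(term))

# Toy vocabulary, invented for illustration only.
vocabulary = {"house", "housing", "houses", "hiking", "sing", "horse"}

print(expand_wildcard("hous*", vocabulary))  # trailing wildcard
print(expand_wildcard("*ing", vocabulary))   # leading wildcard
print(expand_wildcard("h*ing", vocabulary))  # wildcard in the middle
```

Using `fullmatch` matters here: with a plain `search`, `hous*` would also match any term that merely contains "hous" somewhere inside it.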

kanerv commented 3 years ago

Since I cannot get the module downloaded to run our stemmer, I decided to attempt the wildcard searches.

Our program now understands the easiest wildcard type (hous*). It generates a list of matching query terms from the documents and then runs a separate search with each one. This is probably not the ideal way to do it, since I think it would be nicer to include the whole set of wildcard matches in a single search, but I'm hopelessly late with this week's assignment. I'm running out of time because of the technical difficulties I've had this week, and I'm sorry for that. I'm honestly dying to see the stemmer! 😢

Note that our program does not understand multi-word searches, so if you search with 'anarch*', for example, the program won't find headers or snippets to print for 'anarcho-syndicalists' or even 'anarchism's'.

These are details that can be improved, but for now, the very very basic form of wildcard search should work.
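For anyone reading along, the prefix approach described above could look roughly like the sketch below. It is not our actual code: `build_index`, `prefix_search` and the sample documents are hypothetical, and the tokenizer is a plain whitespace split. It also shows the "nicer" variant, where the expanded terms are combined into one result set instead of being searched one at a time.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased whitespace token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def prefix_search(query, index):
    """Expand a query like 'hous*' into index terms and union their postings."""
    prefix = query.rstrip("*")
    terms = sorted(term for term in index if term.startswith(prefix))
    hits = set()
    for term in terms:
        hits |= index[term]  # one combined result instead of one search per term
    return terms, sorted(hits)

# Sample documents, invented for illustration only.
documents = [
    "the house is old",
    "housing prices rise",
    "a horse runs",
]
index = build_index(documents)
terms, hits = prefix_search("hous*", index)
print(terms, hits)
```

Because the tokens come from a whitespace split, this sketch has its own blind spots with punctuation and hyphens, so the tokenizer and the wildcard expansion need to agree on what counts as a term.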

TuukkaOT commented 3 years ago

No need to fret, Kanerva, this looks really cool! Is the missing multi-word search the reason that searching for anarch* gives "Search term not found"? For example, with anarchist feminism it finds the term, but it has no index: Query: 'anarcha-femin_s' Search term not found. No Matching doc.

kanerv commented 3 years ago

I realised that our program won't find any matches even if you search for an exact match with hyphenated words. I think the issue must be inside the test_query(query) function and not because of the lack of a multi-word search function. I wonder if other groups have similar issues...

kanerv commented 3 years ago

I added the regex we created for the week 4 program to the older project as well, which fixes the issue with hyphenated words there. I think I'll leave the wildcards as they are here. I know the solution in the week 4 program is smarter, but as I said, we already have that code up to date in a more recent program.
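To illustrate the kind of fix meant here, the sketch below shows a token pattern that keeps hyphens and apostrophes inside a word. This is an assumption about the general idea, not the actual week 4 regex; `TOKEN_PATTERN` and `tokenize` are hypothetical names.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# more runs that are each joined by a single hyphen or apostrophe.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*")

def tokenize(text):
    """Lowercase the text and keep hyphenated and possessive forms as one token."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "Anarcho-syndicalists and anarchism's critics"

# A plain \w+ pattern splits at the hyphen and the apostrophe.
print(re.findall(r"\w+", sample.lower()))

# The hyphen-aware pattern keeps each form whole.
print(tokenize(sample))
```

If the index is built with the plain pattern, 'anarcho-syndicalists' never exists as a single term, which would explain why even an exact search for a hyphenated word finds nothing.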