searchmysite / searchmysite.net

searchmysite.net is an open source search engine and search as a service
GNU Affero General Public License v3.0

Automate the category detection, potentially eliminate the need for manual review #87

Open m-i-l opened 1 year ago

m-i-l commented 1 year ago

One of the strengths of searchmysite.net is the manual review process. Indeed, the first bullet point at https://searchmysite.net/pages/about/ is "Indexes only user-submitted and moderated sites". The review is mainly to catch any sites which obviously break the Terms of Use, and it provides some basic quality control which does help the search results (given garbage in = garbage out). To be clear, the only manual review is an approve/reject decision on Basic Listings when they are submitted and (for approved sites) every year after approval. On an average day there are 4-5 sites to review (including annual reviews), which takes just 1-2 minutes a day, and there is a web interface so it can be done on a phone in otherwise dead time.

Unfortunately, while the manual review process is one of searchmysite.net's strengths, it is also one of its weaknesses. After over 2.5 years, just over 1,500 sites have been submitted. If there are 1.5 million actively maintained personal websites that could be added, searchmysite.net would currently have only around 0.1% of the total. The problem is two-fold. First, there have not been many submissions (it is mainly people submitting their own sites rather than people submitting sites they like). Second, even if there were more submissions, there have been no volunteers for additional moderators, and manual review of 10s or 100s of sites a day by one person wouldn't be scalable.

The proposal here is to automate the category detection, potentially also including detecting sites in the "reject" category, as a precursor to eliminating the need for manual review, and allowing the bulk loading of large lists of personal websites, e.g. from https://jessimekirk.com/blog/hn_users_links/, https://nownownow.com/ , https://personalsit.es/ , https://blogroll.org/ , https://blogs.hn/ , https://ooh.directory/ etc.

There might be a case for expanding the number of categories too, e.g. to include forum sites and fan sites, although given that the focus of searchmysite.net is currently narrowing to personal sites, there might not be much benefit at this stage.

If machine learning libraries are being set up for #84, it is possible these could be used here, perhaps with the already manually classified sites (including rejected sites) as training data.
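As a rough illustration of the training-data idea, here is a minimal sketch of a multinomial naive Bayes text classifier using only the Python standard library. It is not how searchmysite.net works today, and a real ML library would likely replace it. The category labels ("personal", "reject"), the function names and the four training texts are all made up for the example; real training data would come from the manually reviewed sites.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy examples; stand-ins for text from manually reviewed sites.
TRAINING = [
    ("my blog about hiking photography and notes on books i read", "personal"),
    ("about me now page weeknotes and posts on programming", "personal"),
    ("buy cheap pills casino bonus best deals click here", "reject"),
    ("casino betting bonus offers buy now cheap deals", "reject"),
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Count documents and words per category from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in examples:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[category][word]
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[category] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
```

For example, `classify(model, "cheap casino bonus deals")` picks the "reject" label with this toy data. With only a handful of examples the result means very little; the point is just that already-reviewed sites map naturally onto (text, category) pairs.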

m-i-l commented 5 months ago

Not suggesting this is the answer, but jotting it down here for future investigation: there is the Louvain community detection algorithm (see https://en.wikipedia.org/wiki/Louvain_method). It might need to be combined with the (now already deployed) local LLM to generate labels.
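For context, the Louvain method works by greedily maximising the modularity of a graph partition. Below is a small sketch that only computes that modularity score for a given partition; it is not the Louvain algorithm itself. It assumes an undirected, unweighted graph, and the idea that nodes would be sites and edges links between them is an assumption for illustration, as are the function name and the toy graph.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    degree_sum = defaultdict(int)
    for node, k in degree.items():
        degree_sum[community_of[node]] += k
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Hypothetical toy graph: two triangles joined by a single edge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(EDGES, split)    # each triangle is its own community
q_merged = modularity(EDGES, merged)  # everything in one community
```

Here the two-triangle split scores higher than lumping everything together, which is the kind of grouping Louvain would search for automatically. The resulting communities are unlabelled, hence the thought about using the LLM to generate labels.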