marsara9 / lemmy-search

An enhanced search engine just for Lemmy/Fediverse
https://www.search-lemmy.com
GNU Affero General Public License v3.0

[0.5.0] The crawler should also crawl and index comments. #36

Open marsara9 opened 1 year ago

marsara9 commented 1 year ago

Currently the crawler only indexes posts, which it fetches via /api/v3/post/list. It should be updated to also crawl the comments on each fetched post via /api/v3/comment/list. That way, when a user searches for something, not only is the post title/body scanned but so are all of the comments.
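A minimal sketch of that two-step crawl, not the project's actual crawler. It assumes the responses wrap their items in "posts" and "comments" arrays, that each post item carries a nested post.id, and that /api/v3/comment/list accepts post_id, limit and page. The fetch_json callable and both helper names are hypothetical; the HTTP layer is injected so the paging logic can be run without a network.

```python
from typing import Any, Callable, Iterator

# Hypothetical signature: takes an API path and query parameters, returns decoded JSON.
FetchJson = Callable[[str, dict[str, Any]], dict[str, Any]]


def iter_pages(
    fetch_json: FetchJson,
    path: str,
    key: str,
    params: dict[str, Any],
    limit: int = 50,
) -> Iterator[dict[str, Any]]:
    """Yield every item found under `key`, requesting pages until one is empty."""
    page = 1
    while True:
        data = fetch_json(path, {**params, "limit": limit, "page": page})
        items = data.get(key, [])
        if not items:
            return
        yield from items
        page += 1


def crawl_posts_with_comments(fetch_json: FetchJson, limit: int = 50):
    """Yield (post_view, comment_views) pairs: each post, then its comments."""
    post_params = {"type_": "All", "sort": "Old"}
    for post_view in iter_pages(fetch_json, "/api/v3/post/list", "posts", post_params, limit):
        # Assumption: the numeric id of the post lives at post_view["post"]["id"].
        post_id = post_view["post"]["id"]
        comments = list(
            iter_pages(fetch_json, "/api/v3/comment/list", "comments", {"post_id": post_id}, limit)
        )
        yield post_view, comments
```

Each post costs at least one extra request for its comments (plus one more to detect the empty last page), which is where the additional crawl time would come from.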

marsara9 commented 1 year ago

While I'd love to include comment data, I'm noticing some major blockers with my current approach to crawling Lemmy.

Currently I call /api/v3/post/list?type_=All&sort=Old&limit=50&page=.... It appears that using type_=All is incredibly slow on larger instances. Lemmy.world, for example, takes around 8-9 seconds when set to All vs. 1-2 seconds when set to Local.
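For reference, the only difference between the slow and the fast request is the type_ parameter. A small sketch of building that URL, where post_list_url is a hypothetical helper name and the https scheme is an assumption:

```python
from urllib.parse import urlencode


def post_list_url(instance: str, page: int, listing_type: str = "Local", limit: int = 50) -> str:
    """Build the post-list URL; listing_type is "All" or "Local" as in the query above."""
    query = urlencode({"type_": listing_type, "sort": "Old", "limit": limit, "page": page})
    return f"https://{instance}/api/v3/post/list?{query}"
```

Calling post_list_url("lemmy.world", 1, "All") gives the slower federated listing, while the default "Local" restricts the request to posts hosted on that instance.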

Next, assuming comments have the same or a similar issue, we could end up doubling or tripling the time required to index a site.

Luckily, I already have a plan to fix this, but it hinges on posts having a universal ID across the fediverse. That way I could index only each instance's local posts and use that universal ID to create links for other instances. I just have no idea when, or if, that will be resolved.

Nutomic commented 1 year ago

@marsara9 All posts and comments already have a globally unique ID, which is the ap_id value (ActivityPub ID). It's the URL where that object can be viewed on its home instance. It's preferable to use this link when opening a search result. The same link can also be used to fetch a remote post to your instance by pasting it into the search field.
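One way this could look on the crawler side, as a rough sketch only: crawl each instance's local posts and key the index on ap_id, so a post is stored once no matter how many instances federate it. The function name and the index layout are hypothetical, and the sketch assumes each post item exposes post.ap_id, post.local and post.name.

```python
from typing import Any


def merge_local_posts(
    index: dict[str, dict[str, Any]],
    instance: str,
    post_views: list[dict[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Add an instance's local posts to the index, keyed by their ap_id."""
    for view in post_views:
        post = view["post"]
        # Skip federated copies; the home instance will contribute the original.
        if not post.get("local", False):
            continue
        # ap_id doubles as the canonical link to show in search results.
        index[post["ap_id"]] = {"home_instance": instance, "name": post.get("name", "")}
    return index
```

Because the key is the post's home URL, a search result can link straight to it, or the same URL can be pasted into another instance's search field to fetch the post there.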