raek opened 1 year ago
The responsible line is in `GeminiCrawler::crawlPage`: `link_url = linkCompose(url, link);`
Here, `url` should be `crawl_url`. That seems to be the simple part. But which of the other usages of `url` should be changed similarly? I think all statements after the point where `crawl_url` is defined in the function should be reviewed.
I was going to write a fix, but I think I need to step back and understand the design first -- the `pages` table in particular. To me it seems like the code currently tries to treat the redirecting page and the target page as the same page (there is only one row in the database, at least). Which URL should be "the" URL?
The more I think about it, the more I feel it would make sense to not handle redirects in a loop in the `crawlPage` function. Here's an idea for an alternative design:
When receiving a redirect, handle that response as its own page. Update the `pages` row with suitable information (e.g. `last_indexed_at`) about the redirect itself. Many of the columns would need to be empty (e.g. `content_type`). But the redirect target URL could perhaps be treated as if it were a link. Since a new link was discovered, add the link and the target URL to the `links` and `pages` tables. Maybe the redirecting "page" and the new "link" could be marked in some way to indicate that they are part of a redirect. Later, the crawler will call `crawlPage` again, but now with the redirect target URL as the argument.
Some reasons I think this design makes sense:
In other words, I think it makes better sense to index them separately.
What do you think?
edit: One possible downside with this design is that it is hard to enforce a maximum number of redirects. But is this any different from, say, a capsule that dynamically generates thousands of pages with their own URLs and links between them? Maybe redirect behavior could be caught by other, more general mechanisms (if such exist; I haven't read the code), such as a maximum number of URLs per host, or similar.
Hi,
I saw your email; sorry I hadn't replied yet. I'm traveling this weekend. As you said, fixing this could be a little tricky. I'm still debating with myself how I should solve this.
As for other solutions: I try to block orbits (webrings) since they interfere with the SALSA ranking algorithm while not contributing much useful information about the structure of Geminispace (I might be wrong on this in general, but Low Earth Orbit does interfere severely). But I also want a more robust method for generic redirect handling.
I'll think more about this tomorrow and share my thoughts with you.
No worries! It wasn't my intention to "escalate". I just didn't know that TLGS was open source at first. I wrote the issue when I found out and then I had already sent the mail. This is not urgent at all!
Hmm... Could you explain what it is about orbits that causes interference? I host orbits on my capsule (well, they only have three users in total, and one is me, so they haven't taken off. Yet.), and I would like to be a good Geminispace citizen. I'd be happy to adjust if there is something I can do from my end. (For example, I could add the redirection links to my robots.txt file.)
I also have some thoughts about why orbits can be useful (and how they ought to be implemented), but I'd like to hear your point of view first.
To clarify, I don't block servers that host orbits. But I do block links to orbit endpoints (e.g. `gemini://example.com/next`). So orbits don't get included in the SALSA ranking process. Your capsule will still be included even if I block the orbit.
The issue is that TLGS's ranking algorithm will unreasonably favor hosts with lots of links to each other. This is called the Tightly Knit Community (TKC) effect. Both HITS and SALSA are vulnerable; SALSA, the current default, is less vulnerable, but still affected. PageRank (IIRC used by geminispace.info/GUS) has the same problem. Normally a page's score should grow approximately linearly with the number of links referencing it. But under TKC, a small set of pages can glob up >50% of the score with just a few links among them.
From empirical evidence: before I blocked LEO, searching for "gemini" on TLGS would yield a top 10 containing 5 LEO capsules, ranked alongside geminispace.info, gemini.circumlunar.space, medusae.space, etc., which isn't what most people are looking for when searching that term.
You don't need to do anything. robots.txt should be used to block crawlers from pages that you don't want crawled. This is a search engine problem induced by a deficiency in the ranking algorithm. I'll do my best to keep things running and serve quality results.
When the crawler processes a URL that results in a redirect, and the target page contains relative links, those relative links should be resolved using the target URL as a base, not the redirecting URL.
Example:

- `gemini://raek.se/` links to `gemini://raek.se/orbits/omloppsbanan/next?gemini%3A%2F%2Fraek.se%2F`
- `gemini://raek.se/orbits/omloppsbanan/next?gemini%3A%2F%2Fraek.se%2F` redirects to `gemini://hanicef.me/`
- `gemini://hanicef.me/` links to `/about`

Then the crawler should resolve the last link into `gemini://hanicef.me/about`, but currently it incorrectly resolves it into `gemini://raek.se/about`.