marty1885 / tlgs

"Totally Legit" Gemini Search - Open source search engine for the Gemini protocol
https://tlgs.one
MIT License

Crawler resolves relative URLs incorrectly on pages reached from a redirect #6

Open raek opened 1 year ago

raek commented 1 year ago

When the crawler processes a URL that results in a redirect, and the target page contains relative links, those relative links should be resolved using the target URL as a base, not the redirecting URL.

Example:

Then the crawler should resolve the last link into gemini://hanicef.me/about, but currently it incorrectly resolves it into gemini://raek.se/about.
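To make the expected behaviour concrete, here is a minimal sketch (not TLGS code). The resolver only handles the simple path-relative case, and the two base URLs are stand-ins for the actual URLs in the example:

```cpp
#include <iostream>
#include <string>

// Resolve a path-relative reference against a base URL by replacing the last
// path segment of the base. (Real URL resolution per RFC 3986 handles many
// more cases; this is only an illustration.)
std::string resolve(const std::string& base, const std::string& ref) {
    auto slash = base.find_last_of('/');
    return base.substr(0, slash + 1) + ref;
}

int main() {
    // Hypothetical scenario matching the example: a URL on raek.se redirects
    // to a page on hanicef.me whose gemtext contains the relative link "about".
    std::string redirecting_url = "gemini://raek.se/";     // URL the crawl started from
    std::string final_url       = "gemini://hanicef.me/";  // URL after following the redirect

    std::cout << resolve(redirecting_url, "about") << "\n"; // gemini://raek.se/about  (current, wrong)
    std::cout << resolve(final_url, "about")       << "\n"; // gemini://hanicef.me/about (expected)
}
```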

raek commented 1 year ago

The responsible line is in GeminiCrawler::crawlPage:

```cpp
link_url = linkCompose(url, link);
```

Here, url should be crawl_url. That seems to be the simple part. But which of the other usages of url should be changed similarly? I think all statements after crawl_url is defined in the function should be reviewed.
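For that single line, the change would look something like this (the surrounding context is paraphrased, not the actual TLGS source):

```cpp
// before: relative links are resolved against the URL the crawl started from
link_url = linkCompose(url, link);

// after: resolve against crawl_url, the URL the content was actually fetched from
link_url = linkCompose(crawl_url, link);
```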

I was going to write a fix, but I think I need to step back and understand the design first -- the pages table in particular. To me it seems like the code currently tries to treat the redirecting page and the target page as the same page (there is only one row in the database, at least). Which URL should be "the" URL?

The more I think about it, the more I feel it would make sense to not handle redirects in a loop in the crawlPage function. Here's an idea for an alternative design:

When receiving a redirect, handle that response as its own page. Update the pages row with suitable information (eg. last_indexed_at) about the redirect itself. Many of the columns would need to be empty (eg. content_type). But the redirect target URL could perhaps be treated as if it were a link. Since a new link was discovered, add the link and the target URL to the links and pages tables. Maybe the redirecting "page" and the new "link" could be marked in some way to indicate that they are a part of a redirect. Later, the crawler will call crawlPage again, but now with the redirect target URL as the argument.
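As a rough sketch of the control flow I have in mind (all types and helper names here are made up, not actual TLGS code or schema; only the shape of the logic matters):

```cpp
#include <iostream>
#include <string>

struct GeminiResponse {
    int status;        // 30-39 are redirects in Gemini
    std::string meta;  // for redirects, the target URL (possibly relative)
};

// Stand-ins for the real database / queue operations.
void markPageAsRedirect(const std::string& url) {
    std::cout << "pages: mark " << url << " as a redirect, update last_indexed_at\n";
}
void addLink(const std::string& from, const std::string& to, bool is_redirect) {
    std::cout << "links: " << from << " -> " << to << (is_redirect ? " (redirect)" : "") << "\n";
}
void enqueueUrl(const std::string& url) {
    std::cout << "queue: add " << url << "\n";
}
// Simplified resolver; the real linkCompose handles full URL resolution.
std::string resolveAgainst(const std::string& base, const std::string& ref) {
    if (ref.rfind("gemini://", 0) == 0) return ref;           // already absolute
    return base.substr(0, base.find_last_of('/') + 1) + ref;  // path-relative only
}

void handleResponse(const std::string& crawl_url, const GeminiResponse& resp) {
    if (resp.status >= 30 && resp.status < 40) {
        // Index the redirect itself as its own page; content columns stay empty.
        markPageAsRedirect(crawl_url);
        // Treat the redirect target as if it were a link on the page, so the
        // normal crawl loop will visit it later with the target as its own URL.
        std::string target = resolveAgainst(crawl_url, resp.meta);
        addLink(crawl_url, target, /*is_redirect=*/true);
        enqueueUrl(target);
        return;
    }
    // ... otherwise: parse the body and resolve its links against crawl_url ...
}

int main() {
    handleResponse("gemini://raek.se/", GeminiResponse{31, "gemini://hanicef.me/"});
}
```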

Some reasons I think this design makes sense:

In other words, I think it makes better sense to index them separately.

What do you think?

edit: One possible downside with this design is that it is hard to enforce a maximum number of redirects. But is this any different from, say, a capsule that dynamically generates thousands of pages with their own URLs and links in between them? Maybe redirect behavior could be caught by other more general mechanisms (if such exist, I haven't read the code), such as a maximum number of URLs per host, or similar.
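For what it's worth, a per-host cap could be as simple as something like this (made-up names, purely illustrative; the budget of 2 is arbitrary):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

struct HostBudget {
    std::size_t max_urls_per_host;
    std::unordered_map<std::string, std::size_t> seen;

    // Returns true (and counts the URL) while the host is still under budget.
    bool allow(const std::string& host) {
        auto& count = seen[host];
        if (count >= max_urls_per_host) return false;
        ++count;
        return true;
    }
};

int main() {
    HostBudget budget{2, {}};
    assert(budget.allow("hanicef.me"));
    assert(budget.allow("hanicef.me"));
    assert(!budget.allow("hanicef.me"));  // third URL on the same host is rejected
}
```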

marty1885 commented 1 year ago

Hi,

I saw your email, and sorry I hadn't replied yet; I'm traveling this weekend. As you said, fixing this could be a little tricky, and I'm still debating how I should solve it.

In terms of other solutions: I try to block orbits (web rings), as they interfere with the SALSA ranking algorithm while not contributing much useful information about the structure of Geminispace (I might be wrong on this in general, but Low Earth Orbit does interfere severely). But I also want a more robust method for generic redirect handling.

I'll think more about this tomorrow and share my thoughts with you.

raek commented 1 year ago

No worries! It wasn't my intention to "escalate". I just didn't know at first that TLGS was open source; I wrote the issue when I found out, by which point I had already sent the mail. This is not urgent at all!

Hmm... Could you explain what it is about orbits that causes the interference? I host orbits on my capsule (well, they only have three users in total, and one is me, so they haven't taken off. yet.), and I would like to be a good Geminispace citizen. I'd be happy to adjust if there is something I can do from my end. (For example, I could add the redirection links to my robots.txt file.)

I also have some thoughts about why orbits can be useful (and how they ought to be implemented), but I'd like to hear your point of view first.

marty1885 commented 1 year ago

To clarify, I don't block servers that host orbits, but I do block links to orbit endpoints (e.g. gemini://example.com/next), so orbits don't get included in the SALSA ranking process. Your capsule will still be included even if I block the orbit.

The issue is that TLGS's ranking algorithm will unreasonably favor hosts with lots of links to each other. This is called the Tightly Knit Community (TKC) effect. Both HITS and SALSA are vulnerable; SALSA, the current default, is less vulnerable, but still affected. PageRank (IIRC used by geminispace.info/GUS) has the same problem. Usually the ranked score should grow approximately linearly with the number of links referencing your page, but under TKC a small set of pages can gobble up >50% of the score with just a few links among them.
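To illustrate the effect, here is a toy HITS power iteration on a tiny made-up graph (not TLGS code): three pages that densely link to each other end up with nearly all of the authority mass, while a chain of ordinary pages converges to almost zero.

```cpp
#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    constexpr int n = 6;
    // links[i] = pages that page i links to (made-up graph).
    // Pages 0-2 form a small clique; pages 3-5 are a chain of ordinary pages.
    std::array<std::vector<int>, n> links = {{
        {1, 2},  // 0 -> 1, 2
        {0, 2},  // 1 -> 0, 2
        {0, 1},  // 2 -> 0, 1
        {4},     // 3 -> 4
        {5},     // 4 -> 5
        {}       // 5
    }};

    std::array<double, n> hub, auth;
    hub.fill(1.0);
    auth.fill(1.0);

    for (int iter = 0; iter < 50; ++iter) {
        // authority(p) = sum of hub scores of pages linking to p
        std::array<double, n> new_auth{};
        for (int i = 0; i < n; ++i)
            for (int j : links[i])
                new_auth[j] += hub[i];
        // hub(p) = sum of authority scores of pages p links to
        std::array<double, n> new_hub{};
        for (int i = 0; i < n; ++i)
            for (int j : links[i])
                new_hub[i] += new_auth[j];
        // normalize so the scores stay bounded
        double na = 0, nh = 0;
        for (int i = 0; i < n; ++i) {
            na += new_auth[i] * new_auth[i];
            nh += new_hub[i] * new_hub[i];
        }
        na = std::sqrt(na);
        nh = std::sqrt(nh);
        for (int i = 0; i < n; ++i) {
            auth[i] = new_auth[i] / na;
            hub[i] = new_hub[i] / nh;
        }
    }

    // The three clique pages end up with nearly all of the authority mass,
    // even though each of them only receives two links.
    for (int i = 0; i < n; ++i)
        std::printf("page %d: authority %.3f\n", i, auth[i]);
}
```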

From empirical evidence: before I blocked LEO, searching for "gemini" on TLGS would return five LEO capsules in the top 10 results, ranking alongside geminispace.info, gemini.circumlunar.space, medusae.space, etc., which isn't what most people are looking for when searching for that term.

You don't need to do anything. robots.txt should be used to block crawlers from pages that you don't want crawled; this is a search engine problem caused by a deficiency in the ranking algorithm. I'll do my best to keep things running and serve quality results.