typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Improve stop_urls documentation #38

Closed Krinkle closed 1 year ago

Krinkle commented 1 year ago

Description

  1. Key is end_urls?

The page at https://typesense.org/docs/guide/docsearch.html#create-a-docsearch-scraper-config-file mentions end_urls, however this phrase does not appear in this repository. I believe that is meant to be stop_urls, perhaps it got renamed or was misspelled.

  2. Supported value?

It took me a while to figure out what kind of value this takes and how it behaves. For example, does it have to be an exact URL including protocol and identical path, or can it be a path-only URL, or even a path prefix, or perhaps an arbitrary substring or a regex? It appears the answer is all of the above, because the value is interpreted as a regex, and most freeform strings effectively double as substring matchers when taken as a regex. That holds so long as no special characters appear besides a . (dot), which mostly works as expected even if left unescaped.

Searching the repo with GitHub search doesn't lead to where this configuration variable is "really" used, since GitHub's new search engine seems to limit results to one match block per file. I believe the place where the stop_urls configuration is actually applied is https://github.com/typesense/typesense-docsearch-scraper/blob/0.6.0.rc2/scraper/src/documentation_spider.py#L88-L90, which passes it as LxmlLinkExtractor(deny=stop_urls). That in turn is a bit of a dead end until you find the upstream docs for py-scrapy at https://docs.scrapy.org/en/2.8/topics/link-extractors.html#scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor where it is explained as:

  • deny: a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted).
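To illustrate the behaviour described above, here is a minimal sketch using only Python's standard library. It assumes each pattern is applied as an unanchored regex search against the absolute URL, which is my reading of the Scrapy docs rather than something confirmed in this repo. The helper name is_denied and all URLs and patterns are hypothetical.

```python
import re


def is_denied(url, stop_urls):
    """Return True if any pattern is found anywhere in the absolute URL.

    Assumption: patterns are treated as unanchored regex searches,
    approximating how a deny list would filter extracted links.
    """
    return any(re.search(pattern, url) for pattern in stop_urls)


# Hypothetical patterns, one per style of value discussed above.
stop_urls = [
    "https://example.org/docs/legacy/",  # full URL, acts as a prefix match
    "/changelog",                        # path-only string, acts as a substring match
    r"\.pdf$",                           # explicit regex anchored at the end
    "v1.0",                              # unescaped dot matches any character
]

# Hypothetical URLs to check against the patterns.
urls = [
    "https://example.org/docs/legacy/intro.html",
    "https://example.org/docs/guide/changelog.html",
    "https://example.org/docs/guide/manual.pdf",
    "https://example.org/docs/v1x0/intro.html",
    "https://example.org/docs/guide/intro.html",
]

for url in urls:
    print(url, "denied" if is_denied(url, stop_urls) else "allowed")
```

Note the fourth URL: the unescaped dot in "v1.0" also matches "v1x0", so escaping it as v1\.0 is the safer choice when the literal dot matters.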

Other

I'm leaving this here for others to find. Perhaps it could be documented in more detail, or at least connecting the dots to upstream py-scrapy would help a lot, without necessarily duplicating it in detail within Typesense's docs.

jasonbosco commented 1 year ago

@Krinkle That should indeed be stop_urls. I've fixed the docs.

Here are the docs for stop_urls: https://docsearch.algolia.com/docs/legacy/config-file/#stop_urls-optional

Krinkle commented 1 year ago

I see the legacy link is already present in the Typesense doc page. I missed this earlier, I think because I was so focused on reading the "changes to the DocSearch Scraper config file" section that I didn't realize the preceding text linked to that page, which indeed details all the options. I think I assumed for some reason that the page wouldn't have such detailed docs, so I didn't even go there. Thanks!