Krinkle closed this 1 year ago
@Krinkle That should indeed be `stop_urls`. I've fixed the docs.
Here are the docs for `stop_urls`: https://docsearch.algolia.com/docs/legacy/config-file/#stop_urls-optional
I see the legacy link is already present in the Typesense doc page. I missed this earlier, I think because I was so focused on reading the "changes to the DocSearch Scraper config file" section that I didn't realize the preceding text linked to that page, which indeed details all the options. I think I assumed for some reason that that page wouldn't have such detailed docs, so I didn't even go there. Thanks!
Description
The page at https://typesense.org/docs/guide/docsearch.html#create-a-docsearch-scraper-config-file mentions `end_urls`, however this phrase does not appear in this repository. I believe that is meant to be `stop_urls`; perhaps it got renamed or was misspelled.
It took me a while to figure out what kind of value this takes and how it behaves. For example, does it have to be an exact URL including protocol and identical path, or can it be a path-only URL, or a path prefix even, or perhaps even an arbitrary substring or a regex? It appears the answer is all of the above, because the value is interpreted as a regex, and most freeform strings effectively double as substring matchers when taken as a regex, so long as no special characters appear besides a `.` dot, which mostly works as expected even if left unescaped.
Searching in the repo (link) with GitHub search doesn't lead to where this configuration variable is "really" used, since GitHub's new search engine seems to limit results to one match block per file. I believe the part where the configuration for `stop_urls` is actually applied is at https://github.com/typesense/typesense-docsearch-scraper/blob/0.6.0.rc2/scraper/src/documentation_spider.py#L88-L90, which passes it as `LxmlLinkExtractor(deny=stop_urls)`. That in turn is a bit of a dead end until you find the upstream docs for py-scrapy at https://docs.scrapy.org/en/2.8/topics/link-extractors.html#scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor, where the `deny` parameter is explained.
Other
I'm leaving this here for others to find. Perhaps it could be documented in more detail, or at least connecting the dots to the upstream py-scrapy docs would help a lot, without necessarily duplicating them in detail within Typesense's docs.
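For anyone landing here, this is a rough sketch of the matching behavior described above. It does not call the scraper or Scrapy; it assumes each `stop_urls` entry is applied as an unanchored regex search against the absolute URL, which is my reading of how the `deny` argument behaves. The `is_denied` helper, the example.org URLs, and the pattern values are all made up for illustration.

```python
import re

# Hypothetical stop_urls values, chosen only to illustrate the cases above.
stop_urls = [
    "https://example.org/docs/legacy/",  # full URL, acts as a prefix/substring
    "/changelog",                        # path-only fragment
    r"\.pdf$",                           # explicit regex anchored at the end
]


def is_denied(url, patterns):
    """Return True if any pattern is found anywhere in the URL.

    Assumption: this mirrors an unanchored regex search per pattern,
    which is how I understand the deny list to be evaluated.
    """
    return any(re.search(pattern, url) for pattern in patterns)


for url in [
    "https://example.org/docs/legacy/config.html",
    "https://example.org/blog/changelog/2023",
    "https://example.org/files/manual.pdf",
    "https://example.org/docs/guide/",
]:
    print(url, "->", "skipped" if is_denied(url, stop_urls) else "crawled")

# An unescaped dot matches any character, so this pattern is slightly
# broader than the literal hostname it looks like.
print(is_denied("https://exampleXorg/", ["example.org"]))
```

Under these assumptions the first three URLs are skipped and the last one is crawled. The final line shows why escaping the dot (`example\.org`) is the stricter choice, even though the unescaped form rarely causes trouble in practice.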