typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

How can I configure docsearch-scraper to run against a private internal documentation site that requires auth via oauth2? #51

Open liberty-wollerman-kr opened 9 months ago

liberty-wollerman-kr commented 9 months ago

Description

This is a general question, not an issue per se. I'd first like to understand whether the Typesense docsearch-scraper has any support for scraping content that requires authentication via OAuth2. From what I have read (https://typesense.org/docs/guide/docsearch.html#tips-for-common-challenges-or-more-complex-use-cases), some other authentication mechanisms are supported, but OAuth2 does not appear to be mentioned.

I've begun researching some possible options (e.g. https://docsearch.algolia.com/docs/legacy/run-your-own#run-the-crawl-from-the-docker-image). This "run your own" option is interesting, since I have chosen the self-hosted Typesense installation anyway. I'd like some help understanding how to set up the configuration and environment to handle this scenario. Does this require me to fork the docsearch code and add an auth module to the docker image?

My UI is deployed in a Rancher-managed Kubernetes cluster, where I will also host the Typesense server. Would it be possible to create an ingress rule that lets, for example, a pipeline build agent through without auth so that it can run the scrape/indexing process?

Can you help me clarify what my options are here, and provide me with some guidance on how to implement a solution?

jasonbosco commented 9 months ago

> Does this require me to fork the docsearch code and add an auth module to the docker image?

That's right - you would have to fork the code in this scraper repository. Here's a PR where another user added support for a new auth mechanism, which you could use as a reference: https://github.com/typesense/typesense-docsearch-scraper/pull/41
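For anyone weighing the fork route, here is a rough, generic sketch of the token step an OAuth2 auth module would need. It is not how the scraper or the PR above does it. It assumes the identity provider supports the client-credentials grant and accepts the client ID and secret in the form body; the function names and the token URL are hypothetical, and wiring the resulting header into the scraper's requests is left out.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret):
    """Build (but do not send) an OAuth2 client-credentials token request.

    Assumption: the provider accepts client credentials in the form body.
    Some providers require HTTP Basic authentication instead.
    """
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(token_url, client_id, client_secret):
    """Send the token request and return the access token from the JSON reply."""
    request = build_token_request(token_url, client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]


def bearer_headers(access_token):
    """Header to attach to every request made against the protected docs site."""
    return {"Authorization": f"Bearer {access_token}"}
```

A forked scraper would call something like `fetch_access_token(...)` once at startup and attach `bearer_headers(token)` to each crawl request; where exactly that hook lives depends on the scraper code.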

liberty-wollerman-kr commented 8 months ago

Thanks for responding. I looked into this option, but ultimately decided to handle it a different way, which leads to another question that I hope you can help with. As for the security issue, I've decided to bypass authentication for the relevant endpoint by adding a rule to our Spring Security OAuth resource configuration. Something like: httpSecurity.authorizeRequests().antMatchers("/api/controller/endpoint").permitAll()

This leads to another nuance of my use case: we keep our documentation pages in a GitHub repo and use the GitHub document API to pull and refresh the documentation cache, which is then rendered in the Help section of our web app. I'm now looking at creating a new controller endpoint, or piggybacking on our refreshCache service controller endpoint, to kick off the indexing.

My new question is: do you have a recommendation for how to do this? What kind of Typesense configuration would be appropriate for indexing the document that is retrieved by the GitHub document API call (https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-repository-content)? I'm also pondering how to render each search result as a link, and it is not immediately obvious to me how this all ties together. I am struggling to understand which out-of-the-box Typesense options are best for my use case; once I know that, I can possibly build the custom URL parsing from there. Any thoughts or pointers in the right direction would be appreciated.
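To make the question concrete, here is a minimal sketch of the mapping I have in mind, in Python for brevity. It assumes the GitHub response is the single-file form with `name`, `path`, `encoding` and base64 `content` fields. The Typesense field names (`title`, `content`, `url`), the function name and the help-page base URL are all hypothetical, my own choices rather than anything the scraper defines.

```python
import base64


def to_typesense_document(contents_response, app_help_base_url):
    """Map one GitHub "get repository content" file response to a flat document.

    The output field names are illustrative; they would have to match
    whatever collection schema is created in Typesense.
    """
    if contents_response.get("encoding") != "base64":
        raise ValueError("expected base64-encoded file content")

    # GitHub wraps the base64 payload across lines; b64decode's default
    # (non-validating) mode ignores the embedded newlines.
    text = base64.b64decode(contents_response["content"]).decode("utf-8")
    path = contents_response["path"]

    return {
        # Using the repo path as the id so that re-indexing the same file
        # replaces the earlier document instead of adding a duplicate.
        "id": path,
        "title": contents_response["name"],
        "content": text,
        # Stored alongside the text so the UI can render each hit as a link.
        "url": f"{app_help_base_url.rstrip('/')}/{path}",
    }
```

The resulting dictionaries could then be sent to the Typesense documents API with an upsert action, and the search UI would read the `url` field from each hit to build the link. Whether that is preferable to running the scraper against the rendered Help pages is exactly what I'm unsure about.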