internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Need to disable login authentication and ssl #338

Closed: Glenruben closed this issue 1 year ago

Glenruben commented 4 years ago

Hi! First of all, I'm sorry if this question has been covered in documentation or previous issues, but I have not been able to find information about this.

I have a use case where I need to disable the built-in login and authentication in Heritrix and connect to it over plain HTTP. I understand the need for built-in security and appreciate that it's on by default, but I also want to be able to control it.

My use case is to run Heritrix in a cloud Kubernetes cluster, protected by our company-wide login, where we terminate SSL ourselves. The self-signed SSL certificate causes trouble (although I suppose I could replace it), and we would very much like to disable both it and the built-in login scheme altogether.

Having this as a startup option as described here would be enormously helpful!

ato commented 4 years ago

I can confirm that such an option is not currently implemented. I believe the original developers intentionally left it out because Heritrix enables trivial remote code execution, and they wanted to avoid anyone enabling such an option out of convenience and being compromised as a result.

This seems like a reasonable use case, although it does make me wonder whether there's a way for Heritrix to support external authentication proxies while also ensuring that a trivial misconfiguration doesn't leave the system wide open.

Glenruben commented 4 years ago

OK, thank you for clarifying. Looks like we'll have to find a workaround.
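
For anyone landing here with the same setup, one possible workaround (an assumption on my part, not something confirmed in this thread) is to leave Heritrix's defaults in place and have the component behind the company login talk to it over HTTPS without certificate verification, supplying the Digest credentials itself. The sketch below uses only the Python standard library. The helper names `make_tls_context`, `make_heritrix_opener`, and `engine_url` are hypothetical, and the base URL and `admin:admin` credentials are placeholders.

```python
import ssl
import urllib.request


def make_tls_context(verify=False):
    """Build a TLS context; by default it accepts a self-signed certificate."""
    context = ssl.create_default_context()
    if not verify:
        # Assumption: Heritrix is still serving its self-signed certificate.
        # check_hostname must be disabled before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def make_heritrix_opener(base_url, username, password, verify=False):
    """Return a urllib opener that answers Heritrix's HTTP Digest challenge."""
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, base_url, username, password)
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=make_tls_context(verify)),
        urllib.request.HTTPDigestAuthHandler(passwords),
    )


def engine_url(base_url):
    """URL of the engine resource, e.g. https://localhost:8443/engine."""
    return base_url.rstrip("/") + "/engine"


# Usage (needs a running Heritrix instance, so it is left commented out):
# opener = make_heritrix_opener("https://localhost:8443", "admin", "admin")
# request = urllib.request.Request(
#     engine_url("https://localhost:8443"),
#     headers={"Accept": "application/xml"},
# )
# print(opener.open(request).read().decode())
```

Skipping verification only makes sense on the hop between the proxy and Heritrix inside the cluster. Pass `verify=True` once the self-signed certificate has been replaced with one the client trusts.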