Do no accept invalid HTTPs connections

rgaudin commented 1 year ago

We've never allowed invalid HTTPs as start URL for zimit but I frequently see failed attempts in youzim.it logs about self-signed certificates.

I am not sure about the crawler's behavior regarding this. There doesn't seem to be a related flag on the crawler and there is no issue regarding certificates so maybe the default is to allow insecure connection… If it's not, we could still pass --allow-running-insecure-content via the CHROME_FLAGS environment.

Currently, it fails before we start the crawler, when we check the URL in zimit (python).

kelson42 commented 1 year ago

If it's not, we could still pass --allow-running-insecure-content via the CHROME_FLAGS environment.

--insecure seems indeed the way to go.

kelson42 commented 1 year ago

@rgaudin This seems straighforward to implement. Per default it should not be insecure, but in the context of youzim.it I would activate it (not need to allow user to change this IMO).

Doing this would allow to reduce easily the number of scraping just failing.

kelson42 commented 1 year ago

I found new errors (for Zimit scrapes) because actually the https configuration we were using is pretty picky. We should really implement this ASAP.

benoit74 commented 1 year ago

This is definitely an important issue BUT not a "good first issue" AFAIK, since this change must be done in python check_url + in call to Browsertrix crawler. And we may probably not be able to use the Chrome flag mentioned by @rgaudin now that crawler has been migrated to Brave. I would like to not use a trick like this kind of flag but implement the needed option in crawler (maybe it even exists now) since otherwise it puts our code at risk in case of a new change of underlying browser used like it just happened with the move to Brave (and implementing this upstream is probably a no-brainer). Probably not a small issue we can implement quickly with all that said.

benoit74 commented 1 year ago

And from a higher PoV, I'm really not sure it is a good idea to make this a default.

As mentioned by another issue, security of content grabbed into the ZIM is quite important since we are somehow responsible of this. If some viruses achieves to make its way into the ZIM, we will at least have our share of responsibility.

If someone achieves to man-in-the-middle any of our Zimit worker, he will very easily be able to inject all kind of payloads into the ZIM.

With insecure HTTP accepted by default, it means that the kind of attacks will be even easier to perform.

rgaudin commented 1 year ago

@kelson42 did not propose to make it a default but to use it by default in youzim.it Responsibility chain is very different there as users request a particular URL to be turned into a ZIM. TLS only ensures encryption between zimit and the server. Bad content can already be served and thus included into ZIM.

While I'm in favor of an option to disable security, I'm not comfortable enabling it on youzim.it because we'd be breaking the chain of trust.

benoit74 commented 1 year ago

I agree I did not get the point very precisely.

And even if the chain of responsibility is different on youzim.it, while the end-user can check that the URL he is entering has no bad content (are at least that this is what he intend to package), he won't be aware of any man-in-the-middle between zimit and his intended website manipulating the ZIM content. Ensuring there is minimal risk of such a man-in-the-middle is our responsibility. Ensuring no-one can package bad content into a ZIM (e.g. by setting up a server with bad content and requesting it in youzim.it) is something obviously way more debatable.

kelson42 commented 9 months ago

We have regularly case because of current strict TLS configuration, see https://github.com/kiwix/k8s/issues?q=is%3Aissue+is%3Aclosed+sort%3Aupdated-desc+routine. I have removed the question tag of this issue.

benoit74 commented 9 months ago

Correct me if you are not ok with proposition below.

Part 1

we want to add a new --insecure parameter to the crawler which would turn off all SSL certificate checks (or at least as much as possible)
this flag would be turned off by default (i.e. we use a secure configuration)
it will be exposed just like other flags on the zimfarm and zimit-frontend UIs

Part 2

an environment variable at Zimfarm level will allow to alter the default value of --insecure parameter (just like we did for Publisher which is "openZIM" by default but can be modifier if needed) ; this will be used on youzim.it to reduce errors

Some notes:

this situation is one more reason why we would benefit to have run a dry-run of the zimit scraper, to raise a nice warning early if the user forgot to disable SSL checks on his insecure website (https://github.com/openzim/zimfarm/issues/891)
I'm not really comfortable with part 2, but I can live with it especially until we have the dry-run feature (which might not be there until few years tbh)

Popolechien commented 9 months ago

(stupid) question: what is the risk of having --insecure on by default? We would have less errors, but on the flip side...?

kelson42 commented 9 months ago

* we want to add a new `--insecure` parameter to the crawler which would turn off all SSL certificate checks (or at least as much as possible)

For Zimit only.

* this flag would be turned off by default (i.e. we use a secure configuration)

I don't care, but "yes".

* it will be exposed just like other flags on the zimfarm and zimit-frontend UIs

Yes for Zimfarm, I don't think this is necessary for Zimit-frontend.

* an environment variable at Zimfarm level will allow to alter the default value of `--insecure` parameter (just like we did for  Publisher which is "openZIM" by default but can be modifier if needed) ; this will be used on youzim.it to reduce errors

Configure software with proper configuration files, configuration via ENV variable is simply bad in 90% of the cases.

* this situation is one more reason why we would benefit to have run a dry-run of the zimit scraper, to raise a nice warning early if the user forgot to disable SSL checks on his insecure website ([Design/Architecture: how to run some checks of recipe configuration from the UI (dry run) zimfarm#891](https://github.com/openzim/zimfarm/issues/891))

99% of people have no clue how to deal with a SSL error, here does not make sense IMHO. Keep it simple, the user risk is really almost null.

benoit74 commented 9 months ago

(stupid) question: what is the risk of having --insecure on by default? We would have less errors, but on the flip side...?

The flip side is that we will have all recipes running an insecure configuration. Meaning a bigger attack surface. More chances to put unsolicited content in a ZIM. And this will be our responsibility since we took the decision to run an insecure configuration by default.

The secure configuration / HTTPS ensures that the website which is responding is the proper one. If an attacker achieves to modify our network to respond with another server than the one which is supposed to respond, he won't have the proper HTTPS setup, and in the secure context the connection will fail, we won't create the ZIM with unexpected/bad content.

When a user requests an zimit scrape, he probably hopes we will put inside the ZIM content from the real website he is targeting. If an attacker has modified our network to reply to web requests with whatever he likes and we are running in the insecure configuration, we will put this illegitimate content inside the ZIM. The user will never be warned about it. We will never be warned about it. The ZIM might now contain a defaced website, harmful payloads, ...

The risk of an attacker being able to modify the network setup is probably limited in the zimit setup where we have full control on the worker. It is bigger on the opemZIM farm, because we have no control on the worker (and the worker owner could be the attacker). We do not share the same sensitivity on this risk with Emmanuel.

Once the risk will materialize (if it does, we have significant chances it won't) and if the insecure configuration has been used by default, we will have very little arguments but "we are sorry, we decided to run an insecure configuration to have less work/errors". Most users won't care. Technical users will be very angry.

If the insecure configuration is not applied by default at all and it is the youzim.it user which explicitly decides to run the recipe in an insecure configuration, as said, risk is transferred, we honored our responsibilities in the chain of trust.

Trust is always very difficult to gain and very easy to loose. But it is definitely your responsibility. I can definitely implement whatever you decide. I just cannot let you decide without warning you about the risks as I perceive them (and my perception might be totally wrong).

benoit74 commented 9 months ago

For Zimit only.

Yes

Yes for Zimfarm, I don't think this is necessary for Zimit-frontend.

OK

Configure software with proper configuration files, configuration via ENV variable is simply bad in 90% of the cases.

This is the current way of doing things in zimfarm (and most other tools we have AFAIK). But I could say that the ENV variable is configured in a k8s configuration file, so we match somehow your requirement.

99% of people have no clue how to deal with a SSL error, here does not make sense IMHO. Keep it simple, the user risk is really almost null.

I disagree, I don't see why we can't say to the user something like "oh, we are sorry but it looks like the website you are targeting is not running a secure configuration ; do you allow to us to proceed with an insecure one". All browsers already do it. And yes, in 99% of the cases the user decides to take the risk, but the responsibility has been explicitly transferred.

benoit74 commented 7 months ago

https://github.com/openzim/zimit/pull/285 somehow resolved the point for Zimit2:

Browsertrix crawler runs in insecure mode by default and has no option to force HTTPS certificates validation and so on
We have now removed the additional Python code which was forcing zimit scraper to have a valid HTTPS certificate

Should we open another issue to "force" the validity of HTTPS connection before proceeding?

kelson42 commented 7 months ago

Should we open another issue to "force" the validity of HTTPS connection before proceeding?

This should be done by Browsertrix and we should instrument it. @rgaudin and @benoit74 wanted to have a secure behaviour per default and fine to me.

benoit74 commented 7 months ago

I'm going to open issues then

benoit74 commented 7 months ago

Or more exactly, I will reopen this one and open one in Browsertrix Crawler.

benoit74 commented 7 months ago

Upstream issue is here: https://github.com/webrecorder/browsertrix-crawler/issues/510

benoit74 commented 5 months ago

Is this mandatory for 2.0?

openzim / zimit

Do no accept invalid HTTPs connections #166