qgis / QGIS-Django

Django project for QGIS related activities such as plugin repository
http://qgis.org
GNU General Public License v2.0

Implement a human validation to the download #402

Open Xpirix opened 1 month ago

Xpirix commented 1 month ago

From @timlinux

Original post at https://lists.osgeo.org/pipermail/qgis-user/2024-May/054439.html

Can you implement a check for the user agent and, if it is not QGIS Desktop (using agent requests, IIRC), make the user complete a CAPTCHA before downloading?

timlinux commented 1 month ago

We want to make it more difficult for crawlers to pull out every plugin from the site which creates crippling load. So we propose two steps:

  1. Add a check in nginx to see if the user agent is QGIS and provide the download directly.
    location ~* ^/plugins/[^/]+/version/[^/]+/download/$ {
        if ($http_user_agent !~* "Mozilla/5.0 QGIS") {
            return 403 "Forbidden: Please use QGIS to download .zip files.";
        }
    }
  2. Create a new endpoint in the front end, e.g. 'get' instead of 'download'. This will prevent robots from auto-downloading all plugins:

[image: screenshot of the proposed download page, described below]

1 - The name of the plugin is shown to the user. Hovering over the name will show a 'copy' indicator. Clicking it will copy the name to your clipboard.

2 - The user enters the name in the input box, either manually or by pasting from the clipboard action carried out in 1 above.

3 - The download button is enabled and the user receives the zip file.
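
The name check described in steps 1 to 3 could look roughly like the sketch below on the server side. This is only an illustration: the function name `name_matches` is hypothetical and is not part of QGIS-Django, and ignoring case and surrounding whitespace is an assumption made so that a pasted name still validates.

```python
def name_matches(expected_name: str, typed_name: str) -> bool:
    """Return True when the typed text matches the plugin name shown to the user.

    Hypothetical helper: surrounding whitespace and letter case are ignored,
    which is an assumption, not a documented requirement.
    """
    return typed_name.strip().casefold() == expected_name.strip().casefold()


# The download button (or the server-side view) would only proceed on a match.
print(name_matches("QuickMapServices", " quickmapservices "))
print(name_matches("QuickMapServices", ""))
```

A check like this only slows down naive crawlers; a robot that reads the name from the page can still fill in the form.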

Guts commented 1 month ago

Thanks for reacting quickly to this :clap: !

Is it possible to add user-agents to the whitelist? I'm developing QDT, a tool that automates the deployment of QGIS profiles, and it definitely downloads plugins. I'm also thinking of @Gustry's QGIS Plugin Manager.

Gustry commented 1 month ago

Thanks for the dev @Xpirix and @timlinux

QGIS-Plugin-Manager was fixed this morning to add a User-Agent (see https://github.com/3liz/qgis-plugin-manager/issues/66 ). Let me know if this is not OK and I will update it. I'm following this issue.

Guts commented 1 month ago

QGIS-Plugin-Manager was fixed this morning to add a User-Agent (see https://github.com/3liz/qgis-plugin-manager/issues/66 ).

So you simulate the QGIS user-agent? Is that the recommended practice? I thought it was better for every application to have its own user-agent. This is what I implemented in QDT: https://github.com/Guts/qgis-deployment-cli/blob/ef43bbc658f00ad019e6e0e7b2341961a7ae49ba/qgis_deployment_toolbelt/utils/file_downloader.py#L44
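
For readers unfamiliar with the idea, a tool-specific user-agent can be set with the Python standard library as sketched below. The tool name, version, and URL are placeholders for illustration and are not taken from QDT or QGIS-Plugin-Manager. Note that, under the nginx rule proposed earlier in this thread, such a user-agent would be rejected unless it is explicitly allowed.

```python
from urllib.request import Request

# Hypothetical tool name and version, for illustration only.
TOOL_USER_AGENT = "example-plugin-tool/1.0"


def build_download_request(url: str, user_agent: str = TOOL_USER_AGENT) -> Request:
    """Build (but do not send) a request that identifies the calling tool."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL following the /plugins/<name>/version/<version>/download/ layout.
request = build_download_request(
    "https://example.org/plugins/foo/version/1.0/download/"
)
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```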

Gustry commented 1 month ago

From https://github.com/qgis/QGIS/issues/57428#issuecomment-2111594399 I thought his CI pipeline was already broken because of the change. It seems I read too fast :/

As QGIS-Plugin-Manager is used on our hosting infrastructure, I thought I needed to make a quick patch to keep it working.

As I said, I'm of course happy to define a proper user-agent for this tool if simulating the QGIS one is not acceptable. I'm following the discussion and I will update if needed. Let us know, @timlinux and @Xpirix.

benz0li commented 1 month ago

From qgis/QGIS#57428 (comment) I thought his CI pipeline was already broken because of the change. It seems I read too fast :/

@Gustry Yes, it was broken – but most likely due to overload.

So you simulate the QGIS user-agent? Is that the recommended practice? I thought it was better for every application to have its own user-agent.

@Guts I do not see any harm in [qgis-plugin-manager] simulating the QGIS user-agent. No, this is not best practice.

  1. Add a check in nginx to see if the user agent is QGIS and provide the download directly.

@timlinux Crawlers/Robots will do the same, i.e. set the user-agent to QGIS, as soon as they get blocked.

benz0li commented 1 month ago

@timlinux I suggest Rate Limiting and possibly automatic blocking of CIDR blocks[^1] or single IP addresses.

[^1]: should the number of IP addresses [dropped due to rate limiting] in the NGINX logs surpass a certain threshold [of a predefined prefix size].
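
The footnote's threshold rule can be sketched as follows. This is an illustrative sketch only: the function name, the default /24 prefix, and the threshold value are assumptions, and the input is assumed to be a list of IPv4 addresses already extracted from the NGINX logs as "dropped due to rate limiting".

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network


def networks_to_block(dropped_ips, prefix_len=24, threshold=3):
    """Return networks where more than `threshold` distinct IPs were rate limited.

    Hypothetical helper: prefix_len and threshold are placeholder values.
    """
    seen = defaultdict(set)
    for raw in dropped_ips:
        addr = ip_address(raw)
        # strict=False masks the host bits, giving the enclosing network.
        net = ip_network(f"{addr}/{prefix_len}", strict=False)
        seen[net].add(addr)
    blocked = [net for net, addrs in seen.items() if len(addrs) > threshold]
    return sorted(blocked, key=str)


dropped = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "198.51.100.7"]
print(networks_to_block(dropped, prefix_len=24, threshold=2))
```

Counting distinct addresses per prefix, rather than raw requests, means a single noisy client gets its own IP blocked but does not take its whole network down with it.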

Xpirix commented 1 month ago

I suggest Rate Limiting and possibly automatic blocking of CIDR blocks[^1] or single IP addresses.

Thanks @benz0li , I think your suggestion makes sense. I will take a look at it.

Is it possible to add user-agents to the whitelist?

As I said, I'm of course happy to define a proper user-agent for this tool if simulating the QGIS one is not acceptable. I'm following the discussion and I will update if needed. Let us know, @timlinux and @Xpirix.

Sure, I will work on this issue and let you know if a specific user-agent is required. I also think simulating QGIS's user agent is not harmful, even if it is not good practice. Otherwise, we would have to add each specific user agent to the nginx configuration.

@timlinux Is it okay if I combine the ideas into two levels:

  1. User agent check: download directly if the request comes from QGIS; otherwise use the new endpoint with human validation
  2. Add rate limiting or automatic blocking as suggested by @benz0li. This is aimed at dedicated crawlers that simulate the QGIS user agent
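
To make the second level concrete, here is a minimal sliding-window rate limiter. It only illustrates the logic: the class name and the limits are made up, and in this setup the limiting would more likely be configured in the web server than written in Python.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed hits

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Placeholder limits: 2 downloads per 60 seconds per client.
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("203.0.113.1", t) for t in (0.0, 1.0, 2.0, 61.0)])
```

Because the limiter keys on the client address and not on the user agent, it still applies to crawlers that copy the QGIS user agent.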

kannes commented 1 month ago

Is there any insight into which clients or user-agents are causing the issues? Are they crawlers that would abide by robots.txt or similar rules?

Why is the server's CPU saturated by what seem to be simple file downloads, which should be cheap to serve and should saturate the network bandwidth first?

Is the only reason the files are not served from a static directory that the site tries to count downloads, and is that the reason for the CPU usage? If so, switching to a simpler serving setup with a regular cron job that parses the logs for download counts might be a less intensive solution.
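
A log-parsing job of the kind suggested here might look like the sketch below. It assumes access-log lines in the common "combined" layout and the `/plugins/<name>/version/<version>/download/` URL pattern from the nginx snippet earlier in this thread; the function name and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the request and status fields of a combined-format access log line.
DOWNLOAD_RE = re.compile(
    r'"GET /plugins/(?P<plugin>[^/\s"]+)/version/(?P<version>[^/\s"]+)'
    r'/download/[^"]*" (?P<status>\d{3}) '
)


def count_downloads(log_lines):
    """Count successful (HTTP 200) plugin downloads per plugin name."""
    counts = Counter()
    for line in log_lines:
        match = DOWNLOAD_RE.search(line)
        if match and match.group("status") == "200":
            counts[match.group("plugin")] += 1
    return counts


sample = [
    '203.0.113.5 - - [20/May/2024:10:00:00 +0000] '
    '"GET /plugins/foo/version/1.0/download/ HTTP/1.1" 200 1234 "-" "Mozilla/5.0 QGIS/33600"',
    '203.0.113.6 - - [20/May/2024:10:00:01 +0000] '
    '"GET /plugins/bar/version/2.0/download/ HTTP/1.1" 403 50 "-" "curl/8.0"',
]
print(count_downloads(sample))
```

Run from cron against rotated logs, a script like this would let the zip files be served statically while still producing download counts.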

As a user who occasionally manually downloads plugins and who very rarely downloads all plugins (maybe 1-2 times per year), any intermediate step sounds annoying to users.

komima commented 1 month ago

We are facing issues with the repository sometimes being unavailable for automated deployments.

We have a preconfigured list of plugins that are downloaded and extracted on deployment for our environments, so all our users have the necessary plugins available on startup. Each deployment downloads the required plugin zips (currently ~3, possibly more in the future; up to ~10 deployments a day, usually fewer), extracts them, and uses QGIS_PLUGINPATH to provide the specified versions of the plugins. Clients do not download plugins themselves (and users are instructed not to).

While it is possible, and probably very beneficial, to implement a mirror repository to avoid deployments failing due to external service downtime and to reduce the load when a cached package can be used instead, there are important considerations on how the limiting method affects mirroring use cases:

timlinux commented 1 month ago

We are facing issues with the repository sometimes being unavailable for automated deployments.

We have a preconfigured list of plugins that are downloaded and extracted on deployment for our environments, so all our users have the necessary plugins available on startup. Each deployment downloads the required plugin zips (currently ~3, possibly more in the future; up to ~10 deployments a day, usually fewer), extracts them, and uses QGIS_PLUGINPATH to provide the specified versions of the plugins. Clients do not download plugins themselves (and users are instructed not to).

While it is possible, and probably very beneficial, to implement a mirror repository to avoid deployments failing due to external service downtime and to reduce the load when a cached package can be used instead, there are important considerations on how the limiting method affects mirroring use cases:

* Custom logic for downloads will break mirroring setups

  * Can a token or username/password be used instead to skip the user-agent or form checks, if implemented? Basic auth can probably be considered standard and is supported by most caching/mirroring repository software, while custom user-agents are not (and would feel like hacky impersonation anyway)
  * Multiple custom steps to download the packages (by mocking a user clicking through the forms) would require manual synchronization scripts, which would have to be kept up to date for every plugin that needs caching

* Current URL layout is not mirroring friendly

  * For example Artifactory OSS will only support bundled repo layouts (pypi/maven/etc) for remotes. Could there be an additional PEP 503 simple repo api for QGIS plugins repo, where the files could be downloaded from? Possibly with a less CPU hungry backend serving those like [Implement a human validation to the download #402 (comment)](https://github.com/qgis/QGIS-Django/issues/402#issuecomment-2117509263) suggested
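
To show what a PEP 503 style layout involves, the sketch below normalizes a project name using the rule given in PEP 503 and renders a minimal project page with one anchor per file. The plugin name, file name, and URL are placeholders; no such endpoint exists in the QGIS plugin repository, and whether mirroring tools would accept plugin zips served this way is an open question.

```python
import re
from html import escape


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_page(files: dict[str, str]) -> str:
    """Render a minimal PEP 503 style project page: one anchor per file."""
    links = "\n".join(
        f'    <a href="{escape(url, quote=True)}">{escape(filename)}</a><br/>'
        for filename, url in sorted(files.items())
    )
    return f"<!DOCTYPE html>\n<html>\n  <body>\n{links}\n  </body>\n</html>\n"


# Hypothetical plugin and file location, for illustration only.
print(normalize("My_Plugin.Name"))
print(project_page({"my-plugin-1.0.zip": "https://example.org/files/my-plugin-1.0.zip"}))
```

Pages like these are static HTML, so they could be generated ahead of time and served without touching the Django application.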

Thanks @komima for your input.

So for the short term, please update your script's UA so that it can download the plugins without user interaction. We will implement a better approach for the longer term. We will be happy to build a formal API with tokens issued to users for the medium term, but for now I just want to make sure the broad gamut of users have a good experience.

timlinux commented 1 month ago

OK, I discussed this further with @mbernasocchi. We agreed that starting with just the rate limiting option would be better. @Xpirix, will you implement it accordingly?

Xpirix commented 1 month ago

@timlinux Sure, I will implement it.

Xpirix commented 1 month ago

Please find the proposed PR for https://github.com/qgis/QGIS-Django/issues/402#issuecomment-2136054622 at #413