openzim / zimit-frontend

Zimit Public Web UI
https://zimit.kiwix.org
GNU General Public License v3.0

Block users submitting more than one task in parallel to ensure fair use #56

Open benoit74 opened 4 months ago

benoit74 commented 4 months ago

Currently we see many tasks that look like multiple requests coming from the same user.

We cannot be sure in all cases because we do not track users and do not force them to enter their emails, but in some cases we have the email and it is clear, and in other cases the URLs are so closely related that it seems obvious.

While https://github.com/openzim/zimit-frontend/issues/32 would help, it might not be sufficient.

I suggest that we warn users when they already have at least one task in the pipe, with something like "This is a fair-use free service, please avoid submitting too many tasks at the same time. You already have xxx task(s) in the pipe, please wait for completion."

Detecting the user and their associated tasks could be done with a tracking cookie. We should force the user to accept this cookie (this is a free service, we can impose some constraints) and make it clear that this tracking cookie is used only for fair use of the service and for some internal statistics.
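
A minimal sketch of the warning logic described above, assuming each task record stores the tracking cookie value and a status. The `Task` fields, the status names and `fair_use_warning` are hypothetical and not taken from the Zimit code base.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical status names for tasks that are still "in the pipe".
ACTIVE_STATUSES = {"requested", "ongoing"}


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions."""
    cookie_id: str  # value of the tracking cookie set on first visit
    status: str


def fair_use_warning(tasks: List[Task], cookie_id: str) -> Optional[str]:
    """Return a warning if this cookie already has tasks in the pipe, else None."""
    pending = sum(
        1 for t in tasks
        if t.cookie_id == cookie_id and t.status in ACTIVE_STATUSES
    )
    if pending == 0:
        return None
    return (
        "This is a fair-use free service. "
        f"You already have {pending} task(s) in the pipe, "
        "please wait for completion before submitting more."
    )
```

This only warns; the request itself would still be accepted, which matches the "warn, do not block" position of this comment.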

I don't think that blocking the user would help: first, it will cause frustration; second, whatever system we put in place, if we want to keep the possibility for users to stay anonymous we cannot enforce a block, since it will always be possible (and easy) to circumvent.

I don't think that inventing something around "similar-looking URLs" would help either, since this might block two users who request almost the same task at the same time without knowing each other, and again it would cause frustration while being difficult to implement.

rgaudin commented 4 months ago

Why not be transparent about the queue size and the (very wrong) ETA? This way we don't have to track or patronize users and we'd get the same results. It could be as simple as a message saying “There are currently 516 tasks in the pipe, your request is expected to be delivered within… 30 days”.
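
A rough sketch of how such a message could be built, assuming the queue size, the number of workers and an average task duration are known. `queue_message` and its parameters are hypothetical, and the estimate is deliberately naive (it ignores priorities, failures and varying task lengths).

```python
import math


def queue_message(queue_size: int, workers: int, avg_task_hours: float) -> str:
    """Build a transparency message with a (very rough) ETA in days."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    # Tasks are assumed to run in waves of `workers` tasks; the new
    # request is counted together with everything already queued.
    waves = math.ceil((queue_size + 1) / workers)
    eta_days = math.ceil(waves * avg_task_hours / 24)
    return (
        f"There are currently {queue_size} tasks in the pipe, "
        f"your request is expected to be delivered within about {eta_days} day(s)."
    )
```

For example, with made-up figures of 4 workers and 5.5 hours per task, a queue of 516 tasks gives an estimate of about 30 days.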

With such information, many of the requests we get would simply not be sent, because some users would lose interest in the ZIM if it can't be retrieved shortly.

I understand the service may look bad when the queue is very long, but it's fair and respectful to the user to announce it, instead of leaving them thinking they submitted a request and got zero response.

benoit74 commented 4 months ago

Announcing ETA is more the proposition of #32 (which mostly speaks about rank, but ETA could be added).

This issue suggests that it might not be enough to make users reasonable. It might be wrong, I don't know, at least it is tracked in an issue now.

rgaudin commented 4 months ago

Ah sorry, I didn't click the issue; I believe it was the ticket we mentioned yesterday about blocking similar requests. I guess I stand by what I proposed 18m ago 😅

kelson42 commented 3 months ago

Let me try to rephrase the problem: we should deliver the ZIM files within 24 hours (if I remember properly the SLA we have set for ourselves) and we fail because we have too many requests (and too little hardware). This ticket is an attempt to reduce the number of requests by reducing the number of "abusive" ones, so at the source.

As abusive behaviour, we want to prevent users from launching many requests in parallel, in particular when it makes little sense, i.e. for the same website.

To reduce the "abuses", we have two ways which are not mutually exclusive: use pedagogy and inform the users, and/or forbid certain user actions.

You have focused your comments on informing users about the delay/size of the queue. I'm not against this, but IMHO it's more important to respect the SLA. And I prefer to have frustrated users because the service does not deliver many ZIM files in parallel, rather than all users waiting for days.

Therefore I propose:

benoit74 commented 3 months ago

I'm fine with the idea of forbidding certain user actions.

I wasn't aware of any SLA (good to know there is one), and I had understood (probably wrongly) that we didn't want to block users at all, and that even tracking them with IP/cookie was a concern.

For peace of mind, I like it when we can block "abuse" rather than hope for users being reasonable.

My only concern with what is proposed is that an approach based on IP cannot work (schools, universities, companies, ...). An approach based on cookies is pretty fragile: cookies are easy to delete and once a user finds the trick, it might spread quite fast in the community. If it was just for pedagogy, we could say that we do not mind. If the goal is to forbid some actions, then we could spend time implementing something to block users ... and be back to square one (only pedagogy) within a few months. If we want to block users, we need something more robust than cookies/IP, which also usually means something more intrusive and usually not free (in terms of money at least). I don't have much to propose unfortunately.

benoit74 commented 3 months ago

Discussed live: we need to use a hybrid approach:

This might block legitimate distinct users coming from the same IP but visiting the site for the first time ... we consider this acceptable for a free service, and because it will be the case only for up to 24h (until the currently ongoing task is completed).
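
A minimal sketch of such a hybrid check. It is built only on the rule quoted later in this thread (a user without a cookie is judged on tasks from the same IP), plus the assumption that a user with a cookie is matched on the cookie only. The `Task` fields, status names and `is_blocked` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical status names for a requested / ongoing task.
ACTIVE_STATUSES = {"requested", "ongoing"}


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions."""
    ip: str
    status: str
    cookie_id: Optional[str] = None


def is_blocked(tasks: List[Task], ip: str, cookie_id: Optional[str] = None) -> bool:
    """Return True if a new request should be refused for fair use."""
    active = [t for t in tasks if t.status in ACTIVE_STATUSES]
    if cookie_id is not None:
        # Known browser: only its own active tasks count (assumption).
        return any(t.cookie_id == cookie_id for t in active)
    # No cookie: fall back to any active task from the same IP.
    return any(t.ip == ip for t in active)
```

Because only active tasks are considered, the block disappears by itself once the ongoing task completes, which is the "up to 24h" behaviour mentioned above.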

benoit74 commented 3 months ago

After some thought, I wonder if it is really worth adding a cookie. It makes the fair-use blocking easier to circumvent. And users behind a single IP are probably from big companies or universities, for which we can consider deploying a custom Zimit service if the need is significant.

benoit74 commented 3 months ago

What has also been discussed is that, when the user is blocked, the implementation of this issue must clearly indicate the reason for the blockage (fair use of a free service) and the fact that we are open to consider deploying custom services for those who need it.

@Popolechien do you have any idea of phrasing / design on this?

Popolechien commented 3 months ago

Isn't a cookie browser-based? It would also be interesting to know how many requests are made with an email address vs. not (which we could also force, because honestly only a tiny fraction is going to leave a window open until they get a result, and I suspect a lot of duplicate requests are from people not realizing results are not immediate and then restarting the query but this time with a request for an email ping).

Edit: I realize that I missed an earlier comment

> if user comes without a cookie, consider tasks from same IP to decide if there is already a requested / ongoing task

As in: same request from same IP but without cookie = block? (I have no opinion really on this, I could see several scenarios warranting a pass rather than a block, incl. from a large IP block)

benoit74 commented 3 months ago

> how many requests are made with an email address vs. not

You have details about this in the export I made the other day.

> As in: same request from same IP but without cookie = block? (I have no opinion really on this, I could see several scenarios warranting a pass rather than a block, incl. from a large IP block)

Yes, but it will be based on the user's IP, so there is no such thing as a large IP block here, only big companies or universities all "hiding" behind one single public IP. Could you explain your other scenarios?

Popolechien commented 3 months ago

Top of my head:

benoit74 commented 3 months ago

Since the blockage will be gone once the task finishes, all but the first scenario you mention are only temporary and point to other potential limitations in the current UI. The only real concern for me is the case where many users are behind the same IP, since one user might be blocked without having been involved at all in this blockage. But again, if many users are behind the same IP, it is probably fair to still limit them to one task at a time; they can always switch to a different IP (their phone, their mum's internet, at home, ...).

rgaudin commented 3 months ago

> And users behind a single IP are probably from big companies or universities, for which we can consider deploying a custom Zimit service if need is significant.

It's not limited to that.

> you have details about this in the export I made the other day

?

benoit74 commented 3 months ago

> you have details about this in the export I made the other day

https://docs.google.com/spreadsheets/d/1GaebcExX7d4jq3ndB6zKnSRElz40fs2bccrGjpNtRs0/edit?usp=sharing

Popolechien commented 3 months ago

Ok thanks a lot. I see that about 30% of requests are anonymous, but then we can't know for sure which ones were requested a second time with an email address. Excluding these, a third of email users entered more than one query, which answers the initial question in this thread.

Other stats of interest: about 5% of requests are duplicates of existing ZIM files, another 5% should be seen as unrealistic (e.g. YouTube or Google Translate), and yet another 5% are naughty (yes, I mean pr0n) requests.

Duplicates (same address requested twice or more, though possibly by different people) represent about 1/3 of all requests.
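
For reference, a small sketch of how such a duplicate share could be computed from a list of requested addresses. It assumes "duplicate" means every request whose address appears at least twice; the normalization is crude, and `duplicate_share` is a hypothetical helper, not the method actually used on the export.

```python
from collections import Counter
from typing import Iterable


def duplicate_share(urls: Iterable[str]) -> float:
    """Fraction of requests whose address was requested at least twice."""
    # Crude normalization (assumption): trim whitespace and trailing slashes.
    normalized = [u.strip().rstrip("/") for u in urls]
    if not normalized:
        return 0.0
    counts = Counter(normalized)
    # Count every request belonging to an address seen two or more times.
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(normalized)
```

Counting only the extra requests (n - 1 per repeated address) would give a lower figure, so the chosen definition should be stated alongside the number.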