w3c / tdm-reservation-protocol

Repository of the Text and Data Mining Reservation Protocol Community Group
https://www.w3.org/community/tdmrep/

A robot is a robot, whatever its purpose #30

Open maathieu opened 1 year ago

maathieu commented 1 year ago

Hello,

I read the proposal at robots.md and I must respectfully disagree that webmasters need a new standard to handle robots designed to gather data for machine learning.

From a webmaster's perspective, a robot is a robot is a robot. Whether the pages are scraped for indexing by a search engine or for feeding AI algorithms, the end result is the same: expenses caused by traffic and potential slowdowns for human visitors.

Please keep in mind that, when you are building a scraper, you are using other people's resources: build your scraper in a way that it respects existing standards webmasters have put in place to protect their bandwidth and customers.

I would suggest that, beyond quoting the usual XKCD cartoon about standards, creating a new standard to replace the existing one is a sort of strawman argument. Claiming that the robot is somehow not a robot because the end use is different makes little sense.

That said, I support the idea of an extra file to disallow scraping specifically for AI training purposes, but I do not think this should replace robots.txt, only augment it.

Cheers,

Mathieu

llemeurfr commented 11 months ago

The page you're pointing at is a description of what robots.txt and robots meta directives offer.

Content providers (and therefore webmasters) need a generic way to express that NO robot can fetch content for TDM purposes. This is what the EU law allows content providers to do, and this is what they will do.

Using robots.txt requires webmasters to list EVERY robot they want to exclude, individually. This is not maintainable, and this is why another approach is needed.
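
To illustrate the maintenance problem, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URL are assumptions made up for this illustration; GPTBot and CCBot stand in for known crawlers, and NewAIBot is a hypothetical name for a crawler the webmaster has not heard of yet. A wildcard rule could block every robot at once, but it would block search engines too, so excluding only TDM robots means naming each one.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block two named crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str = "https://example.com/article.html") -> bool:
    """Return True if robots.txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# NewAIBot is hypothetical: it is not listed, so the wildcard rule applies to it.
results = {
    agent: allowed(agent)
    for agent in ("GPTBot", "CCBot", "Googlebot", "NewAIBot")
}

for agent, ok in results.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

The listed crawlers are blocked, but the unlisted one is treated like any search engine until the webmaster learns its name and adds another entry.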

maathieu commented 11 months ago

As long as there is legal uncertainty about whether openly available, yet copyright-protected, content from the internet may be used to feed data models, indiscriminate scraping by AI crawlers leaves AI companies (and possibly the users of the AI databases) liable for copyright violation. What is required is an opt-in mechanism, where the webmaster actually allows the scraping for machine learning purposes. If it's opt-out, webmasters can and will litigate by just saying that they were not aware that they had to do something special to see their copyright respected.

Speaking of which, one of the principles of copyright is that you do not need to perform any specific action to assert it. By default, you are fully protected, and only if you opt in to giving more rights to people accessing your copyrighted work (for example through a Creative Commons or copyleft license) are such rights granted.

Therefore, your opt-out approach seems to me antithetical to the way copyright works, and if implemented, it will be challenged - I would guess, with success, due to the above.

Maybe the most pressing task is to sort out the core issue: is using copyrighted datasets for AI training without compensating the creators fair use?
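
The practical difference between the two models is what a crawler may assume when a site has declared nothing at all. A minimal sketch of that decision follows; the names Regime and may_mine are hypothetical and do not come from any specification.

```python
from enum import Enum
from typing import Optional


class Regime(Enum):
    OPT_IN = "opt-in"    # mining allowed only with explicit permission
    OPT_OUT = "opt-out"  # mining allowed unless rights are reserved


def may_mine(declared: Optional[bool], regime: Regime) -> bool:
    """Decide whether a crawler may mine a site's content.

    declared is True if the site explicitly allows mining, False if it
    explicitly reserves its rights, and None if it has said nothing.
    """
    if declared is not None:
        # An explicit declaration wins under either regime.
        return declared
    # With no declaration, the regime's default decides.
    return regime is Regime.OPT_OUT


# A site that has never heard of the mechanism declares nothing.
print(may_mine(None, Regime.OPT_IN))   # False
print(may_mine(None, Regime.OPT_OUT))  # True
```

Only the silent case differs: under opt-out, a webmaster who is unaware of the mechanism is treated as having consented.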

Then once this has been settled, we can consider whether opt-in or opt-out is a better option...

But as long as this has not been settled, to give AI dataset creators some (legal) peace of mind, I would suggest focusing on an opt-in mechanism.

llemeurfr commented 11 months ago

With all due respect, the fact that opt-out seems to you "antithetical to the way copyright works" is irrelevant to our work. There is a law in every European country, and this law stipulates that the only way to avoid free scraping is an opt-out. Any complaint about this law must go to the European Parliament, not this working group. The job of this working group is to specify and promote a way to implement this law in a plausible technical way.

Re. "fair use": this does not exist in Europe. EU laws are only based on copyright (more specifically "droit d'auteur" in France). When it is decided (in US courts) whether fair use applies to content scraping for AI, and whether content publishers can adopt an opt-in or opt-out position, US publishers will face the questions we have to solve now.