rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

Implement the W3C TDM Reservation Protocol and enable a more standard opt-out mechanism #308

Open · llemeurfr opened this issue 1 year ago

llemeurfr commented 1 year ago

The solution the author settled on after a heated discussion in #293 was to support an opt-out expressed in HTTP headers, via the well-known values "noindex" and "noimageindex" plus the ad-hoc values "noai" and "noimageai".
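(For illustration, a minimal sketch of what such a header-based check can look like; the X-Robots-Tag header is where these tokens are carried, and the helper name is invented here, not taken from the actual img2dataset code.)

```python
import urllib.request

# The four opt-out tokens adopted after the discussion in #293.
OPT_OUT_TOKENS = {"noindex", "noimageindex", "noai", "noimageai"}

def is_opted_out(image_url: str) -> bool:
    """True if the image's X-Robots-Tag response header carries an opt-out token."""
    request = urllib.request.Request(image_url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        header = response.headers.get("X-Robots-Tag", "")
    tokens = {token.strip().lower() for token in header.split(",")}
    return not tokens.isdisjoint(OPT_OUT_TOKENS)
```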

This is already a good move: in Europe, any crawler associated with TDM and AI technologies MUST support opt-out, as stipulated by the European DSM Directive. You'll find more information about that legal requirement here. Because this software gathers images for AI training, it should not include in its dataset images whose authors have decided to opt out.

But "noai" and "noimageia" are not well known tokens (even if you're not alone trying them), there is nothing standard in them so far. And robots.txt is not only about http headers. Directives can be in a file stored at the root of the web site (and as html meta, but this is not interesting here). Therefore your move does not really help the community establishing trusted relationships between AI solutions and content providers (which is a requirement if you want content providers to see AI actors as partners, not enemies).

For this reason, a W3C Community Group made up of content providers and TDM actors decided two years ago to create an open specification, and released it as TDMRep (the TDM Reservation Protocol). The home page of the group is here; it counts 42 participants.

For those wondering, this specification also covers AI solutions. And this group didn't use robots.txt for clear reasons.

Adding support for a new HTTP header, called "tdm-reservation", and filtering out images when its value is 1 (a number), is a no-brainer. Adding support for a JSON file named tdmrep.json, hosted in the /.well-known directory of the web server on which the image is stored, is a bit more complex, but still easy in Python (the processing is analogous to that of a robots.txt file); and it is mandatory, even if less performant.
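(A rough sketch of both checks, under the assumptions stated in the TDMRep draft: a tdm-reservation response header, and a JSON array of rules with "location" and "tdm-reservation" fields at /.well-known/tdmrep.json. The function names are illustrative, and the location matching is simplified to a plain path-prefix test.)

```python
import json
import urllib.parse
import urllib.request

def tdm_reserved_by_header(headers) -> bool:
    """The cheap check: a 'tdm-reservation: 1' response header reserves TDM rights."""
    return headers.get("tdm-reservation", "").strip() == "1"

def tdm_reserved_by_well_known(image_url: str) -> bool:
    """The file-based check: fetch tdmrep.json from the image's host and
    match the image path against its rules."""
    parts = urllib.parse.urlsplit(image_url)
    well_known_url = f"{parts.scheme}://{parts.netloc}/.well-known/tdmrep.json"
    try:
        with urllib.request.urlopen(well_known_url, timeout=10) as response:
            rules = json.load(response)
    except Exception:
        return False  # no file (e.g. a 404): no reservation expressed this way
    for rule in rules:
        # Simplified matching: the draft's 'location' patterns are treated
        # here as plain path prefixes.
        if parts.path.startswith(rule.get("location", "")):
            return rule.get("tdm-reservation") == 1
    return False
```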

maathieu commented 1 year ago

Hi @llemeurfr, the solution proposed in your documents is not an established standard. I would not suggest going for it before it becomes official, as this would require a lot of time and expense from the developer of this repository, whereas robots.txt is already an established practice that covers the scenario of limiting scraping to authorized sections of a web site.

llemeurfr commented 1 year ago

Hi @maathieu, to what extent is X-Robots-Tag: noai more standard than tdm-reservation: 1? In both cases, the web server must be configured to emit the header.

The evolution of is_disallowed() is far from complex. Would it help if contributors proposed a PR?

The main complexity will in any case be to handle a robots.txt or tdmrep.json file.

Note: the W3C Community Group is waiting for more feedback from implementers before going through the W3C Recommendation route.

rom1504 commented 1 year ago

Adding a new header to the set of disallowed ones seems fine indeed. Note that img2dataset did not introduce the noai and noimageai tags; they were suggested by DeviantArt, see #218. It would be great to see a set of opt-out headers standardized instead of each website creating their own.

As for tdmrep.json, I am wondering whether you considered specifying in the response headers that this file exists? If that were the case, it would be possible to check the file only for the (initially small) minority of websites providing it.
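(A sketch of what that suggestion could look like; TDMRep defines no such advertisement header, so the "tdm-rep" header name below is purely hypothetical.)

```python
def should_fetch_tdmrep(headers) -> bool:
    """Hypothetical: fetch /.well-known/tdmrep.json only when the image response
    advertises it, so only the (initially small) minority of hosts publishing
    the file pay for the extra request."""
    return headers.get("tdm-rep", "").strip() == "1"
```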

llemeurfr commented 1 year ago

> It would be great to see a set of opt-out headers standardized instead of each website creating their own.

YES, this is why we would all like to have a worldwide standard for TDM and AI opt-out (and because a raw opt-out is not great for the future of AI, we're trying to allow deals to be made).

> As for tdmrep.json, I am wondering whether you considered specifying in the response headers that this file exists?

No, we didn't consider that. A simple HEAD request on tdmrep.json that responds with a 404 is not a huge performance hit. PS: you will have to parse robots.txt if you want to check its rules, which takes much more time.
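(For illustration, what such a probe can look like in Python: a missing file costs a single HEAD round trip. The helper name is invented.)

```python
import urllib.error
import urllib.request

def has_tdmrep_file(origin: str) -> bool:
    """HEAD the well-known location; a 404 simply means no TDMRep file."""
    request = urllib.request.Request(
        f"{origin}/.well-known/tdmrep.json", method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False
```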

DavidJCobb commented 1 year ago

> To what extent is X-Robots-Tag: noai more standard than tdm-reservation: 1?

Neither of them is "more standard." TDMRep is a draft specification; it (by necessity) explicitly states that it's not a W3C standard, and it only has one listed editor: you. I counted participants in the mailing list by hand and saw 20 unique names, which isn't even enough for the "There are dozens of us! Dozens!" meme. It could be a very good idea for a standard; it could be very well designed; but right now it isn't standard, isn't widely implemented, and wouldn't address any of the concerns people have with this repo.

Using your not-a-standard wouldn't make the maintainer of this project any less inconsiderate or any less of a disingenuous clown.

maathieu commented 1 year ago

@DavidJCobb, good point; that was a nice strawman argument from @llemeurfr. robots.txt does not require any server tuning: just place the file in the root directory. The scraper downloads the file once, then compares every URL it wishes to scrape against the rules in robots.txt. If there is no match, it can download; if there is a match, it must not. There is no need to invent something new and convoluted to replace this established practice. A scraper is no different from a search engine spider, and Google, which has also been doing "AI" for decades, has respected robots.txt on webmasters' websites all along.

Please be good netizens. It's not because AI is the shiny new thing that all established practices must be abandoned. And kindly remember that servers are physical resources with associated costs: whatever the purpose of the scraping, you are doing it thanks to the goodwill of website owners and server administrators. Do it responsibly.

Edit: there is apparently work in progress to include robots.txt support in the scraper: https://github.com/rom1504/img2dataset/pull/302 . Looking forward :-)
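(For reference, the download-once-then-gate flow described above fits in a few lines with Python's standard library; this is only a sketch, and the user-agent string is illustrative.)

```python
from urllib import robotparser

def allowed_urls(host: str, urls: list[str], agent: str = "img2dataset") -> list[str]:
    """Download the host's robots.txt once, then keep only the URLs
    its rules allow for this user agent."""
    parser = robotparser.RobotFileParser(f"https://{host}/robots.txt")
    parser.read()
    return [url for url in urls if parser.can_fetch(agent, url)]
```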

Padge91 commented 1 year ago

We (Spawning) are maintaining datadiligence, a package responsible for filtering opt-outs when using img2dataset and similar tools. datadiligence currently supports the existing noai HTTP headers, the proposed TDMRep HTTP headers, and the Spawning API, with additional methods planned. While the package doesn't directly support the tdmrep.json or ai.txt files just yet, the Spawning API does.

> ...I would not suggest going for it before it becomes official, as this would require a lot of time and expense from the developer of this repository

We made #312 to replace the opt-out logic in img2dataset with calls to the datadiligence package to help keep the maintainers of this repository focused. Accepting the changes in the PR would resolve this issue.
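(For a sense of what that integration looks like, a sketch assuming the is_allowed-style entry point described in the datadiligence README; the exact keyword arguments may differ.)

```python
import datadiligence as dd

image_urls = [
    "https://example.com/images/a.jpg",
    "https://example.com/images/b.jpg",
]

# Assumed API: one boolean per URL, False where any supported opt-out
# (noai headers, TDMRep headers, the Spawning API, ...) applies.
allowed = dd.is_allowed(urls=image_urls)
to_download = [url for url, ok in zip(image_urls, allowed) if ok]
```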