Closed llemeurfr closed 1 month ago
@llemeurfr Thanks for sharing this! We’ll take it into consideration and might add it as an extra feature, giving the option to enable or disable it. Other regions may have different cases, so we’ll look into that as well. Thanks again for bringing this up!
I know libertarians will not be happy but ...
In Europe, scraping websites for the purpose of text and data mining (TDM) and LLM training is legal (this is the good news), unless (this is the bad news) the website carries a machine-readable signal stating an opt-out from this exception to copyright. This means that if an opt-out is set but a scraper still fetches copyrighted content, the user of the scraper is on the unsafe side of EU law.
There are different ways to express such an opt-out signal: robots.txt is one (with its limitations) and the TDM Reservation Protocol (TDMRep) is another. Whether or not a machine-readable opt-out is an official standard, published by a well-known or an obscure entity, does not matter.
With TDMRep, the opt-out signal can be placed in a dedicated file on the web server (similar to robots.txt but specialised), in HTTP response headers, or in HTML pages.
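To make the last two options concrete, here is a minimal sketch of how a scraper could detect the signal on a fetched page. It assumes, based on my reading of the TDMRep spec, that the header and the meta tag are both named `tdm-reservation` and that a value of `1` means rights are reserved. The function name `tdm_reserved` and the "header wins over meta tag" precedence are illustrative assumptions, and the server-side file variant is not covered here.

```python
from html.parser import HTMLParser


class _TdmMetaParser(HTMLParser):
    """Records the content of <meta name="tdm-reservation"> if present."""

    def __init__(self):
        super().__init__()
        self.reservation = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "tdm-reservation":
            self.reservation = attr.get("content", "").strip()


def tdm_reserved(headers, html=""):
    """Return True if the response carries a TDM opt-out signal.

    `headers` is a plain dict of HTTP response headers, `html` the page body.
    Assumption: a header, when present, takes precedence over the meta tag.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    if "tdm-reservation" in lowered:
        return lowered["tdm-reservation"] == "1"
    parser = _TdmMetaParser()
    parser.feed(html)
    return parser.reservation == "1"
```

A crawler could call `tdm_reserved(response_headers, body)` right after fetching a page and drop the content when it returns `True`.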
TDMRep is now used by many news websites in Europe. Offering TDMRep support as a configurable option would be useful for those users who want to stay on the safe side of EU laws.
NB: TDMRep rules can be checked after the robots.txt filter has been applied, i.e. only for URLs that robots.txt already allows.
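That ordering could look roughly like the sketch below, which uses the standard library's `urllib.robotparser` for the first filter. The function name `may_mine` and its parameters are hypothetical, and the TDMRep step is reduced to the `tdm-reservation` response header (assumed to be `1` when rights are reserved) to keep the example short.

```python
from urllib.robotparser import RobotFileParser


def may_mine(url, user_agent, robots_txt, response_headers):
    """Apply robots.txt first, then the TDMRep header, as two successive filters."""
    # Step 1: the usual robots.txt filter.
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    if not robots.can_fetch(user_agent, url):
        return False

    # Step 2: only reached for URLs robots.txt allows.
    # Assumption: a missing header means no reservation was expressed.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return lowered.get("tdm-reservation", "0").strip() != "1"
```

With a configurable option, step 2 would simply be skipped when TDMRep support is disabled.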