w3c / tdm-reservation-protocol

Repository of the Text and Data Mining Reservation Protocol Community Group
https://www.w3.org/community/tdmrep/

TDM reservation for source code #32

Open Quintus opened 1 year ago

Quintus commented 1 year ago

Dear all,

there is an ongoing debate in the U.S. about the legality of mining open-source software code (see here). From an EU perspective, the TDM exception also covers computer programs, so it should be possible to make a TDM reservation on open-source computer programs to prevent this kind of mining.

The current proposal, however, offers no manageable way for an open-source developer to make a TDM reservation on their source code, unless they host all of the code themselves. If, as is customary, the source code is hosted by a service like GitHub, the developer has no control over the web server and thus can neither place a specific file at a specific location nor modify the HTTP headers sent. And since source code is not HTML, embedding HTML metadata does not work either. What is required, I think, is some kind of magic comment in the source code itself, or the presence of specific files in the repository.
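For context, the reservation mechanisms TDMRep currently defines all assume control of the origin server, which is exactly what a developer hosting on GitHub lacks. A sketch of the three forms as I read the draft spec (the URLs and policy values are illustrative, not normative):

```
# 1. Origin-wide JSON file at a fixed well-known path
#    (requires the ability to publish files on the origin):
#    https://example.com/.well-known/tdmrep.json
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/policies/tdm-policy.json"
  }
]

# 2. Per-resource HTTP response headers
#    (requires control over the web server configuration):
#    tdm-reservation: 1
#    tdm-policy: https://example.com/policies/tdm-policy.json

# 3. HTML metadata (not applicable to raw source-code files):
#    <meta name="tdm-reservation" content="1">
```

None of the three can be set by a repository owner on github.com, which is the gap this issue describes.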

This is probably related to #31, but I found it significant enough to open a separate ticket for it.

Thanks for considering this aspect.

maathieu commented 1 year ago

In my opinion this "magic comment" is the software license. We could very well conceive of an open-source license that forbids using the covered source code as training data for an LLM, but for now you could simply add a plain text file to your repositories saying "it is forbidden to use the code of this repository to train AI models."
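To make the suggestion concrete, here is a minimal sketch of the check a well-behaved scraper could run before ingesting a repository, assuming the convention is simply "the presence of a declaratory text file signals the reservation." The file names are hypothetical; no spec defines them.

```python
from pathlib import Path

# Hypothetical file names a scraper might look for before ingesting a
# repository; illustrative only, not defined by TDMRep or any license.
RESERVATION_FILES = ("TDM-RESERVATION.txt", "NO-AI-TRAINING.txt")

def has_tdm_reservation(repo_root: str) -> bool:
    """Return True if the repository root contains a reservation file.

    A crawler honoring this convention would skip such repositories
    when building a training dataset.
    """
    root = Path(repo_root)
    return any((root / name).is_file() for name in RESERVATION_FILES)
```

The appeal of this approach is that it needs no cooperation from the hosting platform: the file travels with the repository in every clone and fork.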

What people in the AI field seem to forget is that material found online is not in the public domain. Even when no license is attached, you can't just do whatever you want with what you find. Putting a website (or a source-code repository) online is done with the implicit intent of letting other human users read and access it for free. Robots such as the Google crawler are tolerated because they provide added value to the webmaster: they bring traffic to the site, or people to the code repository. AI scrapers provide zero added value; they leech data and build a model that becomes a walled garden, driving traffic to the original sources down. I would be curious to see figures on traffic volume to Stack Overflow since ChatGPT started operating!

Ultimately, the commonly heard "fair use" argument is going to be challenged, and the idea that one should have to "opt out" of AI data mining is definitely not going to be accepted, at least in Europe. So I would not worry too much about software mechanisms, TDM reservations, or the like...