Open jyasskin opened 5 months ago
There are also some related sections of the patent policy such as 6.1 Disclosure Obligations and 6.3 Disclosure Requests. But I agree the situation could be improved along the lines you suggest.
For Invited Experts, there's https://www.w3.org/invited-experts/agreement-2023/#L118, which IEs are required to agree to before joining the group. It's not explicit about LLMs, but it does cover copyright (and https://www.w3.org/invited-experts/agreement-2023/#L131, together with the patent policy, covers patents).
I haven't been able to locate an equivalent for W3C Member companies joining the group and/or appointing individuals to it.
This came up in a Slack discussion about large-language-model/AI-generated text, and while I don't think we need to mention LLMs in any normative changes, I do think there is a problem to fix.
TL;DR: I can't find text saying that WG participants need to be able to grant a copyright license to any text they add to specs, and that they shouldn't add ideas that they know are patented by someone else.
The CG CLA includes a copyright license grant covering contributions.
The Member agreement discusses ownership of new ideas, but not contributions of existing ones. The "Commitments for joining the XYZ Working Group" agree to follow https://www.w3.org/Consortium/Patent-Policy-20200915/#sec-W3C-RF-license but don't mention copyright.
On copyright, I think there should be some instruction not to add text that someone else owns without being able to license it. The CG CLA covers this, but I couldn't find an agreement for WGs that says the same thing.
On patents, it's hard to know whether a contribution infringes a patent the contributor isn't aware of, but it would probably be good to say not to add ideas the contributor knows are patented and doesn't have the right to license.
Back to the LLM aspect: because contributors might be tempted to use an LLM to generate part of a spec, and because LLMs can reproduce copies of their training data that would get past a normal review for spec correctness, it might be good to include some explanatory text warning about this. I think the normative changes I sketched above imply that one shouldn't generate a copyrightable amount of text from an LLM trained on copyrighted data, but that won't be obvious to everyone.