Open jyasskin opened 5 months ago
There are also some related sections of the patent policy such as 6.1 Disclosure Obligations and 6.3 Disclosure Requests. But I agree the situation could be improved along the lines you suggest.
For Invited Experts, there's https://www.w3.org/invited-experts/agreement-2023/#L118, which IEs are required to agree to before joining the group. It's not explicit about LLMs, but it does cover copyright (and https://www.w3.org/invited-experts/agreement-2023/#L131, together with the patent policy, covers patents).
I haven't been able to locate an equivalent for W3C Member companies joining the group and/or appointing individuals to it.
This came up in a Slack discussion about large-language-model/AI-generated text, and while I don't think we need to mention LLMs in any normative changes, I do think there is a problem to fix.
TL;DR: I can't find text saying that WG participants need to be able to grant a copyright license to any text they add to specs, and that they shouldn't add ideas that they know are patented by someone else.
The CG CLA includes a copyright license grant covering contributions.
The Member agreement discusses ownership of new ideas, but not contributions of existing ones. The "Commitments for joining the XYZ Working Group" agree to follow https://www.w3.org/Consortium/Patent-Policy-20200915/#sec-W3C-RF-license but don't mention copyright.
On copyright, I think there should be some instruction not to add text that someone else owns without being able to license it. The CG CLA covers this, but I couldn't find an agreement for WGs that says the same thing.
On patents, it's hard to know whether a contribution infringes a patent the contributor isn't aware of, but it would probably be good to say not to add ideas the contributor knows are patented and doesn't have the right to license.
Back to the LLM aspect: because contributors might be tempted to use an LLM to generate part of a spec, and because LLMs can reproduce copies of their training data that would get past a normal review for spec correctness, it might be good to include some explanatory text warning about this. I think the normative changes I sketched above imply that one shouldn't generate a copyrightable amount of text from an LLM trained on copyrighted data, but that won't be obvious to everyone.