spdx / spdx-spec

The SPDX specification in Markdown and HTML formats.
https://spdx.github.io/spdx-spec/

Proposal: Add some relationships, such as “trainedOn” or “finetuningOn”, to indicate the relationship between model and dataset #982

Open gmscofield opened 4 months ago

gmscofield commented 4 months ago

The specification currently seems to lack properties describing the relationship between models and datasets. The only related property I can find is "informationAboutTraining". There are two reasons to strengthen the relationships between models and datasets:

  1. Data poisoning has become a common way to attack LLMs. Making clear which datasets an LLM was trained or fine-tuned on would help AI Profile users defend against such attacks more effectively. At the very least, they could look at how other models trained on the same datasets defend against attacks.
  2. The Dataset Profile and the AI Profile are separate profiles. Adding such relationships to SPDX would make the connection between them clearer. (At the moment the two profiles look as if they have nothing to do with each other.)

So I think adding relationships between models and datasets to SPDX is necessary. Relationships between datasets, such as sampledFrom, would also be valuable.

bact commented 4 months ago

RelationshipType already has trainedOn.

See: https://spdx.github.io/spdx-spec/v3.0/model/Core/Vocabularies/RelationshipType/
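
For illustration, here is a minimal sketch of how a trainedOn relationship might be expressed and queried. The field names (`type`, `spdxId`, `from`, `to`, `relationshipType`) follow my reading of the SPDX 3.0 JSON-LD serialization and should be checked against the spec. The `example.org` identifiers and both helper functions are hypothetical, and required properties such as `creationInfo` are omitted for brevity.

```python
import json

# Hypothetical identifiers, used only for this illustration.
MODEL_A = "https://example.org/spdx/model-a"
MODEL_B = "https://example.org/spdx/model-b"
DATASET_X = "https://example.org/spdx/dataset-x"
DATASET_Y = "https://example.org/spdx/dataset-y"


def trained_on(rel_id, model_id, dataset_ids):
    """Build a Relationship element: `model_id` was trained on `dataset_ids`.

    Assumption: keys mirror the SPDX 3.0 JSON-LD serialization;
    creationInfo and other required properties are left out.
    """
    return {
        "type": "Relationship",
        "spdxId": rel_id,
        "from": model_id,
        "to": list(dataset_ids),
        "relationshipType": "trainedOn",
    }


def models_trained_on(elements, dataset_id):
    """Return the sorted IDs of models that have a trainedOn link to `dataset_id`."""
    return sorted(
        e["from"]
        for e in elements
        if e.get("type") == "Relationship"
        and e.get("relationshipType") == "trainedOn"
        and dataset_id in e.get("to", [])
    )


elements = [
    trained_on("https://example.org/spdx/rel-1", MODEL_A, [DATASET_X]),
    trained_on("https://example.org/spdx/rel-2", MODEL_B, [DATASET_X, DATASET_Y]),
]

print(json.dumps(elements[0], indent=2))
print(models_trained_on(elements, DATASET_X))
```

A lookup like `models_trained_on` is the kind of query the issue asks for: given a dataset suspected of being poisoned, list every model that declares it was trained on that dataset.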

Discussion about adding finetunedOn is ongoing for 3.1.

See meeting 2024-06-05: https://github.com/spdx/meetings/blob/main/ai/2024/2024-06-05.md

We hold an AI and Dataset Profiles meeting every Wednesday. You are welcome to join. See the time and meeting link here: https://github.com/spdx/meetings/