schemaorg / schemaorg

Schema.org - schemas and supporting software
https://schema.org/
Apache License 2.0

Schema.org should have at least basic expressivity helping sites declare generative AI technologies were used #3391

Open danbri opened 9 months ago

danbri commented 9 months ago

We should initiate activities in this area, move quickly to reflect the speed with which AI technologies are developing and being deployed, and do so incrementally and in a way that can be composed usefully with other aspects of schema.org.

This proposal has several sources. I've been talking regularly with members of the fact-checking community since we added ClaimReview and related terms. While we developed MediaReview, it doesn't go very deep into the fast-growing generative AI world. Meanwhile, Schema.org also added https://schema.org/acquireLicensePage a few years ago, which enables a landing page about a media object to provide helpful metadata pointing to ways of licensing that content. Google (see their docs) and others can use both in-page schema markup and within-the-image-file metadata, which most often uses IPTC fields. In practice it has proved useful for content publishers to be able to add markup via HTML rather than by reprocessing thousands of images. This is likely to be true for genAI content publishing sites as well.
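To make the in-page option concrete, here is a minimal sketch of generating JSON-LD for an image landing page that uses `license` and `acquireLicensePage`. The helper names and the example.com URLs are hypothetical, chosen only for illustration; the thread does not prescribe any tooling.

```python
import json


def image_license_jsonld(content_url, license_url, acquire_url):
    """Build a JSON-LD dict describing an image and where to license it."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "license": license_url,
        "acquireLicensePage": acquire_url,
    }


def as_script_tag(data):
    """Wrap JSON-LD in a script element suitable for embedding in HTML."""
    # Escape "<" so that no string value can close the script element early.
    body = json.dumps(data, indent=2).replace("<", "\\u003c")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical URLs, for illustration only.
markup = image_license_jsonld(
    "https://example.com/images/photo-1.jpg",
    "https://example.com/license-terms",
    "https://example.com/licensing/photo-1",
)
print(as_script_tag(markup))
```

A publisher could emit this from a page template, leaving the image files themselves untouched.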

Alongside all this, the most basic metadata task, describing who made something, and how, remains an undersolved problem. It should be easier (and/or better documented) to do things like list authors, list them in order (this is materially important to many in academia), and list their roles from a wider palette of known role types. Last time we touched the Role aspect of the problem we introduced a very general mechanism which has proved both unpopular and largely (but not entirely) unused. It will take some careful work to untangle that. We should also not rush a design that says "made with AI" without thinking through the different classes of tool that are appearing (accessibility and inclusiveness tools, simple visual filters, and prompt/LLM-based mechanisms). We will also need to distinguish the case where entire works are heavily AI-powered (e.g. a fake essay) from those which feature items (like photos) in a less controversial manner, or which discuss them critically, e.g. a fact check.

Some, but not all, of these issues are being discussed and implemented elsewhere.

As a first step, we can reflect into the Schema.org namespace a semi-external extension drawing from the relatively short list of codes assigned at IPTC. While they don't address all the issues above, the ability to use simple (as simple as we can make it) Schema markup as an alternative to within-the-image-file embedded metadata is of immediate value. I'll file a separate ticket for that work. This can be done primarily as an Enumeration type with the potential for other approaches to use other subtypes alongside it in the hierarchy. We'll want a basic way of linking it into a description of MediaObject too.
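As a rough sketch of what that first step might look like from a publisher's side, the snippet below models a few IPTC Digital Source Type codes as an enumeration and attaches one to an ImageObject description. The property name `digitalSourceType` is an assumption borrowed from the IPTC field name, since this issue does not name the linking term; the class and function names are likewise hypothetical, and only a subset of the IPTC codes is listed.

```python
import json
from enum import Enum

# Base URI of the IPTC Digital Source Type controlled vocabulary.
IPTC_BASE = "http://cv.iptc.org/newscodes/digitalsourcetype/"

# Assumed property name; the issue does not say what the term will be called.
LINK_PROPERTY = "digitalSourceType"


class DigitalSource(Enum):
    """A subset of the IPTC Digital Source Type codes (not the full list)."""

    DIGITAL_CAPTURE = "digitalCapture"
    TRAINED_ALGORITHMIC_MEDIA = "trainedAlgorithmicMedia"
    COMPOSITE_WITH_TRAINED_ALGORITHMIC_MEDIA = "compositeWithTrainedAlgorithmicMedia"
    ALGORITHMIC_MEDIA = "algorithmicMedia"

    @property
    def uri(self):
        """Full IPTC URI for this code."""
        return IPTC_BASE + self.value


def describe_media(content_url, source):
    """Build JSON-LD for a media object tagged with a digital source type."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        LINK_PROPERTY: source.uri,
    }


# Hypothetical image URL, for illustration only.
doc = describe_media(
    "https://example.com/images/generated-1.png",
    DigitalSource.TRAINED_ALGORITHMIC_MEDIA,
)
print(json.dumps(doc, indent=2))
```

Keeping the codes in an enumeration mirrors the Enumeration-type approach described above and leaves room for other subtypes to sit alongside it later.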

robrwo commented 8 months ago

See #3397

AdamGal18 commented 7 months ago

I agree that this is a great idea to work on. We need to know more about how the content was created. This argument proves that we also need to deal with the current anonymous review schema issue happening across search. It is a huge problem plaguing Google search, especially recipe search. Anonymous reviews are untraceable at the moment since they don't require written context. Recipe search is getting torn apart. It is causing complete havoc. Before we add more definitions for AI we need to address untraceable reviews not requiring written context or at least an author name. See #3401

github-actions[bot] commented 4 months ago

This issue is being nudged due to inactivity.