Closed katesoule closed 6 months ago
Should this also include the license the content was made available under where relevant? See also #255. It's a bit unclear at the moment - for example the CONTRIBUTIONS guide states "DO NOT contribute copyrighted content or content coming from another system.".
No processing needed in the CLI, but we need users to provide a link to any third party source they might have leveraged when issuing a PR to the taxonomy repo (knowledge or skill contribution). Once we get this in place, we can remove this restriction in the contribution guidelines, allowing people to leverage content from other sources so long as they are permissible licensed and they share a link. I'm working with legal on the exact list of license and how this should be articulated on the Contributions page, will share that once its ready.
Adding a few more details from legal:
This field should sit at the same level as the question, answer, and context fields within each example, and should be required for everyone to fill out. If there is any third party content that is leveraged for the YAML (e.g. if all the QnAs were directly taken from a website full of puns, or if the context was copy and pasted from wikipedia), then the link of where that content was obtained should be provided in that field.
If the entire QnA example was created by hand (e.g. I made up a bunch of puns on the top of my head), then the contributor should write something akin to "Self-Created".
Separately we will need to update the contributor guidance to make sure we have instructions on how to fill out this field, and to only include third party content that is licensed according to a list of license recommendations (still being finalized).
is this just for transparency and traceability but not for any actual use of that field? if it's a URL or something dynamic as the source, how do we make sure the content of the source is unchanged? I'm a bit concerned about a case where at the time of filling the YAML, the source URL has no issue but later on the content of the source URL is modified and potentially contains malicious materials. people might go visiting the link after seeing it as the source filed in the YAML
This is not intended to be used by any part of the pipeline, or for reviewers to even click on it. But it is required we have some sort of attribution for third party content if we want to allow people to contribute content that leverages any of the most commonly used licenses (CC-BY-SA, Apache 2.0, MIT, etc. all require attribution).
I defer to you and the team on the security risk this may introduce. Are there other open source projects that have similar contributions styles that we can learn from?
This is not intended to be used by any part of the pipeline, or for reviewers to even click on it. But it is required we have some sort of attribution for third party content if we want to allow people to contribute content that leverages any of the most commonly used licenses (CC-BY-SA, Apache 2.0, MIT, etc. all require attribution).
Thanks for clarifying!
I defer to you and the team on the security risk this may introduce. Are there other open source projects that have similar contributions styles that we can learn from?
Me and the team are not legal experts so I don't think we should be making decisions here.
Contributing documents is not common in open-source projects but for codes there is stuff like DCO, which we are currently enforcing in the cli
repo.
... Legal is advising we ask contributors to provide a link that shows where they got any external document used for the skill's context or knowledge source ...
It looks like you have consulted legal on this. Were there alternative options discussed or so? I know little about what has been discussed but perhaps you could share more information about what has been discussed and how this decision was made.
DCO is also being required in the taxonomy repo, but unfortunately doesn't count as attribution. When you attribute something you have to cite where you found it, and we need to provide traceability so that you can go back and inspect whether any prior modifications were made to the content that was contributed (e.g. there is a HotPot dataset we are using for the context field in our seed data for writing skills, this dataset is actually a modification of wikipedia data, the license for this dataset requires that we can back and trace those modifications).
Legal didn't provide any alternatives for how we can attribute these types of sources, but I can ask and see if they have other ideas beyond a link (although I'm not hopeful). I'm curious if @lhawthorn has suggestions, as she was one of the first people to have suggest we should be collecting links for all the contributed sources.
The other thought I had is we could put in our contributor policies to only leverage third party content from other trusted and goverened sites used to share open source data and other open source content, like Hugging Face, wikipedia, or .gov websites. Although this will restrict some of the creativity of contributions.
We could follow the same style as Wikipedia references, namely having the submitter cite original source on the web along with a link to the content preserved at time of access via Internet Archive's Wayack Machine (preserving URLs via https://archive.is/).
To be entirely clear, here is an example using the Wikipedia page for Red Hat:
https://en.wikipedia.org/wiki/Red_Hat#References
For reference 2 in this list, the content is attributed to a page that no longer exists (accessing yields 404 error): https://www.redhat.com/en/about/company/management/paul-cormier
But one can still access the content as it appeared on the date of original access via: https://web.archive.org/web/20200406151430/https://www.redhat.com/en/about/company/management/paul-cormier
While requesting that this content be included along with a PR is an extra step for the contributor, it's a simple step and in keeping with project norms for open content projects.
We may wish to consider supporting Internet Archive financially with a tax deductible donation in recognition of the valuable service the organization would provide to us as part of running this project. I would certainly recommend doing so, though of course it is not required.
But it is required we have some sort of attribution for third party content if we want to allow people to contribute content that leverages any of the most commonly used licenses (CC-BY-SA, Apache 2.0, MIT, etc. all require attribution).
Whether it was intended to be allowed or not, this has in fact already happened. There are multiple examples of YAML files in git tree which have directly copied text from wikipedia and thus are CC-BY-SA. See #255 for the examples.
is this just for transparency and traceability but not for any actual use of that field? if it's a URL or something dynamic as the source, how do we make sure the content of the source is unchanged? I'm a bit concerned about a case where at the time of filling the YAML, the source URL has no issue but later on the content of the source URL is modified and potentially contains malicious materials. people might go visiting the link after seeing it as the source filed in the YAML
For Wikipedia (which accounts for almost all the 3rd party content currently in the tree) it's possible to link to a specific revision of the article. This is necessary for a number of the existing cases since they contain content taken from 3-4 year old articles that have since changed on the main page. For e.g. https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459 - this content is present in the key_points skill, but no longer appears at https://en.wikipedia.org/wiki/Les_Richards
The only exception to this is pages that have actually been deleted since the content was collected. Perhaps for these (few) examples we could use archive.org as @lhawthorn suggests.
Adding a few more details from legal: This field should sit at the same level as the question, answer, and context fields within each example, and should be required for everyone to fill out. If there is any third party content that is leveraged for the YAML (e.g. if all the QnAs were directly taken from a website full of puns, or if the context was copy and pasted from wikipedia), then the link of where that content was obtained should be provided in that field.
One thing to bear in mind is that this format is potentially awkward for existing examples in the taxonomy since a single question/answer/context contains content from multiple distinct sources. For e.g. the main_takeaway skill has a context
key that is a concatenation of content from ten separate Wikipedia URLs.
Since the information isn't intended to be used as part of the pipeline perhaps it would be easier to express this in comments at the top of the file? This would allow linking to arbitrary lists of source URLs and would provide the necessary attribution.
Since the information isn't intended to be used as part of the pipeline perhaps it would be easier to express this in comments at the top of the file? This would allow linking to arbitrary lists of source URLs and would provide the necessary attribution.
This sounds like it could be a good solution, can you provide an example of what this would actually look like to someone who navigates to the taxonomy repo for the first time and is inspecting a skill? I can take the example to legal and ask if this would be sufficient attribution. Agreed on the point of needing to list multiple source URLs.
This sounds like it could be a good solution, can you provide an example of what this would actually look like to someone who navigates to the taxonomy repo for the first time and is inspecting a skill? I can take the example to legal and ask if this would be sufficient attribution. Agreed on the point of needing to list multiple source URLs.
The attribution requirements in licenses don't specify a particular syntax, at most they would say what kind of information needs to be attributed. Example required by CC-BY-SA:
https://creativecommons.org/licenses/by-sa/4.0/deed.en#ref-appropriate-credit
Attribution — You must give appropriate credit [1], provide a link to the license,
and indicate if changes were made. You may do so in any reasonable manner,
but not in any way that suggests the licensor endorses you or your use.
[1] appropriate credit — If supplied, you must provide the name of the creator
and attribution parties, a copyright notice, a license notice, a disclaimer notice,
and a link to the material. CC licenses prior to Version 4.0 also require you to
provide the title of the material if supplied, and may have other slight differences.
That shows it is potentially more than just a list of URLs, further info may be required depending on the license. If there's no compelling need for machine readable attribution, then utilizing comments at the top of the file for attribution is less burden and adaptable for differing attribution requirements of licenses.
Using free-form comments for attribution is what I typically see in source code where copying has taken place.
Reviewing Wikipedia:Copyrights, and CC-BY-SA 4.0 would something like the following be sufficient?
The attribution guidelines from Wikipedia:Copyrights state:
To re-distribute text on Wikipedia in any form, provide credit to the authors either by including a) a hyperlink (where possible) or URL to the page or pages you are re-using, b) a hyperlink (where possible) or URL to an alternative, stable online copy which is freely accessible, which conforms with the license, and which provides credit to the authors in a manner equivalent to the credit given on this website, or c) a list of all authors.
Using the key_points skill as an example:
# The following YAML contains copyright content excerpted from Wikipedia and made available
# under the Creative Commons Attribution-ShareAlike 4.0 International Public License. The
# content retains the copyright of the respective Wikipedia editors and contributors.
#
# https://creativecommons.org/licenses/by-sa/4.0/
#
# The following URLs are attributed as the original source of this content:
#
# https://en.wikipedia.org/wiki/Etan_Boritzer
# https://en.wikipedia.org/wiki/Harry_S._Webb
# https://en.wikipedia.org/wiki/Ian_Barry_(director)
# https://en.wikipedia.org/wiki/Pinto_Rustlers
# https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
# https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
# https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
# https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
# https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
# https://en.wikipedia.org/wiki/Peter_Levin
created_by: shivsr
seed_examples:
- answer: '1. Etan Boritzer is an American writer of children\''s literature, best
known for his book "What is God?" published in 1989.\n2. His "What is?" series,
which includes books like "What is Love?", "What is Death?", "What is Beautiful?",
etc., is a popular teaching guide for parents, teachers, and child-life professionals.\n3.
The series has caused controversy due to its universalist views and has been translated
into 15 languages.\n4. Boritzer was first published at the age of 13 and now lives
in Venice, California, where he maintains his publishing office.\n5. He has helped
numerous other authors get published and is also a yoga teacher and an erudite
speaker on "The Teachings of the Buddha."\n6. Harry S. Webb was an American film
producer, director, and screenwriter who produced and directed 100 films between
1924 and 1940.\n7. Webb and his wife, Rose Gordon, created Reliable Pictures Corporation
in 1933, which produced Westerns until 1937.\n8. Webb then started Metropolitan
Pictures Corporation in 1938, which produced several films until 1940.\n9. Ian
Barry is an Australian director of film and TV.\n10. "Pinto Rustlers" is a 1936
American western film directed by Harry S. Webb and starring Tom Tyler, George
Walsh, and Al St. John.\n11. Les Richards was an Australian rules footballer who
played with North Melbourne in the Victorian Football League (VFL).\n12. Brian
Saunders was a male weightlifter who competed for England.\n13. Theodred II was
a medieval Bishop of Elmham, whose date of consecration is unknown, but the date
of his death was between 995 and 997.\n14. Terence D. Robinson was a male wrestler
who competed for England.\n15. Pamela Jain is an Indian playback singer, born
on 16th March.\n16. Peter Levin is an American director of film, television, and
theatre.
'
context: "Etan Boritzer( born 1950) is an American writer of children \u2019s literature\
\ who is best known for his book\" What is God?\" first published in 1989.\\n\\\
nHis best selling\" What is?\" illustrated children\\'s book series on character\
\ education and difficult subjects for children is a popular teaching guide for\
\ parents, teachers and child- life professionals.\\n\\nBoritzer gained national\
\ critical acclaim after\" What is God?\" was published in 1989 although the book\
\ has caused controversy from religious fundamentalists for its universalist views.\\\
n\\nThe other current books in the\" What is?\" series include\\n\\nWhat is Love?,\\\
n\\nWhat is Death?,\\n\\nWhat is Beautiful?,\\n\\nWhat is Funny?,\\n\\nWhat is\
\ Right?,\\n\\nWhat is Peace?,\\n\\nWhat is Money?,\\n\\nWhat is Dreaming?,\\\
n\\nWhat is a Friend?,\\n\\nWhat is True?,\\n\\nWhat is a Family?,\\n\\nWhat is\
\ a Feeling?\"\\n\\nThe series is now also translated into 15 languages.\\n\\\
nBoritzer was first published in 1963 at the age of 13 when he wrote an essay\
\ in his English class at Wade Junior High School in the Bronx, New York on the\
\ assassination of John F. Kennedy.\\n\\nHis essay was included in a special anthology\
\ by New York City public school children compiled and published by the New York\
\ City Department of Education.\\n\\nBoritzer now lives in Venice, California\
\ and maintains his publishing office there also.\\n\\nHe has helped numerous\
\ other authors to get published through\" How to Get Your Book Published!\" programs.\\\
n\\nBoritzer is also a yoga teacher who teaches regular classes locally and guest-\
\ teaches nationally.\\n\\nHe is also recognized nationally as an erudite speaker\
\ on\" The Teachings of the Buddha.\"\\nHarry S. Webb (October 15, 1892 \u2013\
\ July 4, 1959) was an American film producer, director and screenwriter.\\n\\\
nHe produced 100 films between 1924 and 1940.\\n\\nHe also directed 55 films between\
\ 1924 and 1940.\\n\\nHe was the brother of \"B\"-film producer and director Ira\
\ S. Webb and the husband of screenwriter Rose Gordon, who wrote many of his films.\\\
n\\nIn 1933 Webb and Bernard B. Ray created Reliable Pictures Corporation with\
\ a studio at Beachwood and Sunset Boulevard in Hollywood.\\n\\nReliable produced\
\ and released many Westerns, starting with \"Girl Trouble\" (1933), until the\
\ company closed in 1937.\\n\\nIts final release was \"The Silver Trail\".\\n\\\
nWebb and Ray then started Metropolitan Pictures Corporation in 1938, which produced\
\ and released several films until 1940, its last being \"Pinto Canyon\".\\n\\\
nWebb then produced Westerns for Monogram Pictures.\\n\\nHe was born in Pennsylvania\
\ and died in Hollywood, from a heart attack.\\nIan Barry is an Australian director\
\ of film and TV.\\nPinto Rustlers is a 1936 American western film directed by\
\ Harry S. Webb and starring Tom Tyler, George Walsh and Al St. John.\\nLes Richards(\
\ date of birth unknown) was an Australian rules footballer who played with North\
\ Melbourne in the Victorian Football League( VFL).\\nBrian Saunders( date of\
\ birth and death unknown) was a male weightlifter who competed for England.\\\
nTheodred II was a medieval Bishop of Elmham.\\n\\nThe date of Theodred\\'s consecration\
\ unknown, but the date of his death was sometime between 995 and 997.\\nTerence\
\ D. Robinson( date of birth and death unknown) was a male wrestler who competed\
\ for England.\\nPamela Jain is an Indian playback singer.\\n\\nDate of Birth:16th\
\ March.\\nPeter Levin is an American director of film, television and theatre.\n"
question: Generate the key points from the given text.
task_description: ''
Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary. We just need all the links attributed. Having the links as a list in the comments of the file should also be sufficient, so long as it is clear in the template that this information is required. If no links are used, then in the comments the contributor should write something like "self-authored".
Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary.
Can you share a link to this justification, as IMHO copying content without directly indicating the original license is bad practice.
Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary. We just need all the links attributed. Having the links as a list in the comments of the file should also be sufficient, so long as it is clear in the template that this information is required. If no links are used, then in the comments the contributor should write something like "self-authored".
Does this refer to the changes in #359? That adds a list of acceptable licenses for 3rd party content to CONTRIBUTING.md but it doesn't seem to address the attribution question, or how to indicate what license a particular piece of content is under.
Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary. We just need all the links attributed. Having the links as a list in the comments of the file should also be sufficient, so long as it is clear in the template that this information is required. If no links are used, then in the comments the contributor should write something like "self-authored".
Does this refer to the changes in #359? That adds a list of acceptable licenses for 3rd party content to CONTRIBUTING.md but it doesn't seem to address the attribution question, or how to indicate what license a particular piece of content is under.
Yes, legal determined this statement is sufficient. It indicates that the source links provided for contributions of third party content should be used to determine the license of third party content.
In standup today we confirmed that as source is required, it should remain a field in the YAML, and not be embedded in the contents. For examples where there are multiple sources, is there any reason why we can't just have this be a free text field with URLs seperated by commas?
e.g. source: 'https://en.wikipedia.org/wiki/Etan_Boritzer, https://en.wikipedia.org/wiki/Harry_S._Webb, https://en.wikipedia.org/wiki/Ian_Barry_(director), https://en.wikipedia.org/wiki/Pinto_Rustlers, https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459, https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter), https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham), https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032, https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history, https://en.wikipedia.org/wiki/Peter_Levin'
source
could also be an array which would be easier to edit and read for humans when there are multiple attributions:
source:
- https://en.wikipedia.org/wiki/Etan_Boritzer
- https://en.wikipedia.org/wiki/Harry_S._Webb
- https://en.wikipedia.org/wiki/Ian_Barry_(director)
- https://en.wikipedia.org/wiki/Pinto_Rustlers
- https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
- https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
- https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
- https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
- https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
- https://en.wikipedia.org/wiki/Peter_Levin
Single attribution (one element in array):
source: [https://en.wikipedia.org/wiki/Pinto_Rustlers]
So is source
only required when a context
field is present? That is, is source
only about attribution for the information in the context
field?
Also, we may want to consider naming the field attribution
to be more explicit as source
is slightly vague.
I'm open to calling it attribution or source, either is fine, so long as documentation matches. It is required for everything. If there is no third party content, the contributor should just say "self-authored"
Yes, legal determined this statement is sufficient. It indicates that the source links provided for contributions of third party content should be used to determine the license of third party content.
IMHO that is unsatisfactory for license compliance. Links break over time. The target may change the license of their site content over time. Wikipedia itself just changed from CC-BY-SA-3.0 to CC-BY-SA-4.0 last year.
If contributors are copying existing copyrighted content into this project, they should be expected to record the license that was identified at the point in time it was copied in. This is not a high burden on contributors, so I don't see a reason why it should be omitted.
Tools like https://reuse.software/ which validate licensing expect to see per-file copyright statements in the form of SPDX license identifiers in every file.
source: - https://en.wikipedia.org/wiki/Etan_Boritzer - https://en.wikipedia.org/wiki/Harry_S._Webb - https://en.wikipedia.org/wiki/Ian_Barry_(director) - https://en.wikipedia.org/wiki/Pinto_Rustlers - https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459 - https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter) - https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham) - https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032 - https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history - https://en.wikipedia.org/wiki/Peter_Levin
This looks better than a comma separated list. I don't have strong feelings on whether this should be called source
or attribution
so long as the meaning of the field is well-defined.
I share the concerns raised by @berrange though - I don't think simply linking to the content is sufficient to meet the terms the content is made available under. CC-BY-SA 4.0 requires that you link to the license. Perhaps a license
key could be used to express this? E.g.:
source:
- https://en.wikipedia.org/wiki/Etan_Boritzer
- https://en.wikipedia.org/wiki/Harry_S._Webb
- https://en.wikipedia.org/wiki/Ian_Barry_(director)
- https://en.wikipedia.org/wiki/Pinto_Rustlers
- https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
- https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
- https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
- https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
- https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
- https://en.wikipedia.org/wiki/Peter_Levin
license: CC-BY-SA-4.0
So is
source
only required when acontext
field is present? That is, issource
only about attribution for the information in thecontext
field?
Most of the existing cases in the current taxonomy have the content in the context
field. There's one example where it's included directly in the answer
field, with no context
present. This is in the history skill. The content in this example comes from libretexts.org and is CC-BY 4.0.
In a lot of other cases the content is included in both the context
and answer
fields - the grammar skill is one example of this.
license: CC-BY-SA-4.0
Each attribution in an array may have different license terms. So a single peer license element wont really work. We would need to have license information next to each attribution url.
attribution:
- {source: https://en.wikipedia.org/wiki/Etan_Boritzer, license: CC-BY-SA-4.0}
- {source: https://en.wikipedia.org/wiki/Harry_S._Webb, license: CC-BY-SA-3.0}
- {source: https://en.wikipedia.org/wiki/Ian_Barry_(director), license: CC-BY-SA-4.0}
- {source: https://en.wikipedia.org/wiki/Pinto_Rustlers, license: CC-BY-SA-4.0}
- {source: https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459, license: CC-BY-SA-3.0}
- {source: https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter), license: CC-BY-SA-4.0}
- {source: https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham), license: CC-BY-SA-4.0}
- {source: https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032, license: CC-BY-SA-4.0}
- {source: https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history, license: CC-BY-SA-4.0}
- {source: https://en.wikipedia.org/wiki/Peter_Levin, license: CC-BY-SA-4.0}
Thanks, I wasn't sure how to express that in YAML.
YAMLLint reformats that notation as:
attribution:
- source: https://en.wikipedia.org/wiki/Etan_Boritzer
license: CC-BY-SA-4.0
- source: https://en.wikipedia.org/wiki/Harry_S._Webb
license: CC-BY-SA-3.0
- source: https://en.wikipedia.org/wiki/Ian_Barry_(director)
license: CC-BY-SA-4.0
- source: https://en.wikipedia.org/wiki/Pinto_Rustlers
license: CC-BY-SA-4.0
- source: https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
license: CC-BY-SA-3.0
- source: https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
license: CC-BY-SA-4.0
- source: https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
license: CC-BY-SA-4.0
- source: https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
license: CC-BY-SA-4.0
- source: https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
license: CC-BY-SA-4.0
- source: https://en.wikipedia.org/wiki/Peter_Levin
license: CC-BY-SA-4.0
Which actually looks a bit more readable.
Okay confirmed we should use this format:
attribution:
If it self attributed, would this make sense from a documentation perspective?
attribution: 'self-authored'
Or will the .yaml file get messed up if sometimes it is array with source/license and sometimes it is just a single line of text?
I'm not sure how that works in YAML. From the Python side I think code loading the YAML would need to explicitly handle the two cases (a string vs. a list of dicts?).
For original content would it work to just use something like this:
attribution:
source: self-authored
license: Apache-2.0
That would keep the structure consistent in both cases.
I will work on updating the docs, etc. for this.
I created https://github.com/instruct-lab/taxonomy/pull/492. Please review the readme changes to make sure I properly captured the decisions here.
@katesoule I've been trying to follow this discussion as I continue to be concerned about this Wikipedia/hotpot_qa content. It sounds like from https://github.com/instruct-lab/taxonomy/issues/182#issuecomment-1995691070 for the existing content that is based on hotpot_qa you are now going to provide, in the qna.yml file, both a link to the (current) relevant wikipedia.org article and an indication of the license (if not a link to the license text, at least a relatively standard name for the license)? And that the CC-BY-SA-3.0/CC-BY-SA-4.0 distinction is being accounted for? Or am I misunderstanding and is that more about how you will deal with future contributed CC-BY-SA (etc.) content but you are not going to do this for the hotpot_qa content?
In the case where there are grounded skills being submitted (and in the future when knowledge is submitted), to promote transparency and traceability, Legal is advising we ask contributors to provide a link that shows where they got any external document used for the skill's context or knowledge source.