instructlab / taxonomy

Taxonomy tree that will allow you to create models tuned with your data
Apache License 2.0

Adding a source field to the YAML for grounded skills / knowledge contributions #182

Closed katesoule closed 6 months ago

katesoule commented 6 months ago

In the case where there are grounded skills being submitted (and in the future when knowledge is submitted), to promote transparency and traceability, Legal is advising we ask contributors to provide a link that shows where they got any external document used for the skill's context or knowledge source.

obuzek commented 6 months ago
bmr-cymru commented 6 months ago

Should this also include the license the content was made available under where relevant? See also #255. It's a bit unclear at the moment - for example the CONTRIBUTIONS guide states "DO NOT contribute copyrighted content or content coming from another system.".

katesoule commented 6 months ago

No processing needed in the CLI, but we need users to provide a link to any third party source they might have leveraged when issuing a PR to the taxonomy repo (knowledge or skill contribution). Once we get this in place, we can remove this restriction in the contribution guidelines, allowing people to leverage content from other sources so long as it is permissibly licensed and they share a link. I'm working with legal on the exact list of licenses and how this should be articulated on the Contributions page, and will share that once it's ready.

katesoule commented 6 months ago

Adding a few more details from legal:

This field should sit at the same level as the question, answer, and context fields within each example, and should be required for everyone to fill out. If there is any third party content that is leveraged for the YAML (e.g. if all the QnAs were directly taken from a website full of puns, or if the context was copied and pasted from Wikipedia), then the link of where that content was obtained should be provided in that field.

If the entire QnA example was created by hand (e.g. I made up a bunch of puns off the top of my head), then the contributor should write something akin to "Self-Created".

Separately we will need to update the contributor guidance to make sure we have instructions on how to fill out this field, and to only include third party content that is licensed according to a list of license recommendations (still being finalized).

xukai92 commented 6 months ago

Is this just for transparency and traceability, and not for any actual use of that field? If the source is a URL or something dynamic, how do we make sure the content of the source is unchanged? I'm a bit concerned about a case where, at the time of filling in the YAML, the source URL has no issue, but later on the content at the source URL is modified and potentially contains malicious material. People might visit the link after seeing it in the source field in the YAML.

katesoule commented 6 months ago

This is not intended to be used by any part of the pipeline, or for reviewers to even click on it. But it is required we have some sort of attribution for third party content if we want to allow people to contribute content that leverages any of the most commonly used licenses (CC-BY-SA, Apache 2.0, MIT, etc. all require attribution).

I defer to you and the team on the security risk this may introduce. Are there other open source projects that have similar contributions styles that we can learn from?

xukai92 commented 6 months ago

This is not intended to be used by any part of the pipeline, or for reviewers to even click on it. But it is required we have some sort of attribution for third party content if we want to allow people to contribute content that leverages any of the most commonly used licenses (CC-BY-SA, Apache 2.0, MIT, etc. all require attribution).

Thanks for clarifying!

I defer to you and the team on the security risk this may introduce. Are there other open source projects that have similar contributions styles that we can learn from?

The team and I are not legal experts, so I don't think we should be making decisions here. Contributing documents is not common in open-source projects, but for code there is stuff like the DCO, which we are currently enforcing in the cli repo.

... Legal is advising we ask contributors to provide a link that shows where they got any external document used for the skill's context or knowledge source ...

It looks like you have consulted legal on this. Were any alternative options discussed? I know little about what has been discussed, but perhaps you could share more information about it and how this decision was made.

katesoule commented 6 months ago

DCO is also being required in the taxonomy repo, but unfortunately doesn't count as attribution. When you attribute something you have to cite where you found it, and we need to provide traceability so that you can go back and inspect whether any prior modifications were made to the content that was contributed (e.g. there is a HotPot dataset we are using for the context field in our seed data for writing skills; this dataset is actually a modification of Wikipedia data, and its license requires that we can go back and trace those modifications).

Legal didn't provide any alternatives for how we can attribute these types of sources, but I can ask and see if they have other ideas beyond a link (although I'm not hopeful). I'm curious if @lhawthorn has suggestions, as she was one of the first people to suggest we should be collecting links for all the contributed sources.

katesoule commented 6 months ago

The other thought I had is we could put in our contributor policies to only leverage third party content from other trusted and governed sites used to share open source data and other open source content, like Hugging Face, Wikipedia, or .gov websites, although this will restrict some of the creativity of contributions.

lhawthorn commented 6 months ago

We could follow the same style as Wikipedia references, namely having the submitter cite the original source on the web along with a link to the content preserved at time of access via Internet Archive's Wayback Machine (preserving URLs via https://archive.is/).

To be entirely clear, here is an example using the Wikipedia page for Red Hat:

https://en.wikipedia.org/wiki/Red_Hat#References

For reference 2 in this list, the content is attributed to a page that no longer exists (accessing yields 404 error): https://www.redhat.com/en/about/company/management/paul-cormier

But one can still access the content as it appeared on the date of original access via: https://web.archive.org/web/20200406151430/https://www.redhat.com/en/about/company/management/paul-cormier

While requesting that this content be included along with a PR is an extra step for the contributor, it's a simple step and in keeping with project norms for open content projects.
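As a rough illustration of how small that extra step is, the sketch below builds a Wayback Machine snapshot URL in the same shape as the Red Hat example above. The function name `wayback_url` is hypothetical, and the sketch only formats a URL; it assumes a snapshot for that page already exists (it does not ask the Internet Archive to create one).

```python
from datetime import datetime, timezone


def wayback_url(original_url: str, accessed: datetime) -> str:
    """Build a Wayback Machine snapshot URL for a page as of `accessed`.

    Hypothetical helper for illustration; not part of the taxonomy tooling.
    """
    # Snapshots are addressed by a 14-digit UTC timestamp (YYYYMMDDhhmmss)
    # followed by the original URL.
    stamp = accessed.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{original_url}"


snapshot = wayback_url(
    "https://www.redhat.com/en/about/company/management/paul-cormier",
    datetime(2020, 4, 6, 15, 14, 30, tzinfo=timezone.utc),
)
print(snapshot)
```

With the access time from the example above, this reproduces the archived link quoted in this comment.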

We may wish to consider supporting Internet Archive financially with a tax deductible donation in recognition of the valuable service the organization would provide to us as part of running this project. I would certainly recommend doing so, though of course it is not required.

https://archive.org/donate

berrange commented 6 months ago

But it is required we have some sort of attribution for third party content if we want to allow people to contribute content that leverages any of the most commonly used licenses (CC-BY-SA, Apache 2.0, MIT, etc. all require attribution).

Whether it was intended to be allowed or not, this has in fact already happened. There are multiple examples of YAML files in git tree which have directly copied text from wikipedia and thus are CC-BY-SA. See #255 for the examples.

bmr-cymru commented 6 months ago

Is this just for transparency and traceability, and not for any actual use of that field? If the source is a URL or something dynamic, how do we make sure the content of the source is unchanged? I'm a bit concerned about a case where, at the time of filling in the YAML, the source URL has no issue, but later on the content at the source URL is modified and potentially contains malicious material. People might visit the link after seeing it in the source field in the YAML.

For Wikipedia (which accounts for almost all the 3rd party content currently in the tree) it's possible to link to a specific revision of the article. This is necessary for a number of the existing cases, since they contain content taken from 3-4 year old articles that have since changed on the main page. For example, https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459 - this content is present in the key_points skill, but no longer appears at https://en.wikipedia.org/wiki/Les_Richards

The only exception to this is pages that have actually been deleted since the content was collected. Perhaps for these (few) examples we could use archive.org as @lhawthorn suggests.
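A minimal sketch of building such a revision-pinned link, assuming the contributor already knows the revision ID (the `oldid` shown in the article's history). The helper name `wikipedia_permalink` and its `lang` parameter are made up for illustration.

```python
from urllib.parse import urlencode


def wikipedia_permalink(title: str, oldid: int, lang: str = "en") -> str:
    """Return a link to one specific revision of a Wikipedia article.

    Hypothetical helper; `oldid` is the revision ID from the page history.
    """
    # urlencode escapes any characters in the title that are not URL-safe.
    query = urlencode({"title": title, "oldid": oldid})
    return f"https://{lang}.wikipedia.org/w/index.php?{query}"


permalink = wikipedia_permalink("Lachlan_Richards", 953151459)
print(permalink)
```

Unlike a plain `/wiki/...` link, the result keeps pointing at the text as it was when it was copied, even if the article is later edited or moved.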

bmr-cymru commented 6 months ago

Adding a few more details from legal: This field should sit at the same level as the question, answer, and context fields within each example, and should be required for everyone to fill out. If there is any third party content that is leveraged for the YAML (e.g. if all the QnAs were directly taken from a website full of puns, or if the context was copied and pasted from Wikipedia), then the link of where that content was obtained should be provided in that field.

One thing to bear in mind is that this format is potentially awkward for existing examples in the taxonomy, since a single question/answer/context can contain content from multiple distinct sources. For example, the main_takeaway skill has a context key that is a concatenation of content from ten separate Wikipedia URLs.

Since the information isn't intended to be used as part of the pipeline perhaps it would be easier to express this in comments at the top of the file? This would allow linking to arbitrary lists of source URLs and would provide the necessary attribution.

katesoule commented 6 months ago

Since the information isn't intended to be used as part of the pipeline perhaps it would be easier to express this in comments at the top of the file? This would allow linking to arbitrary lists of source URLs and would provide the necessary attribution.

This sounds like it could be a good solution, can you provide an example of what this would actually look like to someone who navigates to the taxonomy repo for the first time and is inspecting a skill? I can take the example to legal and ask if this would be sufficient attribution. Agreed on the point of needing to list multiple source URLs.

berrange commented 6 months ago

This sounds like it could be a good solution, can you provide an example of what this would actually look like to someone who navigates to the taxonomy repo for the first time and is inspecting a skill? I can take the example to legal and ask if this would be sufficient attribution. Agreed on the point of needing to list multiple source URLs.

The attribution requirements in licenses don't specify a particular syntax, at most they would say what kind of information needs to be attributed. Example required by CC-BY-SA:

https://creativecommons.org/licenses/by-sa/4.0/deed.en#ref-appropriate-credit

Attribution — You must give appropriate credit [1], provide a link to the license,
and indicate if changes were made. You may do so in any reasonable manner, 
but not in any way that suggests the licensor endorses you or your use.

[1] appropriate credit — If supplied, you must provide the name of the creator
and attribution parties, a copyright notice, a license notice, a disclaimer notice,
and a link to the material. CC licenses prior to Version 4.0 also require you to
provide the title of the material if supplied, and may have other slight differences. 

That shows it is potentially more than just a list of URLs; further info may be required depending on the license. If there's no compelling need for machine readable attribution, then utilizing comments at the top of the file for attribution is less of a burden and can adapt to the differing attribution requirements of licenses.

Using free-form comments for attribution is what I typically see in source code where copying has taken place.

bmr-cymru commented 6 months ago

Reviewing Wikipedia:Copyrights and CC-BY-SA 4.0, would something like the following be sufficient?

The attribution guidelines from Wikipedia:Copyrights state:

To re-distribute text on Wikipedia in any form, provide credit to the authors either by including a) a hyperlink (where possible) or URL to the page or pages you are re-using, b) a hyperlink (where possible) or URL to an alternative, stable online copy which is freely accessible, which conforms with the license, and which provides credit to the authors in a manner equivalent to the credit given on this website, or c) a list of all authors.

Using the key_points skill as an example:

# The following YAML contains copyright content excerpted from Wikipedia and made available
# under the Creative Commons Attribution-ShareAlike 4.0 International Public License. The
# content retains the copyright of the respective Wikipedia editors and contributors.
#
# https://creativecommons.org/licenses/by-sa/4.0/
# 
# The following URLs are attributed as the original source of this content:
# 
# https://en.wikipedia.org/wiki/Etan_Boritzer
# https://en.wikipedia.org/wiki/Harry_S._Webb
# https://en.wikipedia.org/wiki/Ian_Barry_(director)
# https://en.wikipedia.org/wiki/Pinto_Rustlers
# https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
# https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
# https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
# https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
# https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
# https://en.wikipedia.org/wiki/Peter_Levin
created_by: shivsr
seed_examples:
- answer: '1. Etan Boritzer is an American writer of children\''s literature, best
    known for his book "What is God?" published in 1989.\n2. His "What is?" series,
    which includes books like "What is Love?", "What is Death?", "What is Beautiful?",
    etc., is a popular teaching guide for parents, teachers, and child-life professionals.\n3.
    The series has caused controversy due to its universalist views and has been translated
    into 15 languages.\n4. Boritzer was first published at the age of 13 and now lives
    in Venice, California, where he maintains his publishing office.\n5. He has helped
    numerous other authors get published and is also a yoga teacher and an erudite
    speaker on "The Teachings of the Buddha."\n6. Harry S. Webb was an American film
    producer, director, and screenwriter who produced and directed 100 films between
    1924 and 1940.\n7. Webb and his wife, Rose Gordon, created Reliable Pictures Corporation
    in 1933, which produced Westerns until 1937.\n8. Webb then started Metropolitan
    Pictures Corporation in 1938, which produced several films until 1940.\n9. Ian
    Barry is an Australian director of film and TV.\n10. "Pinto Rustlers" is a 1936
    American western film directed by Harry S. Webb and starring Tom Tyler, George
    Walsh, and Al St. John.\n11. Les Richards was an Australian rules footballer who
    played with North Melbourne in the Victorian Football League (VFL).\n12. Brian
    Saunders was a male weightlifter who competed for England.\n13. Theodred II was
    a medieval Bishop of Elmham, whose date of consecration is unknown, but the date
    of his death was between 995 and 997.\n14. Terence D. Robinson was a male wrestler
    who competed for England.\n15. Pamela Jain is an Indian playback singer, born
    on 16th March.\n16. Peter Levin is an American director of film, television, and
    theatre.

    '
  context: "Etan Boritzer( born 1950) is an American writer of children \u2019s literature\
    \ who is best known for his book\" What is God?\" first published in 1989.\\n\\\
    nHis best selling\" What is?\" illustrated children\\'s book series on character\
    \ education and difficult subjects for children is a popular teaching guide for\
    \ parents, teachers and child- life professionals.\\n\\nBoritzer gained national\
    \ critical acclaim after\" What is God?\" was published in 1989 although the book\
    \ has caused controversy from religious fundamentalists for its universalist views.\\\
    n\\nThe other current books in the\" What is?\" series include\\n\\nWhat is Love?,\\\
    n\\nWhat is Death?,\\n\\nWhat is Beautiful?,\\n\\nWhat is Funny?,\\n\\nWhat is\
    \ Right?,\\n\\nWhat is Peace?,\\n\\nWhat is Money?,\\n\\nWhat is Dreaming?,\\\
    n\\nWhat is a Friend?,\\n\\nWhat is True?,\\n\\nWhat is a Family?,\\n\\nWhat is\
    \ a Feeling?\"\\n\\nThe series is now also translated into 15 languages.\\n\\\
    nBoritzer was first published in 1963 at the age of 13 when he wrote an essay\
    \ in his English class at Wade Junior High School in the Bronx, New York on the\
    \ assassination of John F. Kennedy.\\n\\nHis essay was included in a special anthology\
    \ by New York City public school children compiled and published by the New York\
    \ City Department of Education.\\n\\nBoritzer now lives in Venice, California\
    \ and maintains his publishing office there also.\\n\\nHe has helped numerous\
    \ other authors to get published through\" How to Get Your Book Published!\" programs.\\\
    n\\nBoritzer is also a yoga teacher who teaches regular classes locally and guest-\
    \ teaches nationally.\\n\\nHe is also recognized nationally as an erudite speaker\
    \ on\" The Teachings of the Buddha.\"\\nHarry S. Webb (October 15, 1892 \u2013\
    \ July 4, 1959) was an American film producer, director and screenwriter.\\n\\\
    nHe produced 100 films between 1924 and 1940.\\n\\nHe also directed 55 films between\
    \ 1924 and 1940.\\n\\nHe was the brother of \"B\"-film producer and director Ira\
    \ S. Webb and the husband of screenwriter Rose Gordon, who wrote many of his films.\\\
    n\\nIn 1933 Webb and Bernard B. Ray created Reliable Pictures Corporation with\
    \ a studio at Beachwood and Sunset Boulevard in Hollywood.\\n\\nReliable produced\
    \ and released many Westerns, starting with \"Girl Trouble\" (1933), until the\
    \ company closed in 1937.\\n\\nIts final release was \"The Silver Trail\".\\n\\\
    nWebb and Ray then started Metropolitan Pictures Corporation in 1938, which produced\
    \ and released several films until 1940, its last being \"Pinto Canyon\".\\n\\\
    nWebb then produced Westerns for Monogram Pictures.\\n\\nHe was born in Pennsylvania\
    \ and died in Hollywood, from a heart attack.\\nIan Barry is an Australian director\
    \ of film and TV.\\nPinto Rustlers is a 1936 American western film directed by\
    \ Harry S. Webb and starring Tom Tyler, George Walsh and Al St. John.\\nLes Richards(\
    \ date of birth unknown) was an Australian rules footballer who played with North\
    \ Melbourne in the Victorian Football League( VFL).\\nBrian Saunders( date of\
    \ birth and death unknown) was a male weightlifter who competed for England.\\\
    nTheodred II was a medieval Bishop of Elmham.\\n\\nThe date of Theodred\\'s consecration\
    \ unknown, but the date of his death was sometime between 995 and 997.\\nTerence\
    \ D. Robinson( date of birth and death unknown) was a male wrestler who competed\
    \ for England.\\nPamela Jain is an Indian playback singer.\\n\\nDate of Birth:16th\
    \ March.\\nPeter Levin is an American director of film, television and theatre.\n"
  question: Generate the key points from the given text.
task_description: ''

katesoule commented 6 months ago

Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary. We just need all the links attributed. Having the links as a list in the comments of the file should also be sufficient, so long as it is clear in the template that this information is required. If no links are used, then in the comments the contributor should write something like "self-authored".

berrange commented 6 months ago

Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary.

Can you share a link to this justification, as IMHO copying content without directly indicating the original license is bad practice.

bmr-cymru commented 6 months ago

Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary. We just need all the links attributed. Having the links as a list in the comments of the file should also be sufficient, so long as it is clear in the template that this information is required. If no links are used, then in the comments the contributor should write something like "self-authored".

Does this refer to the changes in #359? That adds a list of acceptable licenses for 3rd party content to CONTRIBUTING.md but it doesn't seem to address the attribution question, or how to indicate what license a particular piece of content is under.

katesoule commented 6 months ago

Legal has provided us with additional language to include in the contributor guidelines indicating how third party content is licensed, so the top part shouldn't be necessary. We just need all the links attributed. Having the links as a list in the comments of the file should also be sufficient, so long as it is clear in the template that this information is required. If no links are used, then in the comments the contributor should write something like "self-authored".

Does this refer to the changes in #359? That adds a list of acceptable licenses for 3rd party content to CONTRIBUTING.md but it doesn't seem to address the attribution question, or how to indicate what license a particular piece of content is under.

Yes, legal determined this statement is sufficient. It indicates that the source links provided for contributions of third party content should be used to determine the license of third party content.

katesoule commented 6 months ago

In standup today we confirmed that, as source is required, it should remain a field in the YAML, and not be embedded in the contents. For examples where there are multiple sources, is there any reason why we can't just have this be a free text field with URLs separated by commas?

e.g. source: 'https://en.wikipedia.org/wiki/Etan_Boritzer, https://en.wikipedia.org/wiki/Harry_S._Webb, https://en.wikipedia.org/wiki/Ian_Barry_(director), https://en.wikipedia.org/wiki/Pinto_Rustlers, https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459, https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter), https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham), https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032, https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history, https://en.wikipedia.org/wiki/Peter_Levin'

bjhargrave commented 6 months ago

source could also be an array which would be easier to edit and read for humans when there are multiple attributions:

source: 
  - https://en.wikipedia.org/wiki/Etan_Boritzer
  - https://en.wikipedia.org/wiki/Harry_S._Webb
  - https://en.wikipedia.org/wiki/Ian_Barry_(director)
  - https://en.wikipedia.org/wiki/Pinto_Rustlers
  - https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
  - https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
  - https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
  - https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
  - https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
  - https://en.wikipedia.org/wiki/Peter_Levin

Single attribution (one element in array):

source: [https://en.wikipedia.org/wiki/Pinto_Rustlers]

So is source only required when a context field is present? That is, is source only about attribution for the information in the context field?

Also, we may want to consider naming the field attribution to be more explicit as source is slightly vague.

katesoule commented 6 months ago

I'm open to calling it attribution or source, either is fine, so long as documentation matches. It is required for everything. If there is no third party content, the contributor should just say "self-authored"

berrange commented 6 months ago

Yes, legal determined this statement is sufficient. It indicates that the source links provided for contributions of third party content should be used to determine the license of third party content.

IMHO that is unsatisfactory for license compliance. Links break over time. The target may change the license of their site content over time. Wikipedia itself just changed from CC-BY-SA-3.0 to CC-BY-SA-4.0 last year.

If contributors are copying existing copyrighted content into this project, they should be expected to record the license that was identified at the point in time it was copied in. This is not a high burden on contributors, so I don't see a reason why it should be omitted.

Tools like https://reuse.software/ which validate licensing expect to see per-file copyright statements in the form of SPDX license identifiers in every file.

bmr-cymru commented 6 months ago

source:
  - https://en.wikipedia.org/wiki/Etan_Boritzer
  - https://en.wikipedia.org/wiki/Harry_S._Webb
  - https://en.wikipedia.org/wiki/Ian_Barry_(director)
  - https://en.wikipedia.org/wiki/Pinto_Rustlers
  - https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
  - https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
  - https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
  - https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
  - https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
  - https://en.wikipedia.org/wiki/Peter_Levin

This looks better than a comma separated list. I don't have strong feelings on whether this should be called source or attribution so long as the meaning of the field is well-defined.

I share the concerns raised by @berrange though - I don't think simply linking to the content is sufficient to meet the terms the content is made available under. CC-BY-SA 4.0 requires that you link to the license. Perhaps a license key could be used to express this? E.g.:

source:
   - https://en.wikipedia.org/wiki/Etan_Boritzer
   - https://en.wikipedia.org/wiki/Harry_S._Webb
   - https://en.wikipedia.org/wiki/Ian_Barry_(director)
   - https://en.wikipedia.org/wiki/Pinto_Rustlers
   - https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
   - https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
   - https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
   - https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
   - https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
   - https://en.wikipedia.org/wiki/Peter_Levin
license: CC-BY-SA-4.0

So is source only required when a context field is present? That is, is source only about attribution for the information in the context field?

Most of the existing cases in the current taxonomy have the content in the context field. There's one example where it's included directly in the answer field, with no context present. This is in the history skill. The content in this example comes from libretexts.org and is CC-BY 4.0.

In a lot of other cases the content is included in both the context and answer fields - the grammar skill is one example of this.

bjhargrave commented 6 months ago

license: CC-BY-SA-4.0

Each attribution in an array may have different license terms, so a single peer license element won't really work. We would need to have license information next to each attribution URL.

attribution:
   - {source: https://en.wikipedia.org/wiki/Etan_Boritzer, license: CC-BY-SA-4.0}
   - {source: https://en.wikipedia.org/wiki/Harry_S._Webb, license: CC-BY-SA-3.0}
   - {source: https://en.wikipedia.org/wiki/Ian_Barry_(director), license: CC-BY-SA-4.0}
   - {source: https://en.wikipedia.org/wiki/Pinto_Rustlers, license: CC-BY-SA-4.0}
   - {source: https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459, license: CC-BY-SA-3.0}
   - {source: https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter), license: CC-BY-SA-4.0}
   - {source: https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham), license: CC-BY-SA-4.0}
   - {source: https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032, license: CC-BY-SA-4.0}
   - {source: https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history, license: CC-BY-SA-4.0}
   - {source: https://en.wikipedia.org/wiki/Peter_Levin, license: CC-BY-SA-4.0}

bmr-cymru commented 6 months ago

Thanks, I wasn't sure how to express that in YAML.

bmr-cymru commented 6 months ago

YAMLLint reformats that notation as:

attribution:
  - source: https://en.wikipedia.org/wiki/Etan_Boritzer
    license: CC-BY-SA-4.0
  - source: https://en.wikipedia.org/wiki/Harry_S._Webb
    license: CC-BY-SA-3.0
  - source: https://en.wikipedia.org/wiki/Ian_Barry_(director)
    license: CC-BY-SA-4.0
  - source: https://en.wikipedia.org/wiki/Pinto_Rustlers
    license: CC-BY-SA-4.0
  - source: https://en.wikipedia.org/w/index.php?title=Lachlan_Richards&oldid=953151459
    license: CC-BY-SA-3.0
  - source: https://en.wikipedia.org/wiki/Brian_Saunders_(weightlifter)
    license: CC-BY-SA-4.0
  - source: https://en.wikipedia.org/wiki/Theodred_II_(Bishop_of_Elmham)
    license: CC-BY-SA-4.0
  - source: https://en.wikipedia.org/w/index.php?title=Terence_Robinson&oldid=1181905032
    license: CC-BY-SA-4.0
  - source: https://en.wikipedia.org/w/index.php?title=Pamela_Jain&action=history
    license: CC-BY-SA-4.0
  - source: https://en.wikipedia.org/wiki/Peter_Levin
    license: CC-BY-SA-4.0

Which actually looks a bit more readable.
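To show what a per-entry check of this structure could look like, here is a minimal sketch that operates on the list after it has been parsed into Python objects by whatever YAML loader is in use. The function name `attribution_problems` and the `ALLOWED_LICENSES` set are hypothetical; the real list of acceptable licenses was still being finalized in this thread.

```python
# Hypothetical allow-list for illustration only; the project's actual list of
# accepted licenses is defined elsewhere and may differ.
ALLOWED_LICENSES = {"Apache-2.0", "CC-BY-4.0", "CC-BY-SA-3.0", "CC-BY-SA-4.0", "MIT"}


def attribution_problems(entries: list) -> list[str]:
    """Return human-readable problems found in a parsed attribution list."""
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected a mapping with source and license")
            continue
        # Every attribution needs both keys, since licenses can differ per source.
        for key in ("source", "license"):
            if not entry.get(key):
                problems.append(f"entry {i}: missing {key}")
        lic = entry.get("license")
        if lic and lic not in ALLOWED_LICENSES:
            problems.append(f"entry {i}: unrecognized license {lic!r}")
    return problems


entries = [
    {"source": "https://en.wikipedia.org/wiki/Etan_Boritzer", "license": "CC-BY-SA-4.0"},
    {"source": "https://en.wikipedia.org/wiki/Harry_S._Webb"},  # license omitted
]
problems = attribution_problems(entries)
print(problems)
```

Keeping source and license together in each entry is what makes a check like this possible; with a single file-level license key there would be nothing to compare per source.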

katesoule commented 6 months ago

Okay confirmed we should use this format:

attribution:

If it is self-attributed, would this make sense from a documentation perspective?

attribution: 'self-authored'

Or will the .yaml file get messed up if sometimes it is an array with source/license and sometimes it is just a single line of text?

bmr-cymru commented 6 months ago

I'm not sure how that works in YAML. From the Python side I think code loading the YAML would need to explicitly handle the two cases (a string vs. a list of dicts?).

For original content would it work to just use something like this:

attribution:
  source: self-authored
  license: Apache-2.0

That would keep the structure consistent in both cases.
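For comparison, this is roughly what the explicit handling would look like on the Python side if the field were allowed to take more than one shape. It is only a sketch: `normalize_attribution` is a hypothetical name, and the input is assumed to be the value already parsed from the YAML (a string, a mapping, or a list).

```python
def normalize_attribution(value) -> list[dict]:
    """Coerce a parsed attribution value into a list of source/license dicts.

    Hypothetical helper; not part of the taxonomy or CLI code.
    """
    if isinstance(value, str):
        # Bare string such as 'self-authored': no license was recorded.
        return [{"source": value, "license": None}]
    if isinstance(value, dict):
        # Single mapping with source and license keys.
        return [dict(value)]
    if isinstance(value, list):
        # List of entries; normalize each one and flatten the results.
        return [entry for item in value for entry in normalize_attribution(item)]
    raise TypeError(f"unsupported attribution value: {type(value).__name__}")


as_string = normalize_attribution("self-authored")
as_mapping = normalize_attribution({"source": "self-authored", "license": "Apache-2.0"})
as_list = normalize_attribution(
    [{"source": "https://en.wikipedia.org/wiki/Pinto_Rustlers", "license": "CC-BY-SA-4.0"}]
)
print(as_string, as_mapping, as_list, sep="\n")
```

Note that the bare-string form loses the license, which is one argument for keeping the source/license structure in every case.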

bjhargrave commented 6 months ago

I will work on updating the docs, etc. for this.

bjhargrave commented 6 months ago

I created https://github.com/instruct-lab/taxonomy/pull/492. Please review the readme changes to make sure I properly captured the decisions here.

richardfontana commented 6 months ago

@katesoule I've been trying to follow this discussion as I continue to be concerned about this Wikipedia/hotpot_qa content. It sounds like from https://github.com/instruct-lab/taxonomy/issues/182#issuecomment-1995691070 for the existing content that is based on hotpot_qa you are now going to provide, in the qna.yml file, both a link to the (current) relevant wikipedia.org article and an indication of the license (if not a link to the license text, at least a relatively standard name for the license)? And that the CC-BY-SA-3.0/CC-BY-SA-4.0 distinction is being accounted for? Or am I misunderstanding and is that more about how you will deal with future contributed CC-BY-SA (etc.) content but you are not going to do this for the hotpot_qa content?