Closed lhawthorn closed 4 months ago
It is not spectacularly helpful as a "domain" to draw from, because lists must be filtered according to specific parameters and the outputs verified as not conflicting with our approved licenses list, but the Directory of Open Access Journals (DOAJ) website catalogs a number of open access journals from which we could and should draw.
82% of their submissions come from academic organizations, so it may behoove us to consider if we wish to scope our contributions in terms of domain (e.g. nasa.gov) versus publisher (e.g. Universidad de Chile).
It is also worth asking our legal team if sources that bear the DOAJ Seal meet our criteria for knowledge contributions provided they also fall under the scope of our acceptable licenses.
For example, here are query results where I have asked DOAJ to list out journals with the topic "science" where the content is licensed according to our approved licenses list rubric.
This list would need to be further vetted since some content is dual licensed in ways that would not work for inclusion per our approved licenses list (e.g. dual licensed CC BY and CC BY NC ND).
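To illustrate the vetting step, here is a minimal Python sketch that flags dual-licensed entries. The record shape (`title`, `licenses`) and the contents of `APPROVED_LICENSES` are placeholders chosen for illustration; they are not the DOAJ export format or our actual approved licenses list, and would need to be swapped for the real ones.

```python
# Placeholder set for illustration only; substitute the project's
# actual approved licenses list.
APPROVED_LICENSES = {"CC BY", "CC BY-SA", "CC0"}


def normalize(name):
    """Normalize spelling so 'CC BY-NC-ND' and 'cc by nc nd' compare equal."""
    return " ".join(name.replace("-", " ").upper().split())


def is_acceptable(journal):
    """Accept a journal only if it lists at least one license and
    every license it lists is on the approved list."""
    listed = {normalize(name) for name in journal.get("licenses", [])}
    approved = {normalize(name) for name in APPROVED_LICENSES}
    return bool(listed) and listed <= approved


# Hypothetical records standing in for a DOAJ query result.
journals = [
    {"title": "Journal A", "licenses": ["CC BY"]},
    {"title": "Journal B", "licenses": ["CC BY", "CC BY-NC-ND"]},
    {"title": "Journal C", "licenses": []},
]

passed = [j["title"] for j in journals if is_acceptable(j)]
flagged = [j["title"] for j in journals if not is_acceptable(j)]
```

Here the dual-licensed "Journal B" and the unlabeled "Journal C" end up in `flagged` for manual review. A check like this only narrows the list; it does not replace reading the license terms of each source.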
If we were to accept the DOAJ seal as part of our "approved content" rubric, it would rapidly open a number of open access publications for knowledge contributions.
Would like to see InstructLab approve submissions from the PLOS open access journals family of publications.
Would like to see InstructLab approve submissions from the OpenStax textbooks family of publications.
In an ideal universe, reviewing and adjudicating upon this list of proposed data sources, along with the data sources already proposed in the Knowledge Submissions Past Wikipedia design document, would allow us to create and publish a rubric against which the triager team could automatically approve data sources for knowledge contributions to the project without requiring another round of review by legal experts. I believe such an outcome is possible and we should make it a project goal to get there. (This feedback may belong in its own issue, but I will file that one when the time is right.)
I would like to be able to add knowledge and skills based on the following (to start with):
We would like to add knowledge from the Open Education Project. https://research.redhat.com/blog/research_project/foundations-in-open-source-education/
I would like to add knowledge from ibm.com.
Wouldn't that potentially break the ToS? https://www.ibm.com/legal/terms
I didn't see anything that prevents using the data in LLMs.
I'm not a lawyer, but a couple of things stood out to me, such as: "You may not mirror any of the content from this site on another Web site or in any other media."
We should check with Legal on that because it does say that web crawling is allowed as permitted in the robots.txt protocol, which seems to allow https://www.ibm.com/products/*. (My reason for wanting to use ibm.com is to add knowledge about IBM products.)
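For anyone who wants to check what a robots.txt file permits before raising it with Legal, here is a minimal sketch using Python's standard `urllib.robotparser`. The `SAMPLE_ROBOTS_TXT` rules and the `can_crawl` helper are hypothetical stand-ins for illustration, not the actual contents of ibm.com's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not the real ibm.com robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /products/
Disallow: /
"""


def can_crawl(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


products_ok = can_crawl(SAMPLE_ROBOTS_TXT, "https://www.ibm.com/products/example")
terms_ok = can_crawl(SAMPLE_ROBOTS_TXT, "https://www.ibm.com/legal/terms")
```

Under these sample rules, `products_ok` is `True` and `terms_ok` is `False`. Note that robots.txt only addresses crawling; it does not settle whether the content may be reused under the site's terms, which is still a question for Legal.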
@kwright15 Thank you for your request. Following up from a conversation with Red Hat's legal team, we have been notified that inclusion of these materials would need to be accompanied by an approval statement from IBM's legal team. If that's important to you, we can help you make this ask and I would ask my fellow Community Maintainers @joesepi @jjasghar and @mmcelaney for their assistance.
Folks, following meetings with knowledgeable legal folks, the following sources are now approved for Knowledge submissions:
We further have the following guidance: While it seems easy to say that a site on e.g. a .gov domain means the content falls into the public domain, it is actually not that simple as different subsections or, for example, images, may fall under different licensing schemes.
Further, there are a number of times where we find that the metadata for a particular document may be incorrect, and therefore a blanket approval for content from a site may yield incorrect assumptions about whether or not we should take in this data as a knowledge source. For example, Internet Archive searches often return results listed as out of copyright (e.g. pre-1927 works) even though that metadata is inaccurate.
Lastly, there are a number of sites detailed in this requested list that appear to house only openly licensed data sources under licenses on our list of acceptable content licenses, but some of the content linked therein is not actually under an open content license. Because of this detail, it is not possible to provide a blanket approval to use content at those sites.
Bottom line in the discussion was that even more knowledge submissions from various data sources will require hand vetting of each incoming data source to ensure it meets our acceptable license criteria. This does not necessarily require a legal professional to do, but does put the onus upon the Triaging team and/or other maintainers of the Taxonomy repo to understand the licensing criteria and to vet that incoming content.
I have already volunteered as tribute for the Triage team to help them with these content licensing questions, mostly because I am one of those strange people who actually finds this sort of work interesting. :D
I hope to have given @jimmysjolund a happier day with some of the updates in this issue and the related PR to update our dev docs. I am going to close this issue when #105 is reviewed and merged because we will take in further requests for reviews of different content via a Pull Request against the dev doc updated in #105.
Thank you @lhawthorn! I am interested in your help to get approval from IBM's legal team to contribute ibm.com knowledge. I am happy to reach out to the legal team, I'm just not sure who to start with.
@kwright15 I am sorry I missed your note until today, please reach out to me via email for help or ping me in the InstructLab Slack workspace.
For background on this issue, please review this document on InstructLab knowledge submissions beyond Wikipedia.
There are any number of other domains we may wish to add as approved content providers. Please use this issue to make recommendations for review by legal experts.