Open quq99 opened 1 year ago
@quq99 Good question. As the Flan Collection (or P3, or Natural Instructions v2) is a compilation of hundreds of different datasets, with many different licenses, the rendered data would not be under Apache 2.0.
I am actually working on a full labelling of the dataset licenses and plan to release this publicly soon, so that users can take the subset of Flan that fits their licensing constraints.
@shayne-longpre Thanks a lot! Looking forward to that. When you finish, could you reply in this issue so that I know? I appreciate your work!
@quq99 Update: we plan to release this in the last week of May.
@shayne-longpre Looking forward to the dataset labeled with licenses. Thanks for the effort!
@shayne-longpre Any update on the license labelling mentioned above? Were you able to complete it?
@balachandarsv Apologies again for the wait on this. It turns out license labelling is much more complex than we had originally anticipated.
It has grown from a side project into my next major release, with many more data selection/partitioning features being added, not just for Flan but for many other relevant data sources. It's tentatively slated for mid-July. I hope this isn't too inconvenient, and apologies again for the delay.
@shayne-longpre No problem at all. Please let me know if you need help sorting the data by license. I will be happy to help! :-)
Hi @shayne-longpre, thanks for labeling all the licenses in the Flan Collection! I'm a bit confused about the Flan-T5 models' Apache-2.0 license: if some datasets in the Flan Collection have to be removed due to license constraints, how can the Flan-T5 models be released under Apache-2.0? Were they trained only on permissively licensed datasets?
Any updates?
Sorry for the long delay -- I am not at Google so haven't been maintaining this.
Licenses have been annotated for Flan and many more datasets here: https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection. This is not, however, legal advice -- interpreting how the licenses apply to the data is complicated and requires a lawyer. These annotations are meant to provide information that enables you to apply your own legal/ethical framework.
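As a rough illustration of picking out a license-compatible subset from annotations like these, here is a minimal sketch. The record layout (`name`, `license`), the example entries, and the allow-list are all hypothetical placeholders and do not reflect the actual schema or contents of the Data-Provenance-Collection. As with the annotations themselves, this is not legal advice.

```python
# Hypothetical sketch: keep only datasets whose annotated license is on an allow-list.
# Field names ("name", "license") and all entries are illustrative assumptions,
# not the real Data-Provenance-Collection schema.

def filter_by_license(records, allowed):
    """Return the records whose 'license' value (case-insensitive) is in `allowed`."""
    allowed_normalized = {lic.strip().lower() for lic in allowed}
    return [
        rec for rec in records
        if rec.get("license", "").strip().lower() in allowed_normalized
    ]

example_records = [
    {"name": "dataset_a", "license": "Apache-2.0"},
    {"name": "dataset_b", "license": "CC-BY-NC-4.0"},
    {"name": "dataset_c", "license": "MIT"},
    {"name": "dataset_d"},  # no license annotation: excluded conservatively
]

permissive = filter_by_license(example_records, {"apache-2.0", "mit"})
print([rec["name"] for rec in permissive])
```

Records with no license annotation are dropped rather than assumed permissive, which is the more cautious default. Which licenses belong on the allow-list depends on your own use case and legal review.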
Hi,
Thanks a lot for open-sourcing the code to fetch the FLAN dataset.
I noticed in the paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. (https://arxiv.org/abs/2301.13688) you mentioned
I noticed that this repo uses the Apache 2.0 license. Is the FLAN dataset fetched by this code also under the Apache 2.0 license?
Thanks a lot!