Closed ylockerman closed 3 days ago
@sgpyc do we have updated instructions here?
The License.txt
file in that google drive describes that we are covered by Creative Commons Attribution-Sharealike 3.0 Unported License. Does that answer your question? I believe we took the raw Wikipedia file as source, but the reason we're hosting here in a google drive is that Wikipedia rotates its archives, so we needed a stable place for people to repro, but our intention is to follow the Wikipedia license.
Closing because it is resolve by @johntran-nv https://github.com/mlcommons/training/issues/477#issuecomment-1331402126
Hi,
It seems like the BERT benchmark requires a number of ancillary files in addition to the Wikipedia data (i.e. Model checkpoints, vocab file, settings file) that are needed to reproduce the closed benchmark. However, I can't find any definitive source to the license of these files. Nor can I find the provenance of the checkpoint (i.e. what data is was trained on).
It would be very helpful if the above information was available so we could evaluate any legal risk of performing the benchmark.
Thank You
p.s. My assumption is that the model was trained from Wikipedia and the rest of the files are either CC or Apache 2.0. However, I could not find that documented anywhere and the license file in the google drive is ambiguous if it includes those files.