mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

License/Source Materials for BERT checkpoint files/vocab/settings #477

Closed ylockerman closed 3 days ago

ylockerman commented 3 years ago

Hi,

It seems that the BERT benchmark requires a number of ancillary files in addition to the Wikipedia data (i.e., model checkpoints, a vocab file, and a settings file) to reproduce the closed benchmark. However, I can't find any definitive source for the license of these files, nor can I find the provenance of the checkpoint (i.e., what data it was trained on).

It would be very helpful if the above information was available so we could evaluate any legal risk of performing the benchmark.

Thank You

P.S. My assumption is that the model was trained on Wikipedia and that the rest of the files are licensed under either CC or Apache 2.0. However, I could not find that documented anywhere, and the license file in the Google Drive is ambiguous about whether it covers those files.

johntran-nv commented 1 year ago

@sgpyc do we have updated instructions here?

johntran-nv commented 1 year ago

The License.txt file in that Google Drive states that we are covered by the Creative Commons Attribution-ShareAlike 3.0 Unported License. Does that answer your question? I believe we took the raw Wikipedia file as the source. The reason we're hosting it in a Google Drive is that Wikipedia rotates its archives, so we needed a stable place for people to repro; our intention is to follow the Wikipedia license.

hiwotadese commented 3 days ago

Closing because it was resolved by @johntran-nv: https://github.com/mlcommons/training/issues/477#issuecomment-1331402126