huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License
3.33k stars 238 forks

Is it available for Commercial Use? #7

Closed souvikqb closed 7 months ago

souvikqb commented 8 months ago

Hello,

I'd like to know if the Distil Whisper model is available for commercial use.

Thanks,

patrickvonplaten commented 8 months ago

Yes, it will be MIT licensed.

dani-local commented 8 months ago

hmm... but it was built using data that cannot be used for commercial purposes... this is from the README:

image

CC-BY-NC-ND, or Creative Commons Attribution NonCommercial NoDerivs, is the most restrictive license offered by Creative Commons. With this license, the user (while attributing the original creator) can only share the work, but cannot change it in any way or ever use it commercially.

patrickvonplaten commented 8 months ago

The licenses shown there only apply to the data itself, not to a model that was trained on the data. TED-LIUM's license (as far as I know) doesn't force a model trained on it to be CC-BY-NC-ND; it just means that the data itself cannot be sold etc...

Additionally, we only used the audio, not the transcriptions. I can double check, but from what I know there is absolutely no problem with Distil-Whisper being MIT.

dani-local commented 8 months ago

It would be good to double check. I'm not sure your interpretation is correct:

"TED-LIUM's license (as far as I know) doesn't force a model trained on it to be CC-BY-NC-ND it just means that the data itself cannot be sold etc..."

Also, think of GigaSpeech: there is YouTube data in there, and afaik you cannot use that data to train models. It is in YouTube's terms and conditions, I believe.

In the end it is your call; I'm just sharing this to make sure you are aware.

o-alexandre-felipe commented 8 months ago

@dani-local Interesting question, but could you be objective? What is your interpretation?

By common sense I would think you are right, as the model would be a derivative of the data. But the Creative Commons 3.0 definition doesn't include that.

If the data were used to train an audio or image generation model, it could violate that term, as those models may memorize part of the content. How do you think one could use the model to independently get a version of the original data?

dani-local commented 8 months ago

"How you think one could use the model to independently get a version of the original data?"

I do not think that is possible; the model is about compressing the info. But that is not the point. Read here: image

"ever use it commercially": that is broad. If you build a commercial app on this data, then the data is used for commercial purposes. That is how I read it. I'm not a lawyer though.

See more here: https://opendata.stackexchange.com/questions/1661/what-can-open-data-with-a-cc-by-nc-nd-creative-commons-attribution-noncommerc

sanchit-gandhi commented 7 months ago

Copying the notes from an internal discussion between persons A-C. TLDR: the Distil-Whisper models inherit the permissive MIT license from Whisper.

Conversation notes

A: What's fairly certain is that the model license is independent of the dataset license.

B: That's definitely true for clear-cut cases, but we're also talking about who they're drawing attention from and what contributes to making them or the model trainers a target, especially in the litigious US system, where entities will start suing because they know they can hold out in court longer.

B: BTW I also think that's why it makes sense for Adobe or CodeX to provide the indemnification clauses they have for copyright suits: getting more clients balances the legal fees of fighting claims, and it's more intimidating for people to come after them than to come after smaller entities. It would be a lot riskier for us to provide similar guarantees, even if we're fully confident in the outcome.

C: Sorry, very naive question: given that the model license is generally not tied to the dataset, in what case would a trained model be non-commercial? And by extension, could we go ahead and say already that we're fairly certain the trained models will have a permissive license?

B:

  1. Organizations like Creative Commons have indicated that they don't think training a commercial model is a commercial use of the dataset, but they've stopped short of saying it publicly.
  2. Model trainers still need to argue that training the model is Fair Use with regard to the copyright status of the underlying data, which depends on other factors. The case law is still being built for AI.
  3. The US Copyright Office is currently doing consultations on AI and should provide more clarity in a few months.

So the TLDR from me is: permissive licenses on trained models are probably fine right now, but we don't want to be in a situation where anyone says "it's legal because HF told me so", and if we want to give advice to that effect, I'd like to make sure it's carefully phrased.

sanchit-gandhi commented 7 months ago

Closing this issue as per the above comment. Feel free to re-open or start a discussion if you want any further clarification @souvikqb!