Voice of America Corpus?

patrickms commented 7 years ago

While working on a completely different project (teaching literacy) some time ago, I discovered a large public domain corpus that could be used for training speech recognition systems.

https://learningenglish.voanews.com/, a public broadcasting service (a U.S. equivalent of the BBC World Service), has a large corpus of news articles along with human narration, which can be crawled from their website.

As per U.S. copyright law, since Voice of America is a federal agency, the content created by Voice of America (which includes the news articles and narration, but not some of the images) is in the public domain (see https://www.usa.gov/government-works). The Voice of America terms and conditions confirm this public domain copyright status: "All text, audio and video material produced exclusively by the Voice of America is in the public domain." (quoted from the VOA Learning English terms and conditions at https://learningenglish.voanews.com/p/6021.html).

I have previously produced tools which scrape and align text from VOA and will look into whether I can release them as open source (but I don't know if/when this will be the case). In the meantime I wanted to point this out in case it's useful to your project.

The VOA narrations are all read very clearly, targeting English Language Learners, so they are not necessarily going to be representative of all speech recognition input. But given the limited size of the existing publicly available speech resources, I hope that VOA can add some useful training material.
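As a rough illustration of the scraping half of the "scrape and align" idea (this is not the tooling mentioned above), here is a minimal sketch that pulls paragraph text and MP3 links out of a single HTML page using only the Python standard library. The page structure it assumes (article text in `<p>` tags, audio exposed as an `href` or `src` ending in `.mp3`) and the names `ArticleExtractor` and `extract` are hypothetical; the real VOA markup may well differ.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect paragraph text and .mp3 links from one HTML page.

    Assumes (hypothetically) that article text sits in <p> elements and
    that audio is linked via an href/src attribute ending in ".mp3".
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # whitespace-normalized text of each <p>
        self.audio_urls = []  # attribute values that end in ".mp3"
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []
        for name, value in attrs:
            if name in ("href", "src") and value and value.lower().endswith(".mp3"):
                self.audio_urls.append(value)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the collected text pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract(html):
    """Return (paragraphs, audio_urls) found in an HTML string."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs, parser.audio_urls


# Made-up sample markup for demonstration; not copied from the VOA site.
sample = """
<html><body>
<a href="https://example.org/audio/story.mp3">Download audio</a>
<p>This is the  first sentence.</p>
<p>This is the <b>second</b> one.</p>
</body></html>
"""
paragraphs, audio_urls = extract(sample)
```

Aligning the extracted text with the audio (for example, with a forced aligner) would be a separate step and is not shown here.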

kdavis-mozilla commented 7 years ago

This sounds like a great addition to our corpus!

I'll likely have to check with our legal team to see if we're in the clear for this data set.

If you can release the "scrape and align" code, please let us know!

patrickms commented 7 years ago

I'll let you know if I am able to release the code.

Please let me know what your legal team says about the data. As far as I could tell (as a non-lawyer), it's all in the public domain (except some of the pictures, since they sometimes come from Reuters etc), and there aren't any restrictions on what can be done with it. If my understanding about that isn't correct, please tell me.

kdavis-mozilla commented 7 years ago

I just looked a bit more into the VOA Copyright Statement, in the middle of this[1] page, and it is more complicated. It states:

All text, audio and video material produced exclusively by the Voice of America is in the public domain. Credit for any use of VOA material should be given to voanews.com, Voice of America, or VOA. However, voanews.com content may also contain text, video, audio, images, graphics, and other copyrighted material that is licensed for use in VOA programming only. This material is not in the public domain and may not be copied, redistributed, sold, or published without the express permission of the copyright owner.

So it looks like the VOA content is indeed in the public domain. However, anywhere in the audio stream there might be "material that is licensed for use in VOA programming only" which is not in the public domain. The problem is differentiating the two.

Without explicit indications from VOA of what is licensed for use in VOA programming only and what is not, I don't think it's possible to use VOA content. So VOA would have to explicitly indicate for each audio stream what is/is not licensed for use in VOA programming only.

patrickms commented 7 years ago

Article and narrations by VOA editors are produced exclusively by Voice of America, which in practice I believe covers all of their regular news content.

The licensed works are restricted mostly to images which they often get from Reuters...

kdavis-mozilla commented 7 years ago

@patrickms Could you point to a reference for "Article and narrations by VOA editors are produced exclusively by Voice of America" so I can point the legal team to the info? Thanks!

patrickms commented 7 years ago

The About Us page clarifies that:

"Learning English texts, MP3s and videos are in the public domain. You are allowed to reprint them for educational and commercial purposes, with credit to learningenglish.voanews.com. VOA photos are also in the public domain. However, photos and video images from news agencies such as AP and Reuters are copyrighted, so you are not allowed to republish them."

Since for speech corpus building you would only be interested in text, audio, and possibly video, all the resources you might use from http://learningenglish.voanews.com/ are in the public domain.

kdavis-mozilla commented 7 years ago

Thanks!

kdavis-mozilla commented 4 years ago

Closing for lack of activity.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

mozilla / DeepSpeech

Voice of America Corpus? #617