Discern music from spoken word

🚀 The feature

I'm wondering whether any researchers out there know of a way to analyze an audio file, such as an mp3, and determine whether the track is purely spoken word or music. I can think of several potential techniques (such as phonetic search), each with a different level of accuracy. Perhaps there are ffmpeg scripts for this that I'm not aware of.

Motivation, pitch

I am working on a project in which I generate a folder of mp3 tracks. Each track is either spoken word or music (never both), and I simply want to separate the music from the spoken word without having to listen to each track.

Alternatives

I don't believe there are any viable alternatives other than listening to each track.

Additional context

I've done a fair amount of research on phonetics, and on phonetic search in particular, but I haven't been able to find any projects that address this problem directly. One idea is to detect the presence of a specific instrument: in almost every music track, a piano, drums, or both are playing.

I should specify that none of the music is under any licensing constraints, nor can any of it be matched to existing fingerprints for known songs.
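
To make the request more concrete, here is a rough sketch of the kind of crude baseline I could imagine, not a solution I have validated on my tracks. It uses a "low-energy frame ratio" heuristic: speech tends to contain pauses between words, so a larger share of its short frames are much quieter than average, while continuous music tends to have steadier energy. Everything here is an assumption on my part: the function names (`frame_rms`, `low_energy_ratio`, `looks_like_speech`), the 400-sample frame length, and the 0.2 threshold are all hypothetical and untuned. It also assumes each mp3 has already been decoded to a flat list of mono float samples, which is not shown. The two signals at the bottom are synthetic stand-ins, not real audio.

```python
import math
import random


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame of `frame_len` samples."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def low_energy_ratio(samples, frame_len=400):
    """Fraction of frames whose RMS is below half of the mean frame RMS."""
    rms = frame_rms(samples, frame_len)
    if not rms:
        return 0.0
    mean_rms = sum(rms) / len(rms)
    quiet = sum(1 for r in rms if r < 0.5 * mean_rms)
    return quiet / len(rms)


def looks_like_speech(samples, frame_len=400, threshold=0.2):
    """Guess 'speech' when many frames are quiet. The threshold is an untuned assumption."""
    return low_energy_ratio(samples, frame_len) > threshold


# Synthetic stand-in for steady music: one second of a 440 Hz tone at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]

# Synthetic stand-in for speech: 200 ms noise bursts separated by 200 ms of silence.
rng = random.Random(0)
bursts = []
for _ in range(4):
    bursts += [rng.uniform(-1.0, 1.0) for _ in range(3200)]
    bursts += [0.0] * 3200

print("tone looks like speech:", looks_like_speech(tone))
print("bursts look like speech:", looks_like_speech(bursts))
```

I would expect a single feature like this to be fooled by sparse music (a solo piano with long rests, for example) or by speech with few pauses, which is why I am asking whether something more robust already exists.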

pytorch / audio