vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks

Audio Data Loading [FEATURE-REQUEST] #932

Open eladmw opened 4 years ago

eladmw commented 4 years ago

Hello, considering your amazing efficiency on pandas, numpy, and more, it would seem to make sense for your module to work with even bigger data, such as audio (for example .mp3 and .wav). This would help a lot considering the nature of audio (i.e. one of the lowest and most common sampling rates is still 44,100 samples/sec). As a use case, I would consider vaex.open('HugeData.mp3')

JovanVeljanoski commented 4 years ago

Hi,

Thanks for opening this issue. This is an interesting idea indeed. I have to say right from the start that I am not at all familiar with how one works with audio data in Python.

Vaex is mostly useful for working with data that does not fit in memory (like 10s or 100s of GB of data or more). But out of curiosity I just managed to open a random song that I have in mp3 format and I saw that it has two channels and 13 million samples, so maybe there is a proper use case here :).

So, given that mp3 files are in general small (i.e. they fit in RAM), it should be rather easy to convert them from numpy/pandas to a vaex dataframe and continue working from there. Once in a vaex dataframe, you may choose to export them to HDF5 or Arrow, for convenience.

As an example, here is a simple function that opens / converts an mp3 file into a vaex dataframe. I basically copied the code from this stackoverflow question:

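import os

import numpy as np
import pydub
import vaex


def vaex_open_mp3(file, normalized=False, convert=False):
    """Open an MP3 file as a Vaex DataFrame, one column per channel"""
    # Decoding needs ffmpeg (or libav) to be available to pydub
    a = pydub.AudioSegment.from_mp3(file)
    y = np.array(a.get_array_of_samples())

    # Samples are interleaved, so reshape to (n_samples, n_channels)
    y = y.reshape((-1, a.channels))
    if normalized:
        # Scale to [-1, 1) from the sample width in bytes (2**15 for 16-bit audio)
        y = np.float32(y) / 2 ** (8 * a.sample_width - 1)

    # Put in dict for convenience
    d = {f'channel{i}': y[:, i] for i in range(y.shape[1])}

    # Now read in as a Vaex DataFrame
    df = vaex.from_dict(d)
    # Add the frame rate as a variable
    df.variables['frame_rate'] = a.frame_rate

    # If you want to fully exploit vaex, write an HDF5 file next to the mp3
    if convert:
        df.export_hdf5(os.path.splitext(file)[0] + '.hdf5')

    # Done
    return df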

Of course, if you know of better/faster ways of opening mp3 files with numpy or pandas, it should be trivial to convert them to a vaex dataframe.
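For the .wav case mentioned in the original request, one possible route is Python's standard-library wave module, which avoids the pydub/ffmpeg dependency. The sketch below is only an illustration under stated assumptions: the helper name wav_to_columns is made up for this example, it handles 16-bit PCM only, and it reads the whole file into memory just like the mp3 function above.

```python
import array
import sys
import wave


def wav_to_columns(path):
    """Read a 16-bit PCM WAV file into one column of samples per channel.

    Hypothetical helper for illustration; not part of vaex.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wav.getnchannels()
        frame_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    # Interpret the raw bytes as signed 16-bit integers
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved (ch0, ch1, ch0, ch1, ...), so stride by channel count
    columns = {f"channel{i}": samples[i::n_channels] for i in range(n_channels)}
    return columns, frame_rate
```

The returned columns support the buffer protocol, so wrapping each one with np.asarray and passing the resulting dict to vaex.from_dict should give a dataframe with the same channel0, channel1, ... layout as above.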

Let me know if this helps at all.

maartenbreddels commented 4 years ago

Interesting idea, I think Vaex could be a great candidate for processing large amounts of audio. Ideally we'd do the decoding lazily. If anyone wants to pick this up, we could assist here.
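To give a rough feel for what decoding in pieces could look like, here is a minimal sketch that streams an uncompressed 16-bit WAV file chunk by chunk and keeps only a running statistic in memory. It is not how Vaex implements lazy evaluation, and it does not cover compressed formats such as mp3, which would need a streaming decoder. The names iter_wav_chunks and peak_per_channel are hypothetical and chosen for this example only.

```python
import array
import sys
import wave


def iter_wav_chunks(path, frames_per_chunk=65536):
    """Yield a list of per-channel sample arrays for each chunk of a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wav.getnchannels()
        while True:
            raw = wav.readframes(frames_per_chunk)
            if not raw:
                break  # end of file
            samples = array.array("h")
            samples.frombytes(raw)
            if sys.byteorder == "big":
                samples.byteswap()  # WAV sample data is little-endian
            # De-interleave this chunk into one array per channel
            yield [samples[i::n_channels] for i in range(n_channels)]


def peak_per_channel(path, frames_per_chunk=65536):
    """Compute the peak absolute amplitude per channel without loading the whole file."""
    peaks = None
    for channels in iter_wav_chunks(path, frames_per_chunk):
        chunk_peaks = [max(abs(s) for s in ch) for ch in channels]
        if peaks is None:
            peaks = chunk_peaks
        else:
            peaks = [max(a, b) for a, b in zip(peaks, chunk_peaks)]
    return peaks
```

Only one chunk of samples is held in memory at a time, so memory use is bounded by frames_per_chunk rather than by the file size.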