Closed: Yo-Hsin closed this 4 months ago
Thanks for your question!
The audio chunking mechanism in dm.py is designed to handle audio and motion data in a synchronized manner, accounting for their respective sampling rate and frame rate.
For clarity:

```python
total_chunks = waveform.shape[1] // (16000 * 10)
```

segments the audio into 10-second chunks (16 kHz sampling rate × 10 seconds). On the motion side, the `train_pose_framelen` parameter of 300 represents a chunk of the same 10 seconds (30 FPS × 10 seconds = 300 frames). Therefore, the alignment of the audio features (c, e, s) with the motion chunks is maintained, because both the audio and the motion data are segmented into 10-second intervals. This ensures that the features computed from the audio and the corresponding motion frames are consistent with each other.
Thanks for your work and the code release.
I'd like to ask about the audio chunking mechanism in dm.py. When processing the cache, it seems that the three features extracted from the audio clip (Line 597) are not aligned with the motion chunk (Line 641). Is this implementation designed for a special purpose? Additionally, was the provided cache constructed using this implementation?
Looking forward to your reply, thanks.