Closed: Yo-Hsin closed this 4 months ago
Thanks for your question!
The audio chunking mechanism in dm.py is designed to handle audio and motion data in a synchronized manner, accounting for their respective sampling rate and frame rate.
For clarity:

```python
total_chunks = waveform.shape[1] // (16000 * 10)
```

segments the audio into 10-second chunks (16 kHz sampling rate × 10 seconds). On the motion side, the `train_pose_framelen` parameter of 300 represents a chunk of the same 10 seconds (30 FPS × 10 seconds = 300 frames). Therefore, the alignment of the audio features (c, e, s) with the motion chunks is maintained, because both the audio and the motion data are segmented into 10-second intervals. This ensures that the features computed from the audio and the corresponding motion frames are consistent with each other.
Thanks for your work and the code release.
I'd like to ask about the audio chunking mechanism in dm.py. When processing the cache, it seems that the three features extracted from the audio clip (Line 597) are not aligned with the motion chunk (Line 641). Is this implementation designed for a special purpose? Additionally, was the provided cache constructed using this implementation?
Looking forward to your reply, thanks.