jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

Question about Audio Preprocessing #46

Open xjf-303 opened 2 weeks ago

xjf-303 commented 2 weeks ago

Hello @jishengpeng, thank you for the amazing work. May I ask a few questions?

I was going through the paper and noticed that during the data preprocessing, audio is first cropped to a fixed length of 10 seconds and then randomly cropped again to obtain 3-second segments. I have a couple of questions regarding this process:

1. Could you explain the rationale behind first cropping the audio to 10 seconds and then performing another random crop to 3-second segments? How does this impact the model's performance or training?

2. Are there any overlaps between the cropped segments, or are they entirely distinct?

3. If possible, could you please share the code for this part of the data preprocessing pipeline?

Thank you for your time and consideration! Looking forward to your response.

jishengpeng commented 2 weeks ago

Thank you for your attention.

  1. The audio was randomly segmented into 10-second clips, so the segmentation itself is a stochastic process. The 10-second length is an empirical choice: we picked a value that seemed reasonable.
  2. Some recordings are as long as 5 minutes, so we segmented each of those into 30 distinct 10-second clips.
  3. The corresponding code is short, roughly 30 lines. It computes the sample positions from the 10-second interval and the sampling rate, then cuts the audio at those positions to produce the segments.
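
The actual preprocessing code was not posted in this thread. As a rough illustration of the description above, here is a minimal sketch that computes sample offsets for consecutive, non-overlapping 10-second clips and then picks a random 3-second window inside one of them. The function names, the 24 kHz sampling rate, and the choice to drop a trailing remainder shorter than one clip are all assumptions made for this example, not the repository's implementation.

```python
import random


def clip_boundaries(num_samples, sample_rate, clip_seconds=10.0):
    """Return (start, end) sample offsets of consecutive, non-overlapping clips.

    Assumption: a trailing remainder shorter than one clip is dropped.
    """
    clip_len = int(clip_seconds * sample_rate)
    n_clips = num_samples // clip_len
    return [(i * clip_len, (i + 1) * clip_len) for i in range(n_clips)]


def random_crop_bounds(clip_start, clip_end, sample_rate, crop_seconds=3.0, rng=random):
    """Pick a random window of crop_seconds inside [clip_start, clip_end)."""
    crop_len = int(crop_seconds * sample_rate)
    if clip_end - clip_start < crop_len:
        raise ValueError("clip is shorter than the requested crop")
    # randint is inclusive on both ends, so the window always fits in the clip.
    start = rng.randint(clip_start, clip_end - crop_len)
    return start, start + crop_len


sample_rate = 24_000                 # assumed sampling rate (hypothetical here)
num_samples = 5 * 60 * sample_rate   # a 5-minute recording
clips = clip_boundaries(num_samples, sample_rate)
print(len(clips))                    # 300 s / 10 s = 30 clips

# Random 3-second training crop taken from the first 10-second clip.
crop = random_crop_bounds(*clips[0], sample_rate, rng=random.Random(0))
```

With a waveform loaded as an array, each pair of offsets can be used directly as a slice, for example `waveform[start:end]`. Because the clip boundaries are consecutive multiples of the clip length, the 10-second clips do not overlap, while the 3-second crops are redrawn at random each time they are sampled.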