mimbres / neural-audio-fp

https://mimbres.github.io/neural-audio-fp
MIT License

Dimension of Zt #32

Closed: kasireddygariDineshKumarReddy closed this issue 2 years ago

kasireddygariDineshKumarReddy commented 2 years ago

"We generate segment-wise embeddings z_t ∈ Z that can represent a unit segment of audio from the acoustic features S at time step t." In this sentence, does each z_t have dimension d or dimension 1?

mimbres commented 2 years ago

@kasireddygariDineshKumarReddy z_t has dimension d. In the config file, EMB_SZ defines d: https://github.com/mimbres/neural-audio-fp/blob/058d812df3787a7e000c6f595e200fd2e15ee348/config/default.yaml#L47

kasireddygariDineshKumarReddy commented 2 years ago

Do you mean that the embedding of each unit segment (say, 1 second of audio) has dimension d = 128?

mimbres commented 2 years ago

Yes, d = 128.

kasireddygariDineshKumarReddy commented 2 years ago

In the NFP algorithm, the embeddings are given as

Z_k^(org) = g ∘ f(S_k)
Z_k^(rep) = g ∘ f(M(S_k))

and after the loop completes,

Z = {Z_1^(org), Z_1^(rep), ..., Z_(N/2)^(org), Z_(N/2)^(rep)}.

Is each of Z_k^(org) and Z_k^(rep) 128-dimensional, or is it Z, the collection of all these originals and replicas, that is 128-dimensional?

mimbres commented 2 years ago

Z_k^(*) is the k-th single element in the training batch, and it has shape (128,).
Z has shape (B, 128), where B is the training batch size.
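
As a rough illustration of these shapes, here is a minimal NumPy sketch. It is not code from the repository: `placeholder_embed` is a hypothetical stand-in for g ∘ f that returns random vectors, the batch size `B = 8` is an arbitrary choice, and the interleaved ordering simply mirrors the set Z as written in the question.

```python
import numpy as np

EMB_SZ = 128  # d, the embedding dimension (EMB_SZ in the config)
B = 8         # hypothetical training batch size, assumed even

rng = np.random.default_rng(0)

def placeholder_embed(num_items, dim=EMB_SZ):
    """Hypothetical stand-in for g ∘ f: one random dim-sized vector per item."""
    return rng.standard_normal((num_items, dim))

z_org = placeholder_embed(B // 2)  # embeddings of the B/2 originals
z_rep = placeholder_embed(B // 2)  # embeddings of the B/2 replicas

# Interleave so that row 2k holds an original and row 2k+1 its replica.
Z = np.empty((B, EMB_SZ))
Z[0::2] = z_org
Z[1::2] = z_rep

print(Z[0].shape)  # shape of a single element of the batch
print(Z.shape)     # shape of the whole batch
```

Each row of `Z` is one embedding of shape `(EMB_SZ,)`, and stacking B of them gives the `(B, EMB_SZ)` batch.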

kasireddygariDineshKumarReddy commented 2 years ago

Is augmentation performed before feature extraction, or after log-mel spectrogram feature extraction?

mimbres commented 2 years ago

Most of the augmentations, such as mixing in background noise, applying IR filters, and mixing in speech (not covered in the paper), are applied in the time domain, that is, on the waveform before feature extraction. For the augmentations applied in the spectral domain, see: https://github.com/mimbres/neural-audio-fp/tree/main/model/fp/specaug_chain
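
To make the ordering concrete, here is a minimal NumPy sketch of one time-domain augmentation (mixing in background noise at a target SNR) followed by feature extraction. It is an illustration under stated assumptions, not the repository's pipeline: `mix_at_snr` and `log_spectrogram` are hypothetical helpers, the 8 kHz sample rate, FFT size and hop are arbitrary, and a plain log-magnitude spectrogram stands in for the log-mel spectrogram.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal`, scaled so the mix has the requested SNR in dB."""
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + gain * noise

def log_spectrogram(x, n_fft=1024, hop=256):
    """Log-magnitude spectrogram (a simple stand-in for log-mel features)."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    mag = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    return np.log(mag + 1e-6)

rng = np.random.default_rng(0)
sr = 8000                                # assumed sample rate
t = np.arange(sr) / sr                   # one second of audio
clean = np.sin(2 * np.pi * 440 * t)      # toy "music" segment
noise = rng.standard_normal(sr)          # toy background noise

# Step 1: augment the waveform in the time domain.
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Step 2: extract spectral features from the augmented waveform.
spec = log_spectrogram(noisy)
print(spec.shape)  # (num_frames, n_fft // 2 + 1)
```

Any spectral-domain augmentation would then operate on `spec` rather than on the waveform.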