kasireddygariDineshKumarReddy closed this issue 2 years ago.
@kasireddygariDineshKumarReddy $z_t$ is of dimension $d$. In the config file, `EMB_SZ` defines $d$: https://github.com/mimbres/neural-audio-fp/blob/058d812df3787a7e000c6f595e200fd2e15ee348/config/default.yaml#L47
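A minimal sketch of the idea follows. It makes several assumptions. The encoder is a hypothetical stand-in: a random linear projection on raw samples with L2 normalisation, not the trained model on log-mel features. The 1 s segment length and 0.5 s hop are also hypothetical. The point is that each unit segment at time step $t$ maps to exactly one $d$-dimensional vector $z_t$, and stacking these vectors over a track gives a $(T, d)$ matrix.

```python
import numpy as np

D = 128          # d, set by EMB_SZ in the config
SR = 8000        # hypothetical sample rate
SEG = SR         # hypothetical unit segment: 1 s of audio
HOP = SR // 2    # hypothetical hop between segments: 0.5 s

rng = np.random.default_rng(0)
W = rng.standard_normal((SEG, D)) / np.sqrt(SEG)  # stand-in for the trained encoder

def embed(segment):
    """Hypothetical encoder: one unit segment -> one d-dimensional unit vector z_t."""
    z = segment @ W
    return z / np.linalg.norm(z)

audio = rng.standard_normal(5 * SR)  # 5 s of placeholder audio
starts = range(0, len(audio) - SEG + 1, HOP)
Z = np.stack([embed(audio[t : t + SEG]) for t in starts])  # one row z_t per time step t

print(Z.shape)  # (9, 128)
```

Each row of `Z` is one $z_t$ of shape (128,). The number of rows depends only on the track length and the hop.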
Do you mean each unit segment (let's say, 1 second of audio) is of dimension 128, or d?
Yes, $d = 128$ (in the default config).
In the NFP algorithm, it is given that $Z_k^{(org)} = g \circ f(S_k)$ and $Z_k^{(rep)} = g \circ f(M(S_k))$, and after the loop completes, $Z = \{Z_1^{(org)}, Z_1^{(rep)}, \ldots, Z_{N/2}^{(org)}, Z_{N/2}^{(rep)}\}$. Is each of $Z_k^{(org)}$ and $Z_k^{(rep)}$ 128-dimensional, or is $Z$, the combination of all these originals and replicas, 128-dimensional?
$Z_k^{(*)}$ is the k-th single element in the training batch, and it has shape (128,).
$Z$ will have shape (B, 128), where B is the training batch size (here B = N: N/2 originals plus N/2 replicas).
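A minimal NumPy sketch of the shapes in the training loop follows. All the pieces are hypothetical stand-ins. `g_of_f` is a random linear projection plus L2 normalisation in place of the trained $g \circ f$. `augment` adds Gaussian noise in place of $M(\cdot)$. The segment shape and N = 8 are arbitrary. Each $Z_k^{(*)}$ comes out as (128,), and the interleaved batch $Z$ comes out as (N, 128).

```python
import numpy as np

D = 128          # embedding size d (EMB_SZ)
N = 8            # hypothetical batch size: N/2 originals + N/2 replicas
SEG_SHAPE = (256, 32)                 # hypothetical shape of one feature segment S_k
F = SEG_SHAPE[0] * SEG_SHAPE[1]       # flattened size of S_k

rng = np.random.default_rng(0)
W = rng.standard_normal((F, D)) / np.sqrt(F)  # stand-in for the learned g∘f

def g_of_f(s):
    """Hypothetical stand-in for g∘f: one segment -> one L2-normalised d-dim vector."""
    z = s.reshape(-1) @ W
    return z / np.linalg.norm(z)

def augment(s):
    """Hypothetical stand-in for M(.): add small Gaussian noise."""
    return s + 0.1 * rng.standard_normal(s.shape)

segments = rng.standard_normal((N // 2, *SEG_SHAPE))  # S_1 ... S_{N/2}

Z = []
for s_k in segments:
    z_org = g_of_f(s_k)           # Z_k^(org), shape (128,)
    z_rep = g_of_f(augment(s_k))  # Z_k^(rep), shape (128,)
    Z.extend([z_org, z_rep])      # interleave org/rep as in the loop
Z = np.stack(Z)                   # shape (N, 128)

print(z_org.shape, Z.shape)  # (128,) (8, 128)
```

The 128 is the size of each single embedding. The batch dimension of `Z` grows with N.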
Is augmentation performed before feature extraction, or after log-mel spectrogram feature extraction?
Most of the augmentations, such as mixing background noise, applying impulse-response (IR) filters, and mixing speech (not covered in the paper), are processed in the time domain, i.e. before feature extraction. For augmentations in the spectral domain, see more details here: https://github.com/mimbres/neural-audio-fp/tree/main/model/fp/specaug_chain
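Here is a rough sketch of that ordering, assuming toy signals and a simplified front end. Background noise is mixed at a target SNR and an impulse response is applied to the waveform, and only then are spectral features computed. `mix_background`, `apply_ir` and `log_power_frames` are hypothetical helpers, not functions from the repository, and `log_power_frames` skips the mel filterbank entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 8000  # hypothetical sample rate

def mix_background(x, noise, snr_db):
    """Add noise to x, scaled so the signal-to-noise ratio equals snr_db."""
    noise = noise[: len(x)]
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return x + scale * noise

def apply_ir(x, ir):
    """Convolve with an impulse response, keeping the original length."""
    return np.convolve(x, ir)[: len(x)]

def log_power_frames(x, n_fft=1024, hop=256):
    """Rough stand-in for a log-mel front end: framed log power spectrum (no mel filterbank)."""
    frames = [x[i : i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return np.log(spec + 1e-10)

x = rng.standard_normal(SR)          # 1 s of placeholder "music"
noise = rng.standard_normal(2 * SR)  # placeholder background noise
ir = np.exp(-np.arange(400) / 50.0) * rng.standard_normal(400)  # toy decaying IR

x_aug = apply_ir(mix_background(x, noise, snr_db=10), ir)  # time-domain augmentation first
S_aug = log_power_frames(x_aug)                            # features computed afterwards
print(S_aug.shape)  # (28, 513)
```

Spectral-domain augmentations, if any, would then operate on `S_aug` rather than on the waveform.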
Regarding this line: "We generate segment-wise embeddings $z_t \in Z$ that can represent a unit segment of audio from the acoustic features $S$ at time step $t$." Is each $z_t$ of dimension $d$, or of dimension 1?