radioML / dataset

Open RadioML Synthetic Benchmark Dataset

Issue in the data normalization to unit energy #24


rutrilla commented 5 years ago

Hi there!

I've realized that in the dataset generation, the energy of the 128-sample data vectors is being normalized to unity as follows (lines 65 and 66):

[screenshot: energy normalization code at lines 65-66 of the generator, dividing sampled_vector by its energy]

However, to the best of my knowledge, the energy Es of a discrete-time signal x(n) is defined mathematically as:

Es = Σ_n |x(n)|^2

Once you have calculated Es, sampled_vector must be divided by the square root of the energy, not by the energy itself. In code, it should be something like this:

energy = np.sum(np.abs(sampled_vector) ** 2)
sampled_vector = sampled_vector / math.sqrt(energy)
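
For what it's worth, here is a minimal self-contained sketch of the point (the toy random vector and variable names are mine, just for illustration): dividing by the square root of the energy yields a vector with unit energy, whereas dividing by the energy itself does not.

import numpy as np
import math

# Toy 128-sample complex vector standing in for one dataset example (illustrative only)
rng = np.random.default_rng(0)
sampled_vector = (rng.standard_normal(128) + 1j * rng.standard_normal(128)) * 3.0

energy = np.sum(np.abs(sampled_vector) ** 2)

# Dividing by sqrt(energy) leaves a vector whose energy is exactly 1
unit_energy = sampled_vector / math.sqrt(energy)
print(np.sum(np.abs(unit_energy) ** 2))   # ~1.0

# Dividing by the energy itself leaves energy 1/energy, i.e. a much smaller (compressed) signal
compressed = sampled_vector / energy
print(np.sum(np.abs(compressed) ** 2))    # ~1/energy, far below 1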

I've plotted both versions and these are the results.

Before:

[plot: signals before the correction]

After:

[plot: signals after the correction]

The signals are therefore being unnecessarily compressed, which can make it harder for some models to extract meaningful information, or even prevent them from doing so altogether.

Do my findings make sense to you, or is there something I may not have understood properly? Please check it and let us know your conclusions when you get a chance.

I look forward to hearing from you.

Regards,

Ramiro Utrilla

rutrilla commented 5 years ago

Actually, in addition to the previous energy normalization, what really works for me is also scaling the IQ samples to between -1 and 1. This is what that part of my code looks like now:

# Normalize to unit energy
energy = np.sum(np.abs(sampled_vector) ** 2)
sampled_vector = sampled_vector / math.sqrt(energy)
# Scale the I and Q components into the [-1, 1] range
max_val = max(max(np.abs(sampled_vector.real)), max(np.abs(sampled_vector.imag)))
sampled_vector = sampled_vector / max_val

And this is what the signals look like after both normalization steps:

[plot: signals after both normalization steps]
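
If it helps anyone, here is a small self-contained sketch (the helper name normalize_iq and the toy input are just mine for illustration, not part of the generator) that applies both steps and checks the result:

import numpy as np
import math

def normalize_iq(sampled_vector):
    # Step 1: normalize to unit energy
    energy = np.sum(np.abs(sampled_vector) ** 2)
    sampled_vector = sampled_vector / math.sqrt(energy)
    # Step 2: scale I and Q into the [-1, 1] range
    max_val = max(np.max(np.abs(sampled_vector.real)), np.max(np.abs(sampled_vector.imag)))
    return sampled_vector / max_val

# Toy example: an arbitrarily scaled random complex vector standing in for one 128-sample example
rng = np.random.default_rng(1)
x = (rng.standard_normal(128) + 1j * rng.standard_normal(128)) * 5.0
y = normalize_iq(x)
print(np.max(np.abs(y.real)), np.max(np.abs(y.imag)))  # both <= 1.0

One thing to keep in mind is that the second step rescales the vector again, so after both steps the energy is no longer exactly 1; what you keep is the waveform shape plus a bounded [-1, 1] amplitude range.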

As far as I know, this kind of normalization is pretty common, since some models are more sensitive to the scale of the input data than others. Was there any reason not to do this in the dataset originally? Am I missing something?

It'd be great if someone could give further details on the best practices for normalizing this kind of data.

Regards,