vsitzmann / siren

Official implementation of "Implicit Neural Representations with Periodic Activation Functions"
MIT License

Issues with training on audio (not a bug with this repo) #46

Closed lostmsu closed 3 years ago

lostmsu commented 3 years ago

I reimplemented Siren in TensorFlow 2.5. The network easily learns images, but I cannot reproduce the results with audio. On the sample file from the paper, the loss gets stuck at a relatively high value (~0.0242) and the network's output is very quiet (max(abs(x)) ~= 0.012). Just curious whether anyone has faced the same issue when reimplementing Siren on their own.

What I've tried so far:

  1. Double-checked omega: it is set to 3000.0 for the input layer and 30.0 for each of the three inner layers.
  2. Changed the batch size to the full length of the sample (I had been using randomized batches of 8*1024).
  3. Switched to float64 to rule out numerical overflow/underflow.
  4. Checked the network weights: all are finite numbers.
  5. Switched to SGD as a more stable optimizer.
  6. Increased the network width / added more layers.

Essentially, all of the above still led to the same result, with the loss stuck at ~0.0242 (a sketch of my layer setup is below).
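For reference, a minimal sketch of the layer stack I describe above (my own illustration in TensorFlow/Keras, not code from this repo; the hidden width of 256 is an assumption):

```python
import numpy as np
import tensorflow as tf

class SineLayer(tf.keras.layers.Layer):
    """Dense layer with sin(omega_0 * (W x + b)) activation and SIREN-style init."""

    def __init__(self, units, omega_0=30.0, is_first=False, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.omega_0 = omega_0
        self.is_first = is_first

    def build(self, input_shape):
        fan_in = int(input_shape[-1])
        # SIREN initialization: U(-1/n, 1/n) for the first layer,
        # U(-sqrt(6/n)/omega_0, sqrt(6/n)/omega_0) for the rest.
        limit = 1.0 / fan_in if self.is_first else np.sqrt(6.0 / fan_in) / self.omega_0
        init = tf.keras.initializers.RandomUniform(-limit, limit)
        self.w = self.add_weight(name="w", shape=(fan_in, self.units), initializer=init)
        self.b = self.add_weight(name="b", shape=(self.units,), initializer="zeros")

    def call(self, x):
        return tf.sin(self.omega_0 * (tf.matmul(x, self.w) + self.b))

# Omega settings mentioned above: 3000.0 for the input layer, 30.0 for the three inner layers.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),            # time coordinate scaled to [-1, 1]
    SineLayer(256, omega_0=3000.0, is_first=True),
    SineLayer(256, omega_0=30.0),
    SineLayer(256, omega_0=30.0),
    SineLayer(256, omega_0=30.0),
    tf.keras.layers.Dense(1),              # linear output: predicted amplitude
])
```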

schreon commented 3 years ago

I also experienced instability during training until I used a very small learning rate (1e-5) from start to finish. Then train for a lot of epochs, since training is much slower with the small learning rate. Have you tried something like that already?
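Something like this, as a sketch (assuming a compiled Keras model named `model`, e.g. the one sketched above):

```python
import tensorflow as tf

# Small, constant learning rate from start to finish, traded for many more epochs.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="mse")
# model.fit(coords, amplitudes, batch_size=..., epochs=...)  # train longer to compensate
```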

lostmsu commented 3 years ago

@schreon I used the learning rate from the paper: 5e-5.

Never mind, I figured out why it was not training on audio, and it was entirely my fault: I had set the wrong shuffling mode. In my TensorFlow setup, model.fit was not shuffling the data, so I assume feeding the audio stream to the network sequentially threw the optimizer off course each time due to forgetting.
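A sketch of the fix (illustrative names; `model`, `coords`, and `amplitudes` are not from this repo):

```python
import numpy as np
import tensorflow as tf

num_samples = 100_000                        # placeholder: number of audio samples
coords = np.linspace(-1.0, 1.0, num_samples, dtype=np.float32)[:, None]
amplitudes = np.zeros((num_samples, 1), np.float32)  # placeholder for the waveform

# If feeding NumPy arrays, ask Keras explicitly to reshuffle every epoch:
# model.fit(coords, amplitudes, batch_size=8 * 1024, epochs=..., shuffle=True)

# If feeding a tf.data.Dataset, model.fit ignores its `shuffle` argument,
# so the dataset has to be shuffled before batching:
dataset = (
    tf.data.Dataset.from_tensor_slices((coords, amplitudes))
    .shuffle(buffer_size=num_samples, reshuffle_each_iteration=True)
    .batch(8 * 1024)
)
# model.fit(dataset, epochs=...)
```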

lostmsu commented 3 years ago

It also appears that you need to scale the input-layer omega for longer audio clips.

schreon commented 3 years ago

Yes. Have you found a good heuristic for scaling omega with differing input sizes yet? I believe we can scale it linearly per domain. For example, if you squeeze an audio clip twice as long as the one in the paper into [-1, 1], you end up with double the frequency, so doubling omega to omega_input = 6000 would make sense. If this works consistently, we would only have to find one "base omega" for each domain once.
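That heuristic in code (my own sketch; the function and argument names are illustrative):

```python
def scaled_input_omega(duration, base_duration, base_omega=3000.0):
    """Linear heuristic: squeezing a clip that is k times longer into [-1, 1]
    multiplies every frequency by k, so scale the first-layer omega by k too."""
    return base_omega * duration / base_duration

# A clip twice as long as the reference clip -> omega_input = 6000.0
# scaled_input_omega(2 * t_ref, t_ref)
```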

lostmsu commented 3 years ago

Yes, I noticed that. I wonder now whether it makes sense to make omega itself a trainable parameter on a log scale.
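One way that could look (my own sketch, reusing the SineLayer structure from above): store log(omega) as a trainable scalar and exponentiate it in the forward pass, so gradient updates act multiplicatively and omega stays positive.

```python
import numpy as np
import tensorflow as tf

class TrainableOmegaSineLayer(tf.keras.layers.Layer):
    """Sine layer whose omega_0 is learned in log space."""

    def __init__(self, units, initial_omega=30.0, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.initial_omega = initial_omega

    def build(self, input_shape):
        fan_in = int(input_shape[-1])
        limit = np.sqrt(6.0 / fan_in) / self.initial_omega
        self.w = self.add_weight(
            name="w", shape=(fan_in, self.units),
            initializer=tf.keras.initializers.RandomUniform(-limit, limit))
        self.b = self.add_weight(name="b", shape=(self.units,), initializer="zeros")
        # Trainable log(omega): keeps omega positive and makes updates multiplicative.
        self.log_omega = self.add_weight(
            name="log_omega", shape=(),
            initializer=tf.keras.initializers.Constant(float(np.log(self.initial_omega))))

    def call(self, x):
        omega = tf.exp(self.log_omega)
        return tf.sin(omega * (tf.matmul(x, self.w) + self.b))
```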