kahst / BirdNET-Analyzer

BirdNET analyzer for scientific audio data processing.

Error while trying to change signal length in config.py #470

Open AdityaPanigrahy opened 5 days ago

AdityaPanigrahy commented 5 days ago

Hello BirdNET team,

I have been trying for some time to change the prediction window length from 3 to 5 seconds. This is "SIG_LENGTH" in the config.py script. The sample rate stays at 48 kHz, with no prediction overlap. This works out to 240,000 samples per chunk. For a 10-minute audio recording, about 120 chunks are produced. I checked the getRawAudioFromFile() function in the analyze.py script, and the 5-second windows are produced correctly with the correct file offsets. However, the problem arises when I try to run any analyze.py command: I keep getting this error.

Error: _/tensorflow/tensorflow/lite/util.cc BytesRequired number of elements overflowed. Node number [number] (CONV_2D) failed to prepare. The sizesplits must sum to the dimension of value along the axis.

Chunk shape before prediction: (1,240000). These errors seem to indicate that the shape of the input tensor does not match the dimensions the model expects, and they may also suggest that the tensor is too large for the available memory. Also, FILE_SPLITTING_DURATION=600; could that be an issue when changing the value of SIG_LENGTH?
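The chunk arithmetic described in this post can be checked with a minimal sketch. The helper name `chunk_stats` is hypothetical and is not part of BirdNET-Analyzer; it only counts full windows and assumes any trailing partial window is ignored.

```python
def chunk_stats(sample_rate_hz, sig_length_s, recording_s, overlap_s=0.0):
    """Return (samples_per_chunk, number_of_full_chunks).

    Hypothetical helper for illustration only; it does not call any
    BirdNET-Analyzer code and ignores a trailing partial window.
    """
    if not 0 <= overlap_s < sig_length_s:
        raise ValueError("overlap must be >= 0 and shorter than the window")

    samples_per_chunk = int(round(sample_rate_hz * sig_length_s))

    # Each new window starts (window - overlap) seconds after the previous one.
    step_s = sig_length_s - overlap_s
    if recording_s < sig_length_s:
        full_chunks = 0
    else:
        full_chunks = int((recording_s - sig_length_s) // step_s) + 1

    return samples_per_chunk, full_chunks


# 5-second windows at 48 kHz over a 10-minute (600 s) recording
print(chunk_stats(48_000, 5, 600))  # (240000, 120)

# Default 3-second windows over the same recording
print(chunk_stats(48_000, 3, 600))  # (144000, 200)
```

The 5-second case gives the (1,240000) chunk shape reported above, while the default 3-second case gives 144,000 samples per chunk.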

Can anyone with more technical expertise help me understand what I am missing? Additionally, could I get an idea of the model's expected input shape, and any recommendations for adjusting the processing logic to ensure compatibility?

MacJudge commented 4 days ago

As far as I know, the model is trained to work with 3-second audio signals at 48 kHz and cannot make predictions for anything else. Changing the signal chunk length back to 3 seconds should make your error go away.

If you think your hardware cannot handle so many chunks at once (regarding memory), you could split your audio recordings before you feed them to the analyzer.
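As a rough sketch of that splitting idea, the piece boundaries can be computed up front. The function `split_offsets` is hypothetical (not a BirdNET-Analyzer function), and requiring each piece to be a multiple of the 3-second window is an assumption made here so that no window straddles a cut; writing the pieces to disk is left to whatever audio tool you use.

```python
def split_offsets(total_s, piece_s, window_s=3):
    """Return a list of (start, end) offsets in whole seconds.

    Hypothetical helper for illustration. piece_s must be a multiple of
    window_s so that every piece holds a whole number of analysis windows.
    """
    if piece_s <= 0 or piece_s % window_s != 0:
        raise ValueError("piece length must be a positive multiple of the window")
    return [
        (start, min(start + piece_s, total_s))
        for start in range(0, total_s, piece_s)
    ]


# A 10-minute recording cut into 150-second pieces (50 windows each)
print(split_offsets(600, 150))
# [(0, 150), (150, 300), (300, 450), (450, 600)]
```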

AdityaPanigrahy commented 4 days ago

Thanks for the reply. I totally agree with you: the 3-second window length works. But the bird that we are trying to study has quite a complex repertoire, which makes it difficult to get a number of vocalizations from long-term datasets. Hence the attempt at varying the detection window length.

I also realized something: if the expected input shape is (1,144000), i.e., 48000 * 3, the default, we can change our sample rate, say to 28.8 kHz, for 5 s detections (which gives the same 144000 samples). That worked for the general classifier but not for the custom-trained ones (still troubleshooting). But again, a lower sample rate has its own repercussions: the model may not train on really high frequencies, but it should be enough to capture between 1 and 10 kHz (the target vocalizations).
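To make the trade-off concrete, here is a minimal sketch of that calculation. It assumes the (1,144000) input shape discussed above; `sample_rate_for_window` is a hypothetical helper, not something in BirdNET-Analyzer.

```python
MODEL_INPUT_SAMPLES = 144_000  # assumed from the (1,144000) shape above


def sample_rate_for_window(window_s, input_samples=MODEL_INPUT_SAMPLES):
    """Return (sample_rate_hz, nyquist_hz) that fits window_s seconds
    of audio into a fixed number of input samples.

    Hypothetical helper for illustration only.
    """
    if window_s <= 0:
        raise ValueError("window length must be positive")
    sample_rate_hz = input_samples / window_s
    # Nyquist limit: the highest frequency representable at this rate.
    return sample_rate_hz, sample_rate_hz / 2


print(sample_rate_for_window(3))  # (48000.0, 24000.0)
print(sample_rate_for_window(5))  # (28800.0, 14400.0)
```

At 28.8 kHz the Nyquist limit is 14.4 kHz, so the 1 to 10 kHz target band is still representable in the resampled audio itself.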

So I wanted to know whether the model's input tensor shape, i.e. (1,144000), can be altered or whether it is fixed.

Mattk70 commented 4 days ago

This has been discussed before: https://github.com/kahst/BirdNET-Analyzer/issues/288#issuecomment-2063839251. The idea of varying the sample rate didn't come up there; it's an interesting one, but note that the model discards audio above 15 kHz. If you fed it 5 seconds @ 28.8 kHz, it would discard sounds over 9 kHz.
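The 9 kHz figure follows from scaling the cutoff by the ratio of sample rates, since the model treats its input as if it were 48 kHz audio. A minimal sketch, taking the 15 kHz cutoff from this comment as given (`effective_cutoff_hz` is a hypothetical name):

```python
def effective_cutoff_hz(actual_rate_hz,
                        nominal_rate_hz=48_000,
                        nominal_cutoff_hz=15_000):
    """Real-world frequency above which audio is discarded when samples
    recorded at actual_rate_hz are fed to a model that assumes
    nominal_rate_hz. The 15 kHz default is taken from this thread.
    """
    return nominal_cutoff_hz * actual_rate_hz / nominal_rate_hz


print(effective_cutoff_hz(48_000))  # 15000.0
print(effective_cutoff_hz(28_800))  # 9000.0
```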

tphakala commented 4 days ago

With the current latest model, v2.4, the input length is fixed to 3 seconds, but this video mentions that variable input length is planned for the v3.0 model: https://youtu.be/Faavvmi9JZw?t=798

AdityaPanigrahy commented 3 days ago

Understood, I had missed the earlier discussion. Thanks a lot for the help! Looking forward to the v3.0 model.