tensorflow / models

Models and examples built with TensorFlow

The result from audioset is different from the YouTube8M official code #9474

Closed gaoruiyang closed 1 year ago

gaoruiyang commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/audioset/vggish

2. Describe the bug

The result of this code is different from the result of the YouTube8M code at https://github.com/google/mediapipe/tree/master/mediapipe/examples/desktop/youtube8m

I read the paper and changed line 35 in vggish_params.py from EXAMPLE_HOP_SECONDS = 0.96 # with zero overlap. to EXAMPLE_HOP_SECONDS = 1.0, as shown below. (I'm not sure this is right, but if I don't change it, the results of the two methods differ even more.)
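For reference, the edited line in vggish_params.py (a sketch of the change described above; the comments are mine):

# vggish_params.py, line 35 -- the edit described above.
# Original: EXAMPLE_HOP_SECONDS = 0.96  # with zero overlap.
EXAMPLE_HOP_SECONDS = 1.0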

I tried a wav file with no sound, and the result is very close to the YouTube8M output. But when I use a wav file with sound, the results differ significantly.

3. Steps to reproduce

ffmpeg -i demo.mp4 -f wav -ar 16000 demo1.wav
ffmpeg -i demo.mp4 -f wav demo2.wav

This code:

(1) python vggish_inference_demo.py --wav_file demo1.wav
(2) python vggish_inference_demo.py --wav_file demo2.wav

YouTube8M code (just as the README.md in YouTube8M says):

python -m mediapipe.examples.desktop.youtube8m.generate_input_sequence_example \
  --path_to_input_video=demo.mp4 \
  --clip_end_time_sec=120

(3) GLOG_logtostderr=1 bazel-bin/mediapipe/examples/desktop/youtube8m/extract_yt8m_features \
  --calculator_graph_config_file=mediapipe/graphs/youtube8m/feature_extraction.pbtxt \
  --input_side_packets=input_sequence_example=/tmp/mediapipe/metadata.pb \
  --output_side_packets=output_sequence_example=/tmp/mediapipe/features.pb

4. Expected behavior

I expected the three methods (1), (2), and (3) to produce the same result, but all of them differ from each other. (1) is similar to (2), but both (1) and (2) are very different from (3), which shows no similarity at all. The value I take from this code is "pca_applied" in vggish_postprocess.py line 75, which should be the same as the final (unquantized) result from the YouTube8M code.
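To be concrete, the value I compare is the PCA-transformed embedding before quantization, which can be reproduced with something like this (a minimal sketch; it assumes the standard vggish_pca_params.npz file with the usual 'pca_eigen_vectors' and 'pca_means' entries, and a raw_embeddings array already produced by the demo):

import numpy as np

# Minimal sketch: apply only the PCA step of vggish_postprocess (the
# "pca_applied" value around line 75), skipping the 8-bit quantization.
# Assumes `raw_embeddings` is the [num_examples, 128] float VGGish output and
# that vggish_pca_params.npz holds the usual 'pca_eigen_vectors'/'pca_means'.
params = np.load('vggish_pca_params.npz')
pca_matrix = params['pca_eigen_vectors']        # shape [128, 128]
pca_means = params['pca_means'].reshape(-1, 1)  # shape [128, 1]

def apply_pca(raw_embeddings):
  return np.dot(pca_matrix, (raw_embeddings.T - pca_means)).T

# pca_applied = apply_pca(raw_embeddings)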

5. Additional context

null

6. System information

gaoruiyang commented 3 years ago

Can anyone answer this question, please?

plakal commented 3 years ago

We're looking into it, but it might take us some time to reproduce the issue.

Note that we are moving towards getting rid of the embedding post-processing completely on our end, i.e., we would no longer apply the PCA or quantization and would just provide a raw embedding which you can then post-process as you like. We haven't done that yet.

In the meantime, to help debugging, perhaps you could compare the raw VGGish output from our code (the 128-D output of the last fully-connected layer, before any post-processing) and the "vggish_matrix" stream in the mediapipe graph https://github.com/google/mediapipe/blob/master/mediapipe/graphs/youtube8m/feature_extraction.pbtxt. If the raw VGGish outputs match, then I think we are done, and we can fix this by getting rid of post-processing on our end and pointing people at the post-processing in YouTube-8M as an example.
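On the audioset side, the raw embeddings could be dumped with something like this (a minimal sketch based on vggish_inference_demo.py; it assumes the vggish_* modules and vggish_model.ckpt from this directory, and TF 1.x-style sessions via tf.compat.v1):

import numpy as np
import tensorflow.compat.v1 as tf

import vggish_input
import vggish_params
import vggish_slim

# Minimal sketch: run VGGish on a wav file and save the raw 128-D embeddings
# (the output of the last fully-connected layer), with no post-processing, so
# they can be compared against the "vggish_matrix" stream in mediapipe.
# Assumes vggish_model.ckpt is in the working directory.
examples = vggish_input.wavfile_to_examples('demo1.wav')

with tf.Graph().as_default(), tf.Session() as sess:
  vggish_slim.define_vggish_slim(training=False)
  vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
  features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
  embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
  [raw_embeddings] = sess.run([embedding_tensor],
                              feed_dict={features_tensor: examples})

np.save('raw_vggish_embeddings.npy', raw_embeddings)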

gaoruiyang commented 3 years ago

Thank you for answering my question! Actually, I've already tried this. I changed the pbtxt file (lines 247 to 278) to:

node {
  calculator: "TensorToMatrixCalculator"
  input_stream: "REFERENCE:log_mel_spectrum_magnitude_with_context"
  input_stream: "TENSOR:vggish_tensor"
  output_stream: "MATRIX:vggish_matrix"
  node_options: {
    [type.googleapis.com/mediapipe.TensorToMatrixCalculatorOptions] {
      time_series_header_overrides {
        num_channels: 128
        num_samples: 1
      }
    }
  }
}

node {
  calculator: "MatrixToVectorCalculator"
  input_stream: "vggish_matrix"
  output_stream: "pca_vggish_vf"
}

I compared the result with the output of the audioset code (vggish_inference_demo.py line 121), and they are sadly different (close, but not close enough).

mediapipe:

[[-0.0649360716342926 0.15292319655418396 0.17685073614120483 ... -0.21504393219947815 0.24558712542057037 0.15128350257873535]
 [-0.6392803192138672 0.02067597210407257 0.09340453147888184 ... -0.9527117013931274 -0.18397732079029083 0.0373111218214035]
 [-0.5412415266036987 -0.0742368996143341 0.18058447539806366 ... -0.8667834997177124 -0.13395510613918304 0.09358564019203186]
 ...
 [-0.9933363199234009 0.20506000518798828 0.5246546268463135 ... -1.2908496856689453 -0.11871233582496643 0.09388858079910278]
 [-0.8884260654449463 -0.01692095398902893 0.42346906661987305 ... -1.3252453804016113 -0.36304032802581787 0.10942845046520233]
 [-0.3542061746120453 0.05650855600833893 0.2131386697292328 ... -0.6611133813858032 -0.03726176917552948 -0.0018435120582580566]]

audioset:

[[-0.05297768 0.13873157 0.23349535 ... -0.21425793 0.18143931 0.07500704]
 [-0.58286357 0.0301965 0.1149358 ... -0.92747855 -0.1322018 0.03431585]
 [-0.45145878 -0.04883426 0.29334253 ... -0.8225013 -0.08002283 0.10850269]
 ...
 [-0.88727987 0.2280917 0.57952476 ... -1.1769722 -0.07860385 0.0640654 ]
 [-0.8278724 0.02014382 0.4345901 ... -1.2061305 -0.28000677 0.09429027]
 [-0.24750623 0.11636342 0.19648662 ... -0.5362742 0.01498885 -0.00770268]]

I would expect the element-wise difference to be below 0.01; that would be easier to accept.
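For what it's worth, a quick way to quantify the gap (a sketch; it assumes the two matrices above have been saved as .npy files with identical shape, one 128-D embedding per row):

import numpy as np

# Sketch: quantify the gap between the two embedding matrices above.
mediapipe_emb = np.load('mediapipe_vggish.npy')
audioset_emb = np.load('audioset_vggish.npy')

diff = np.abs(mediapipe_emb - audioset_emb)
print('max abs diff: ', diff.max())
print('mean abs diff:', diff.mean())
print('rows within 0.01 everywhere:',
      int((diff.max(axis=1) < 0.01).sum()), 'of', diff.shape[0])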

gaoruiyang commented 3 years ago


Are there any comments on this, please? @plakal

plakal commented 3 years ago

I haven't had time yet to debug this in more detail.

My suspicion is that our VGGish embedding might be sensitive to the resampling, so you will see differences between the two paths:

(1) our inference demo: you resample to wav using ffmpeg, we resample further using resampy, and then run inference;
(2) mediapipe: resampling is done via RationalFactorResampler, and then inference is run.

When I get some time to debug, I would try to see if the pure inference in python produces the same output as the inference in mediapipe, for the same waveform input without resampling.
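For example, on the Python side the resampling can be bypassed by handing vggish_input a waveform that is already at the model's sample rate (a minimal sketch; the .npy file name is hypothetical):

import numpy as np

import vggish_input
import vggish_params

# Minimal sketch: skip resampy by passing a waveform that is already at
# vggish_params.SAMPLE_RATE (16 kHz), so the only remaining difference with
# mediapipe is the inference path itself. `waveform` is a float array in [-1, 1].
waveform = np.load('shared_16khz_waveform.npy')
examples = vggish_input.waveform_to_examples(waveform, vggish_params.SAMPLE_RATE)
# ...then feed `examples` into the VGGish graph as in vggish_inference_demo.py.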

If there's a difference in inference for the same input, then there is some bug in how mediapipe is freezing and running the VGGish model.

If there's no difference there, then the difference you're seeing is due to the different resampling paths. In that case, I don't think we can do anything further since we don't guarantee that VGGish will produce the same embedding in these cases. The intent behind VGGish is that it can be used to feed a classifier, and such differences are not likely to be material to a downstream classifier.

Can you explain why you're being blocked by this? I.e., what breaks in your case because our inference demo program happens to produce a different output from what you get from mediapipe?

plakal commented 1 year ago

Apologies for the lack of updates, I haven't had much spare bandwidth to debug VGGish issues.

You're comparing (ffmpeg -> scipy -> resampy -> numpy -> TF) vs (mp4 -> audio extraction -> mediapipe resampler -> mediapipe TF wrapper -> TF). I expect that the core model itself is working the same in both cases but is seeing slightly different audio.

My inclination is to close this as working as intended. We don't guarantee identical outputs across different implementations of the input pipeline. The intention is to use the model to compute high-level semantic embeddings from audio and then use the embeddings in a classifier, and that should continue to work.

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?