rohitrango / objects-that-sound

Unofficial implementation of Google DeepMind's paper `Objects that Sound`

Self-supervision but actually using labels #4

Open pmorerio opened 5 years ago

pmorerio commented 5 years ago

Hi, this work should rely on self-supervision, where no labels are used for training, as stated in the README:

We DO NOT use the labels of the videos in any way during training, and only use them for evaluation.

However, labels are actually used when choosing the negative pairs for the contrastive loss. This is either a bug or an unfair training setup. That said, I reckon choosing negative pairs at random should not make much difference, given the high number of classes (roughly a p = 1/50 chance of drawing a false negative); see the back-of-envelope check below.
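
To make the 1/50 figure concrete, here is a small sanity check. The class count `C = 50` is an assumption mirroring the estimate in the comment above, not a number read out of the repo:

```python
# Hypothetical back-of-envelope check, assuming C roughly balanced classes.
C = 50
p_false_negative = 1 / C  # chance a uniformly random "negative" shares the anchor's class
print(f"~{p_false_negative:.0%} of random negatives would be false negatives")
# -> ~2% of random negatives would be false negatives
```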

rohitrango commented 5 years ago

The paper (and we) interpret it as NOT using discriminative labels like "guitar", "piano", etc., while still letting the neural net figure out the semantics on its own. Note that there is no clear boundary between instruments, and some videos have multiple sound sources too, making the task more difficult.

I hope this answers your question. If anything is still unclear, please let me know.

pmorerio commented 4 years ago

Hi, I respectfully disagree. Self-supervised methods are engineered precisely to avoid labeling. If you use labels, this goes against the spirit of self-supervision. Even if labels are not used directly to optimize the loss, that does not mean there is no label supervision. It is simply a different kind of supervision, in which labels are used in a different way (i.e. to choose the pairs). The paper actually states:

The labels for the positives (matching) and negatives (mismatched) pairs are obtained directly, as videos provide an automatic alignment between the visual and the audio streams – frame and audio coming from the same time in a video are positives, while frame and audio coming from different videos are negatives.

This seems to suggest that negative pairs are simply taken from a different video, with no need for labels.
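
For illustration, a minimal sketch of the label-free sampling the quote describes: a positive pairs a frame with audio from the same instant of the same video, while a negative pairs it with audio from a different, randomly chosen video. The `videos` container and its `duration` / `frame_at` / `audio_at` members are hypothetical placeholders, not this repo's actual API:

```python
import random

def sample_pair(videos, make_negative):
    """Return (frame, audio, target) without ever consulting class labels."""
    i = random.randrange(len(videos))
    t = random.uniform(0.0, videos[i].duration)
    frame = videos[i].frame_at(t)
    if not make_negative:
        # Positive: audio from the same instant of the same video.
        return frame, videos[i].audio_at(t), 1
    # Negative: audio from a *different* video, chosen uniformly at random.
    # No labels are used; with C roughly balanced classes, the negative
    # accidentally shares the anchor's class with probability about 1/C.
    j = random.choice([k for k in range(len(videos)) if k != i])
    s = random.uniform(0.0, videos[j].duration)
    return frame, videos[j].audio_at(s), 0
```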