Novicei opened this issue 2 years ago
Hi @Novicei, thanks for the good questions (sorry for the late reply). Actually, I am interested in improving the evaluation method, too.
I would like to know whether the 500 test data you mentioned in your paper were randomly selected. Did you choose them only once, or take the average over multiple selections? Is your database set up to ensure that it contains the 500 test data?
The selection of the 500 test songs was fixed (they are available in the dataset release, isolated from the training songs). The 2,000 test queries (test_ids) were also fixed, selected from the 500 test songs. So they were randomly sampled only once.
Because, in general, in an audio query task there should be a difference between the recall rate and the precision rate. I understand that the accuracy rate in your paper is the precision rate.
The evaluation metric I used was the Top-1 hit rate (%), defined as:

Top-1 hit rate (%) = 100 × (# of queries whose top-1 retrieved item is the relevant one) / (total # of queries)
In our setup, every query always has a single relevant original segment or sequence. Our DB is constructed as {test_500 + dummy_100K}.
As you mentioned, this can be thought of as Precision@1, where the FP (false positives) are the retrieved irrelevant segments. Going deeper, it depends on how the threshold is defined. Assuming a rank threshold, it is exactly P@1. Furthermore, it would be possible to define a score threshold (e.g., prob > 0.5) to calculate precision and recall.
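As a minimal sketch (not the repo's actual evaluation code), the Top-1 hit rate above reduces to a single comparison per query when each query has exactly one relevant item:

```python
import numpy as np

def top1_hit_rate(ranked_ids, gt_ids):
    """Top-1 hit rate (%), equal to Precision@1 under a rank threshold of 1.

    ranked_ids: (n_queries, k) retrieved segment ids, sorted by score
    gt_ids:     (n_queries,) ground-truth segment id per query
    """
    hits = ranked_ids[:, 0] == gt_ids
    return 100.0 * hits.mean()
```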
I'm using a program implemented by someone else in PyTorch; have you checked that program's implementation and results? When I tested it, I found that the accuracy of 1-second and 2-second query tests fluctuated greatly with different data. Did you find this as well?
I have read the code, but haven't run it yet. But yes, strong fluctuation may happen in my implementation, too.
And I think there is also the possibility of muted (silent) segments in 1-second and 2-second queries, which would have a greater impact on the results. So, are your 1-second and 2-second accuracy rates really as stated in your paper, or are they more volatile?
Regardless of 1s or 2s input, I agree that such silent or highly repetitive segments can greatly affect the test accuracy.
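For illustration, a hypothetical energy-based filter (not part of the paper or this repo) could be used to skip near-silent query segments before evaluation, assuming float mono audio in [-1, 1]:

```python
import numpy as np

def is_mostly_silent(segment, thresh_db=-40.0):
    """Return True if the segment's RMS level falls below thresh_db (dBFS).
    thresh_db = -40.0 is an assumed, tunable cutoff."""
    rms = np.sqrt(np.mean(segment ** 2))
    return 20.0 * np.log10(rms + 1e-12) < thresh_db
```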
unstable result: Unfortunately, the reproduced results (including this repo) differ from the paper's report. This inconsistency comes from the selection of the test set and other reasons, as discussed in #18. I might consider writing a note to account for this issue.

1s vs 2s: I haven't tried 2s yet with this repo. We can discuss the issue after seeing the '2s' result based on this repo. I will share the result soon.
Thanks for your reply, it cleared up a lot of my confusion. But I still have a question: as mentioned in the other question, there seems to be some problem with the way your queries are generated. First, since the offset is only 1.2, it cannot cover the full query use case; second, the 2-10 second queries are generated by combining 1-second segments. I have tried the query generation method from the PyTorch implementation: exactly 1 second is about 52%, and exactly 10 seconds is only about 88%. Do you think this is normal?
Oh, I'm using a model trained in PyTorch. This model produces similar results if I use a query-segment generation method similar to yours.
Did you build the database with 100,000 complete songs? I thought you were using 100k 30-second snippets.
@Novicei Thanks for the questions :) I believe this discussion will help provide a more accurate baseline for other follow-up research.
First, since the offset is only 1.2, it cannot cover the full query use case; second, the 2-10 second queries are generated by combining 1-second segments. I have tried the query generation method from the PyTorch implementation.
The difference in test setup between this repo and the PyTorch implementation is discussed in #14.
In the 1s segment test, the random offset is ±0.2s with hop = 0.5s, so it covers 80% of the possible starting points in the dataset.

Exactly 1 second is about 52%, and exactly 10 seconds is only about 88%. Do you think this is normal?
It's hard to say without testing with the same song queries. Perhaps the uniform offset (of the PyTorch implementation) makes for a harder task. If so, it may explain how @stdio2016 boosted the performance by test-time offset modulation. But again, it's hard to conclude this without using the same test setup.
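For concreteness, here is a minimal sketch (sample rate and function names are assumptions, not the exact code of either repo) contrasting the two offset schemes discussed above when generating a 1 s test query from audio indexed on a 0.5 s hop grid:

```python
import numpy as np

SR = 8000   # sample rate assumed here (as used by neural-audio-fp)
HOP = 0.5   # database segment hop (s)
SEG = 1.0   # query length (s)

def make_query(audio, seg_index, scheme, rng):
    """Cut a 1 s query starting near the seg_index-th grid point."""
    if scheme == "repo":   # this repo: random offset in +-0.2 s
        offset = rng.uniform(-0.2, 0.2)   # covers 0.4 / 0.5 = 80% of starts
    else:                  # PyTorch implementation: uniform offset in [0, 0.5) s
        offset = rng.uniform(0.0, HOP)
    start = int(round((seg_index * HOP + offset) * SR))
    start = max(0, min(start, len(audio) - int(SEG * SR)))
    return audio[start:start + int(SEG * SR)]
```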
Did you build the database with 100,000 complete songs? I thought you were using 100k 30-second snippets.
For the test, dummy_100k_full in the dataset consists of full-length songs. The 30s snippets are only used for training. And of course, there's no overlap between the train/test datasets.
TODO: share the '2s' test result based on this repo.
Sorry, I meant that each test query is N seconds of contiguous samples from the test music, not N uniform offsets from [0s, 0.5s).
@stdio2016 Got it. In my implementation, though, it is equivalent to N uniform offsets for the query.
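As a hypothetical illustration of the point above (sample rate and names assumed, not taken from either repo), slicing an N-second contiguous query into 1 s windows on a 0.5 s hop shows that every window inherits the same sub-hop offset from the query's starting point:

```python
import numpy as np

SR = 8000   # sample rate assumed here
HOP = 0.5   # window hop (s)
SEG = 1.0   # window length (s)

def slice_query(query):
    """Slice a contiguous N-second query into overlapping 1 s windows.
    All windows share the query's starting offset relative to the DB grid."""
    win, hop = int(SEG * SR), int(HOP * SR)
    n = 1 + max(0, (len(query) - win) // hop)
    return np.stack([query[i * hop : i * hop + win] for i in range(n)])
```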