Novicei opened this issue 2 years ago
Hi @Novicei, thanks for the good questions (sorry for the late reply). Actually, I am interested in improving the evaluation method, too.
I would like to know whether the 500 test data you mentioned in your paper were randomly selected. Did you choose them only once, or take the average over multiple selections? Is your database set up to ensure that it contains the 500 test data?
The selection of the 500 test songs was fixed (they are available in the dataset release, isolated from the training songs). The 2,000 test queries (test_ids) were also fixed, selected from the 500 test songs. So they were randomly sampled only once.
Because, in general, in an audio query task there should be a difference between the recall rate and the precision rate. I understand that the accuracy rate in your paper is the precision rate.
The evaluation metric I used was the Top-1 hit rate (%), defined as:

Top-1 hit rate (%) = 100 × (# of queries whose top-1 retrieved item is the relevant one) / (total # of queries)
In our setup, every query always has a single relevant original segment or sequence. Our DB is constructed as {test_500 + dummy_100K}.
As you mentioned, this can be thought of as Precision@1, where the FP (false positives) are the retrieved irrelevant segments. Going deeper, it depends on how the threshold is defined. Assuming a rank threshold, it is exactly P@1. Furthermore, it would be possible to define a score threshold (e.g., prob > 0.5) to calculate precision and recall.
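As a minimal sketch (not the repo's actual evaluation code), the Top-1 hit rate above reduces to a single comparison per query when each query has exactly one relevant item:

```python
import numpy as np

def top1_hit_rate(ranked_ids, gt_ids):
    """Top-1 hit rate (%), equal to Precision@1 under a rank threshold of 1.

    ranked_ids: (n_queries, k) retrieved segment ids, sorted by score
    gt_ids:     (n_queries,) ground-truth segment id per query
    """
    hits = ranked_ids[:, 0] == gt_ids
    return 100.0 * hits.mean()
```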
I'm using a program implemented by someone else in PyTorch; have you checked that program's implementation and results? When I tested it, I found that the accuracy of 1-second and 2-second query tests fluctuated greatly with different data. Did you find this as well?
I have read the code, but haven't run it yet. But yes, strong fluctuation may happen in my implementation, too.
And I think there is also the possibility of muted (silent) segments in 1-second and 2-second queries, which would have a greater impact on the results. So, are your 1-second and 2-second accuracy rates really as stated in your paper, or are they more volatile?
Regardless of 1s or 2s input, I agree that such silent or highly repetitive segments can greatly affect the test accuracy.
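For illustration, a hypothetical energy-based filter (not part of the paper or this repo) could be used to skip near-silent query segments before evaluation, assuming float mono audio in [-1, 1]:

```python
import numpy as np

def is_mostly_silent(segment, thresh_db=-40.0):
    """Return True if the segment's RMS level falls below thresh_db (dBFS).
    thresh_db = -40.0 is an assumed, tunable cutoff."""
    rms = np.sqrt(np.mean(segment ** 2))
    return 20.0 * np.log10(rms + 1e-12) < thresh_db
```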
unstable result: Unfortunately, the reproduced results (including this repo) differ from the paper's report. This inconsistency comes from the selection of the test set and other reasons, as discussed in #18. I might consider writing a note to account for this issue.

1s vs 2s: I haven't tried 2s yet with this repo. We can discuss the issue after seeing the '2s' result based on this repo. I will share the result soon.
Thanks for your reply, it cleared up a lot of my confusion. But I still have a question: as mentioned in the other question, there seems to be some problem with the way your queries are generated. First, since the offset is only 1.2, it cannot cover the full query use case; second, the 2-10 second queries are generated by combining 1-second segments. I have tried the query generation method from the PyTorch implementation: exactly 1 second is about 52%, and exactly 10 seconds is only about 88%. Do you think this is normal?
Oh, I'm using a model trained in PyTorch. This model produces similar results if I use a query-segment generation method similar to yours.
Did you build the database with 100,000 complete songs? I thought you were using 100k 30-second snippets.
@Novicei Thanks for the questions :) I believe this discussion will help provide a more accurate baseline for other follow-up research.
First, since the offset is only 1.2, it cannot cover the full query use case; second, the 2-10 second queries are generated by combining 1-second segments. I have tried the query generation method from the PyTorch implementation.
The difference in test setup between this repo and the PyTorch implementation is discussed in #14.
In the 1s segment test, the random offset is ±0.2s with hop = 0.5s, so it covers 80% of the possible starting points in the dataset.

Exactly 1 second is about 52%, and exactly 10 seconds is only about 88%. Do you think this is normal?
It's hard to say without testing with the same song queries. Perhaps the uniform offset (of the PyTorch implementation) makes for a harder task. If so, it may explain how @stdio2016 boosted the performance by test-time offset modulation. But again, it's hard to conclude this without using the same test setup.
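For concreteness, here is a minimal sketch (sample rate and function names are assumptions, not the exact code of either repo) contrasting the two offset schemes discussed above when generating a 1 s test query from audio indexed on a 0.5 s hop grid:

```python
import numpy as np

SR = 8000   # sample rate assumed here (as used by neural-audio-fp)
HOP = 0.5   # database segment hop (s)
SEG = 1.0   # query length (s)

def make_query(audio, seg_index, scheme, rng):
    """Cut a 1 s query starting near the seg_index-th grid point."""
    if scheme == "repo":   # this repo: random offset in +-0.2 s
        offset = rng.uniform(-0.2, 0.2)   # covers 0.4 / 0.5 = 80% of starts
    else:                  # PyTorch implementation: uniform offset in [0, 0.5) s
        offset = rng.uniform(0.0, HOP)
    start = int(round((seg_index * HOP + offset) * SR))
    start = max(0, min(start, len(audio) - int(SEG * SR)))
    return audio[start:start + int(SEG * SR)]
```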
Did you build the database with 100,000 complete songs? I thought you were using 100k 30-second snippets.
For the test, dummy_100k_full in the dataset consists of full-length songs. The 30s snippets are only used for training. And of course, there's no overlap between the train/test datasets.
TODO: share the '2s' test result based on this repo.
Sorry, I meant that each test query is N seconds of contiguous samples from the test music, not N uniform offsets from [0s, 0.5s).
@stdio2016 Got it. In my implementation, though, it is equivalent to N uniform offsets for the query.
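As a hypothetical illustration of the point above (sample rate and names assumed, not taken from either repo), slicing an N-second contiguous query into 1 s windows on a 0.5 s hop shows that every window inherits the same sub-hop offset from the query's starting point:

```python
import numpy as np

SR = 8000   # sample rate assumed here
HOP = 0.5   # window hop (s)
SEG = 1.0   # window length (s)

def slice_query(query):
    """Slice a contiguous N-second query into overlapping 1 s windows.
    All windows share the query's starting offset relative to the DB grid."""
    win, hop = int(SEG * SR), int(HOP * SR)
    n = 1 + max(0, (len(query) - win) // hop)
    return np.stack([query[i * hop : i * hop + win] for i in range(n)])
```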