gitmehrdad / FACE

Urban Sound Annotation and Classification
GNU General Public License v3.0

According to the implementation, this research has not considered the data-splitting constraints mentioned for the URBANSOUND8K DATASET. #1

Closed: chamathsilva closed this issue 5 months ago

chamathsilva commented 1 year ago

The URBANSOUND8K DATASET page specifically mentions the following:

BEFORE YOU DOWNLOAD: AVOID COMMON PITFALLS! Since releasing the dataset we have noticed a couple of common mistakes that could invalidate your results, potentially leading to manuscripts being rejected or the publication of incorrect results. To avoid this, please read the following carefully:

  1. Don't reshuffle the data! Use the predefined 10 folds and perform 10-fold (not 5-fold) cross validation. The experiments conducted by the vast majority of publications using UrbanSound8K (by ourselves and others) evaluate classification models via 10-fold cross validation using the predefined splits*. We strongly recommend following this procedure.

Why? If you reshuffle the data (e.g. combine the data from all folds and generate a random train/test split) you will be incorrectly placing related samples in both the train and test sets, leading to inflated scores that don't represent your model's performance on unseen data. Put simply, your results will be wrong. Your results will NOT be comparable to previous results in the literature, meaning any claims to an improvement on previous research will be invalid. Even if you don't reshuffle the data, evaluating using different splits (e.g. 5-fold cross validation) will mean your results are not comparable to previous research.

  2. Don't evaluate just on one split! Use 10-fold (not 5-fold) cross validation and average the scores. We have seen reports that only provide results for a single train/test split, e.g. train on folds 1-9, test on fold 10 and report a single accuracy score. We strongly advise against this. Instead, perform 10-fold cross validation using the provided folds and report the average score.

Why? Not all the splits are as "easy". That is, models tend to obtain much higher scores when trained on folds 1-9 and tested on fold 10, compared to (e.g.) training on folds 2-10 and testing on fold 1. For this reason, it is important to evaluate your model on each of the 10 splits and report the average accuracy. Again, your results will NOT be comparable to previous results in the literature.

More details: https://urbansounddataset.weebly.com/urbansound8k.html
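
For reference, the fold-respecting evaluation the dataset authors describe can be summarized in a minimal sketch like the one below. The metadata path and the stub feature extractor are assumptions for illustration, not this repository's actual pipeline:

```python
# Minimal sketch of 10-fold cross-validation using the predefined UrbanSound8K folds.
# The metadata path and extract_features() stub are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")

def extract_features(df):
    # Stub: replace with real per-clip audio features (e.g. MFCCs).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(df), 40)), df["classID"].to_numpy()

scores = []
for test_fold in range(1, 11):                     # the 10 predefined folds
    train_meta = meta[meta["fold"] != test_fold]   # train on the other nine folds
    test_meta = meta[meta["fold"] == test_fold]    # test on the held-out fold

    X_train, y_train = extract_features(train_meta)
    X_test, y_test = extract_features(test_meta)

    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print(f"mean accuracy over the 10 folds: {np.mean(scores):.3f}")
```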

gitmehrdad commented 1 year ago

Thank you for your comment. While the mentioned issue does not apply to the audio annotation task, 10-fold cross-validation is indeed necessary when comparing a proposed audio classifier against similar prior work. However, as our paper and our report on paperswithcode.com make clear, we never intended to compare our results with works that use 10-fold cross-validation, because our work focuses on demonstrating how a context-aware approach can solve the audio annotation and classification tasks.

chamathsilva commented 1 year ago

Even though this work does not aim to compare its results with works that use 10-fold cross-validation, the dataset page explicitly mentions the following point, which should be taken into consideration:

"If you reshuffle the data (e.g., combine the data from all folds and generate a random train/test split), you will be incorrectly placing related samples in both the train and test sets. This can lead to inflated scores that do not accurately represent your model's performance on unseen data. In other words, your results may be misleading or incorrect."

It is crucial to understand the importance of preserving the integrity of the data during the train/test split process. Failing to do so can result in erroneously including related samples in both the training and testing sets, which can artificially inflate the performance scores of the model.
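To illustrate the point about related samples: UrbanSound8K slices are cut from longer source recordings, so any custom split should at least keep slices from the same recording on one side only. A rough sketch, assuming the standard metadata CSV and its fsID column (which identifies the source recording), using scikit-learn's GroupKFold:

```python
# Sketch: if the predefined folds are not used, group by source recording (fsID)
# so that slices from the same recording never appear in both train and test.
# The metadata path is an assumption; feature extraction and training are omitted.
import pandas as pd
from sklearn.model_selection import GroupKFold

meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")

for train_idx, test_idx in GroupKFold(n_splits=10).split(
    meta, meta["classID"], groups=meta["fsID"]
):
    train_meta, test_meta = meta.iloc[train_idx], meta.iloc[test_idx]
    # No source recording is shared between the two sides of the split.
    assert set(train_meta["fsID"]).isdisjoint(set(test_meta["fsID"]))
```

Even so, as the dataset authors note, only the predefined folds make results comparable to prior work.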

sulaimanvesal commented 5 months ago

I have a similar comment here: how can you make sure there is no leakage between the validation and training sets, given that features are extracted and learned from the entire dataset?
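
For what it's worth, one common way to rule out this kind of leakage is to fit any learned preprocessing on the training split only, e.g. inside a scikit-learn pipeline. The random features and the classifier below are illustrative stand-ins, not the repository's actual pipeline:

```python
# Sketch: learned preprocessing (here a StandardScaler) fitted on the training
# split only, so no statistics from validation data leak into training.
# The random arrays stand in for extracted audio features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))      # stand-in for per-clip feature vectors
y = rng.integers(0, 10, size=200)   # stand-in for the 10 UrbanSound8K classes

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)                    # scaler statistics from train only
print("validation accuracy:", model.score(X_val, y_val))
```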

gitmehrdad commented 5 months ago

Thank you for your comment. As explained in the paper and evident in the implemented code, the test set, validation set, and training set are entirely distinct, with no overlap among them. As for preserving the dataset's default folds, it depends on the objective: if the aim is to address general classification, then yes; if the goal is to tackle context-aware classification, then no. The default folds introduce a challenging corner case, which can be beneficial when addressing the classification problem. However, when employing a context-aware method, it is essential to examine the typical case rather than the corner case, as a consequence of the Central Limit Theorem. In fact, my paper aims to demonstrate how solving machine-learning problems through a context-aware approach can improve accuracy.
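
As an aside, the no-overlap claim is easy to check mechanically. A minimal sketch, assuming each split is available as a collection of slice file names (the variables and file names below are hypothetical, not the repository's actual data structures):

```python
# Sketch: verify that the train/validation/test splits share no files.
# The file names are illustrative placeholders.
train_files = {"100032-3-0-0.wav", "100263-2-0-117.wav"}
val_files = {"100648-1-0-0.wav"}
test_files = {"100795-3-0-0.wav"}

assert train_files.isdisjoint(val_files)
assert train_files.isdisjoint(test_files)
assert val_files.isdisjoint(test_files)
print("no overlap among train/validation/test")
```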