orcasound / orcaal

Orca detection Active Learning (Orca AL) tool -- https://ai4orcas.net/portfolio/orca-al/
https://orcasound.github.io/orcaal/
MIT License

Improving existing test suite #25

Open Benjamintdk opened 2 years ago

Benjamintdk commented 2 years ago

Existing tests only provide coverage for the API endpoints and for the ability to create database objects using the model schemas. Adding model-specific tests, such as behavioral, invariance, or regression tests, will help provide baseline sanity checks, in addition to performance metrics, for new models generated during re-training or developed from scratch. Some references for doing so are listed below:
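As a rough illustration of what a model regression test could look like (a sketch only, assuming a pytest setup, a Keras-style binary classifier, and hypothetical `load_latest_model` / `load_holdout_set` helpers plus a made-up accuracy baseline):

```python
# Sketch only: these helpers do not exist in orcaal; they stand in for whatever
# the project uses to load the re-trained model and a fixed, versioned holdout set.
import numpy as np

ACCURACY_FLOOR = 0.85  # assumed baseline, e.g. the previous model's holdout accuracy


def test_retrained_model_meets_baseline():
    model = load_latest_model()              # hypothetical helper
    clips, labels = load_holdout_set()       # hypothetical helper: fixed evaluation clips
    probs = model.predict(clips).ravel()     # assumes a Keras-style binary classifier
    preds = (probs > 0.5).astype(int)
    accuracy = float(np.mean(preds == labels))
    assert accuracy >= ACCURACY_FLOOR, (
        f"re-trained model accuracy {accuracy:.3f} fell below baseline {ACCURACY_FLOOR}"
    )
```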

yosoyjay commented 2 years ago

@Benjamintdk Yes, you are correct that the testing is very much focused on ensuring that the APIs and database objects work as designed.

Given what you've read, do you have some specific tests in mind that you think would be good for this particular application?

Benjamintdk commented 2 years ago

Hi @yosoyjay, I was looking through the CheckList paper and felt that a lot of the ideas introduced there could be really relevant, even though the paper was about NLP-specific cases. Some thoughts I had:

yosoyjay commented 2 years ago
> • Invariance tests can include mixing different types of “non-orca” noise such as vessel, water or other whale noise with “orca” ones, and then testing the model to ensure that it still predicts a positive label. This is somewhat similar to mixup augmentation used in other computer vision tasks.

Absolutely. This aligns nicely with some of the other potential tasks of implementing detectors for other noises common in the environment. What kinds of statistical tests do you think could be used to describe some of these so we can make it quantitative?
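A minimal sketch of that invariance check, assuming a hypothetical `predict_proba` wrapper around the deployed model and waveforms supplied as pytest fixtures (none of this is existing orcaal code):

```python
# Invariance test sketch: adding background noise to a known orca-positive clip
# should not flip the prediction. `predict_proba`, `orca_clip`, and
# `vessel_noise_clip` are hypothetical placeholders (e.g. pytest fixtures).
import numpy as np

THRESHOLD = 0.5


def mix(orca_clip: np.ndarray, noise_clip: np.ndarray, noise_level: float = 0.3) -> np.ndarray:
    """Linearly mix a noise waveform into an orca waveform (same sample rate assumed)."""
    n = min(len(orca_clip), len(noise_clip))
    return (1.0 - noise_level) * orca_clip[:n] + noise_level * noise_clip[:n]


def test_prediction_invariant_to_added_vessel_noise(orca_clip, vessel_noise_clip):
    assert predict_proba(orca_clip) > THRESHOLD      # sanity check: the clean clip is positive
    mixed = mix(orca_clip, vessel_noise_clip, noise_level=0.3)
    assert predict_proba(mixed) > THRESHOLD, "label flipped after adding vessel noise"
```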

> • Directional expectation tests can include increasing or decreasing the amount of "orca" sound in the examples, and ensuring that the model correspondingly makes more or less confident predictions. We could also combine examples containing different orca sounds and probably expect more confident predictions, and perhaps the converse when we shorten the duration of "orca" sound (so if a clip has 3 seconds' worth of "orca" sound, we could crop 1 second of it out, combine it with two other segments of "no orca" sound, and expect a drop in confidence).

Great ideas. These are exactly the kinds of tests that one would implement to identify potential model drift. But they could also be used in a diagnostic manner to help identify when the models don't perform so well.
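A corresponding sketch of the directional expectation idea, under the same assumptions (hypothetical `predict_proba` wrapper and fixture-provided waveforms; illustrative only):

```python
# Directional expectation test sketch: replacing most of the orca sound in a clip
# with background noise should not increase the model's confidence.
import numpy as np


def crop_and_pad(orca_clip: np.ndarray, noise_clip: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Keep the first `keep_fraction` of the orca clip and fill the remainder with noise
    (assumes the noise clip is at least as long as the orca clip)."""
    keep = int(len(orca_clip) * keep_fraction)
    out = noise_clip[: len(orca_clip)].copy()
    out[:keep] = orca_clip[:keep]
    return out


def test_less_orca_sound_gives_lower_confidence(orca_clip, noise_clip):
    full = predict_proba(orca_clip)                                   # hypothetical wrapper
    reduced = predict_proba(crop_and_pad(orca_clip, noise_clip, keep_fraction=0.33))
    assert reduced <= full, "confidence should drop when most of the orca sound is removed"
```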

Benjamintdk commented 2 years ago

> Absolutely. This aligns nicely with some of the other potential tasks of implementing detectors for other noises common in the environment. What kinds of statistical tests do you think could be used to describe some of these so we can make it quantitative?

Hmm, I'm not entirely certain about how to apply statistical tests in this context @yosoyjay. Are you perhaps referring to using an older and a newer model to predict on these test cases, then performing hypothesis testing between the results obtained by the 2 models (could be accuracy, F1, etc.) to determine whether the difference is statistically significant, so that we can be confident that model performance has actually improved with re-training?
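As a sketch of what I mean (illustrative only, not existing orcaal code): one common choice for comparing two classifiers on the same test set is McNemar's paired test, assuming we have aligned 0/1 predictions from the old and re-trained models:

```python
# Paired comparison sketch: McNemar's test on per-clip correctness of the old vs.
# re-trained model over the same test set. Inputs are assumed to be aligned 0/1 arrays.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar


def compare_models(old_preds: np.ndarray, new_preds: np.ndarray, labels: np.ndarray) -> float:
    old_correct = old_preds == labels
    new_correct = new_preds == labels
    # 2x2 table of agreement/disagreement; the off-diagonal (discordant) counts drive the test
    table = [
        [np.sum(old_correct & new_correct), np.sum(old_correct & ~new_correct)],
        [np.sum(~old_correct & new_correct), np.sum(~old_correct & ~new_correct)],
    ]
    result = mcnemar(table, exact=True)
    return float(result.pvalue)  # small p-value => the two models' error patterns differ significantly
```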

> Great ideas. These are exactly the kinds of tests that one would implement to identify potential model drift. But they could also be used in a diagnostic manner to help identify when the models don't perform so well.

Ahh I see. Something else that I'm potentially concerned about is deciding the number of test cases/examples to use. As you mentioned regarding model drift, I guess these tests will have to be updated over time as well, unlike in traditional software engineering where unit tests can be more or less static.