Training data - Githubissues

cdhasselerharm commented 5 years ago

I've been training my own model using camera trap images from 3 different sites with similar species, giving about 240,000 images of 25 different groups/species. When I tested the model on images from one of the training sites but from different cameras, prediction accuracy was very low. 99% of the predictions fell into the two groups for which I had the most training data, and even for test images of those two groups, classification accuracy was low. The same thing occurred when I tested the model on a random subset of images (30 of each species) that had been used to train the model, so there is a serious issue with my model.

So my questions are: Was each training image used in the manuscript identified in isolation? I think this may be the main factor affecting my results, because in a sequence of images the animal often moves mostly out of shot and is unrecognizable from that image alone, but is identified from the previous image.

My other question relates to rare species for which you have a low number of training images (e.g. < 1,000). Is it best to simply exclude rare species when training a model?

Thanks for your help in advance!

mikeyEcology commented 5 years ago

1) We also identified images from sequences in the paper, so that (for example) only a leg or snout was visible in an image and the only way the human knew which animal it was came from the sequence. The model was able to perform well on these images.

2) We excluded species (or groups of species) for which we had fewer than 2,000 images. Recall (accuracy) decreases as the number of training images for a species decreases (see Fig. 3 here: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13120). It isn't surprising that 30 images of each species do not work for training a model. One thing that might work for a small sample size is to try increasing the depth of your neural network. You can do this with the argument depth in train.
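
As a rough illustration of the exclusion step described above, here is a minimal Python sketch that counts the images per class and keeps only the classes at or above a threshold. It is not part of MLWIC; the function name filter_rare_classes and the plain list of labels are assumptions made for the example, and the 2,000 default is simply the cutoff mentioned in this comment.

```python
from collections import Counter

# Cutoff quoted in the comment above; adjust to your own data.
MIN_IMAGES = 2000


def filter_rare_classes(labels, min_images=MIN_IMAGES):
    """Split a label list into kept image indices and dropped classes.

    labels: one class label per image (hypothetical input format).
    Returns (kept_indices, dropped), where dropped maps each excluded
    class to the number of images it had.
    """
    counts = Counter(labels)
    kept_classes = {c for c, n in counts.items() if n >= min_images}
    kept_indices = [i for i, c in enumerate(labels) if c in kept_classes]
    dropped = {c: n for c, n in counts.items() if n < min_images}
    return kept_indices, dropped


if __name__ == "__main__":
    # Tiny made-up example with a threshold of 2 instead of 2,000.
    example = ["deer"] * 5 + ["fox"] * 2 + ["owl"]
    kept, dropped = filter_rare_classes(example, min_images=2)
    print(len(kept), "images kept; dropped classes:", dropped)
```

The indices returned can then be used to subset whatever file listing the labels came from before training.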

cdhasselerharm commented 5 years ago

OK, that's good to know. It also doesn't seem like your model is affected by large imbalances in training data.
Thanks. I will remove the rare species and try increasing the depth. Apologies if I wasn't clear. What I meant was that I ran classify on a subset of the same images used in training the model (30 of each species) and that gave similar poor results.



mikeyEcology commented 5 years ago

I don't know that we can conclude that it's not affected. The species with fewer images have lower accuracies, but this could be because there isn't enough data to train the model rather than because of the imbalance.
I understand better now. 30 of each species for testing (classifying) is also a pretty low number to assess recall for that species. The model is likely not going to perform well with small datasets: this is a limitation of the method. But removing those species that have few images will help.
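
To make the point about small test sets concrete, here is a minimal Python sketch, not taken from MLWIC, that computes per-species recall from true and predicted labels and puts a Wilson score interval around it. The function names and the list-based input format are assumptions for illustration only. With 30 test images the interval around an observed recall of 0.8 spans well over 20 percentage points, whereas with 3,000 images it is only a few points wide.

```python
import math
from collections import defaultdict


def per_class_counts(true_labels, predicted_labels):
    """Return {class: (n_correct, n_total)} over the true classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {c: (correct[c], total[c]) for c in total}


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


if __name__ == "__main__":
    # Same observed recall (0.8), very different certainty.
    for n in (30, 3000):
        low, high = wilson_interval(int(0.8 * n), n)
        print(f"n={n}: recall 0.80, interval {low:.2f} to {high:.2f}")
```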

mikeyEcology / MLWIC

Training data #21