olgaliak / active-learning-detect

Active learning + object detection
MIT License

Early training bias due to random sampling with uneven # from sites or cameras #34

Open abfleishman opened 5 years ago

abfleishman commented 5 years ago

Another thought I have been pondering with the active-learning pipeline: how do you avoid biasing your detector toward the types of images that were labeled first? For instance, say I have 10 cameras, and each camera has taken a different number of images, from 1,000 to 100,000, over a one-year deployment. If you label 100 randomly selected images to start with, the majority will come from the camera that took the most images, and maybe the background in that camera is distinct. If you train a model on those initial 100 images, it may be strongly biased toward detecting objects in images from that camera (because of some characteristic of those images). Images from the other cameras might not even get detections and might never be "served" to the person tagging. Essentially I see it as the same idea as class imbalance, but here it is an imbalance in the raw data. How does this normally get addressed in active learning?

olgaliak commented 5 years ago

How about downsampling to N images for the camera that took 100K images (vs. the camera that took 1K)?
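One way to act on that suggestion (not part of this repo's pipeline) is to cap the number of seed images drawn per camera, so the high-volume camera cannot dominate the initial labeling pool. A minimal sketch, assuming you can map each image to its source camera; the function name and data layout here are hypothetical:

```python
import random


def stratified_seed_sample(image_ids_by_camera, per_camera_cap, seed=0):
    """Draw an initial labeling pool with at most `per_camera_cap` images
    from each camera. Cameras with fewer images contribute everything
    they have; cameras with more are downsampled to the cap."""
    rng = random.Random(seed)
    sample = []
    for camera, image_ids in image_ids_by_camera.items():
        k = min(per_camera_cap, len(image_ids))
        sample.extend(rng.sample(image_ids, k))
    rng.shuffle(sample)  # mix cameras so tagging order is not grouped
    return sample


# Hypothetical deployment: camera volumes ranging from 1,000 to 100,000.
pools = {
    "cam_a": [f"a_{i}" for i in range(100_000)],
    "cam_b": [f"b_{i}" for i in range(1_000)],
}
seed_set = stratified_seed_sample(pools, per_camera_cap=10)
```

With a cap of 10, each camera contributes exactly 10 images regardless of how many it recorded, which keeps the first training round from inheriting the raw-data imbalance described above.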