giddyyupp / coco-minitrain

a subset of coco dataset for faster experimentation
Effect of run_count on stats/distribution? #17

Closed bryanbocao closed 2 years ago

bryanbocao commented 2 years ago

The default value of run_count is 10M and it takes a long time for me to sample. Does it only affect to what extent the stats/distribution of sampled train dataset matches the original train2017? If I set run_count to be 10, should I assume that the stats/distribution would not be affected too much? Thanks!

sholevs66 commented 2 years ago

How much time does it take you to go over the 10M?

Did you find out whether this has any effect? I would like to know too, since running the sampling script also takes a long time for me...

giddyyupp commented 2 years ago

If you set run_count to 10, the distributions will most likely not match. Basically, the more you sample, the higher the chance of getting similar distributions in the sampled set and the train set. Maybe we could add support for multi-threaded sampling, which would reduce the run time.
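To illustrate why a larger run_count helps, here is a toy sketch. It assumes the sampler works as a random search: draw a random subset on each run, compare its class distribution with the full set, and keep the closest one seen so far. This is not the repository's sampling script; the function names (`class_histogram`, `best_subset`, `make_toy_dataset`), the L1 distance, and the synthetic data are all hypothetical choices made for the example.

```python
import random
from collections import Counter


def class_histogram(image_ids, labels_per_image, num_classes):
    """Normalized per-class object frequency over the given images."""
    counts = Counter()
    for img in image_ids:
        counts.update(labels_per_image[img])
    total = sum(counts.values()) or 1  # guard against an empty selection
    return [counts[c] / total for c in range(num_classes)]


def l1_distance(p, q):
    """Sum of absolute differences between two histograms (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def best_subset(labels_per_image, num_classes, subset_size, run_count, seed=0):
    """Random search: keep the candidate subset closest to the full distribution.

    Assumption: this mirrors the idea behind run_count, not the actual script.
    """
    rng = random.Random(seed)
    all_ids = list(labels_per_image)
    target = class_histogram(all_ids, labels_per_image, num_classes)
    best_ids, best_dist = None, float("inf")
    for _ in range(run_count):
        candidate = rng.sample(all_ids, subset_size)
        hist = class_histogram(candidate, labels_per_image, num_classes)
        dist = l1_distance(hist, target)
        if dist < best_dist:
            best_ids, best_dist = candidate, dist
    return best_ids, best_dist


def make_toy_dataset(num_images=2000, num_classes=10, seed=1):
    """Synthetic stand-in for COCO: each image gets 1 to 8 labels, skewed by class."""
    rng = random.Random(seed)
    weights = [1.0 / (c + 1) for c in range(num_classes)]
    return {
        i: rng.choices(range(num_classes), weights=weights, k=rng.randint(1, 8))
        for i in range(num_images)
    }


if __name__ == "__main__":
    data = make_toy_dataset()
    for runs in (10, 100, 1000):
        _, dist = best_subset(data, num_classes=10, subset_size=200, run_count=runs)
        print(f"run_count={runs}: best L1 distance = {dist:.4f}")
```

With a fixed seed, the first 10 candidates are the same in every call, so the best distance can only stay equal or shrink as run_count grows. How quickly it shrinks on real COCO annotations is something you would have to measure.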

bryanbocao commented 2 years ago

@sholevs66 I tested this about a month ago. As far as I remember, I waited 10 to 20 minutes and the sampling still had not finished. My machine's hardware setup is:

Intel Core i9 - 9900KF - 128GB Memory - NVIDIA GeForce RTX 2080 SUPER

I think sampling runs on CPU.

@giddyyupp Thanks for the reply and I think that's a good suggestion.

bryanbocao commented 2 years ago

@giddyyupp I added the coco_minitrain_25k.zip download link, @sholevs66 you may download the 25k minitrain dataset in this zip file without resampling yourself.

Pull request: https://github.com/giddyyupp/coco-minitrain/pull/16
Branch: https://github.com/bryanbocao/coco-minitrain/tree/wip