isi-vista / adam

Abduction to Demonstrate an Articulate Machine
MIT License

Generate downsampled objects curriculum #1147

Closed spigo900 closed 2 years ago

spigo900 commented 2 years ago

We want to generate a downsampled version of the objects curriculum. Specifically, we want two variants: one with k=2 images per object, and one with k=10 images per object. (The M5 objects curriculum has k=30 per object.) We want a script that does this downsampling given the curriculum.

Some further details. I want this script to sample at random from the 30 images we have for each object, rather than taking the first k. The downsampling script should take the seed as a parameter. For now, sample from only those images where stroke extraction succeeded. Sample without replacement. If it turns out we don't have enough "good" samples to sample k of them without replacement for some objects, log a warning and take as many samples as we have.
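A minimal sketch of the sampling logic described above, assuming the processed curriculum can be loaded into a dict mapping each object type to a list of image records with a `stroke_extraction_succeeded` flag (the field name and record shape are hypothetical, not the actual processed-curriculum format):

```python
import logging
import random


def downsample(images_by_object, k, seed):
    """Sample up to k stroke-extraction successes per object, without replacement.

    `images_by_object` maps object type -> list of image records; each record is
    assumed to carry a boolean `stroke_extraction_succeeded` field.
    """
    rng = random.Random(seed)  # seed is a parameter so runs are reproducible
    sampled = {}
    for obj, images in images_by_object.items():
        good = [img for img in images if img["stroke_extraction_succeeded"]]
        if len(good) < k:
            # Not enough good samples: warn and take everything we have.
            logging.warning(
                "Only %d good samples for %s; wanted %d. Taking all of them.",
                len(good), obj, k,
            )
            sampled[obj] = list(good)
        else:
            sampled[obj] = rng.sample(good, k)  # without replacement
    return sampled
```

Using `random.Random(seed)` rather than the module-level functions keeps the script's randomness isolated, so the same seed always reproduces the same downsampled curriculum.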

I think we want to do input/output on the processed curriculum format for convenience. If we used the raw format we'd have to update the reorganizing script to handle potentially not having 30 images per object. That's doable but sounds like an annoying tangent.

First step

@sidharth-sundar, as a first step, please write a script to count, for each object type, the number of images where stroke extraction succeeded. The output should be a two-column table, left being the object and right being the success count, e.g. `| apple | 30 |` or `| window | 11 |`. You can post that here or use a Google Sheet, whichever is more convenient. I want to know so we know whether we have 10 good samples for each object.
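A rough sketch of that counting step, assuming the curriculum can be iterated as (object type, image record) pairs with the same hypothetical `stroke_extraction_succeeded` flag (the real processed-curriculum format may differ):

```python
from collections import Counter


def count_good_samples(curriculum):
    """Count, per object type, the images where stroke extraction succeeded.

    `curriculum` is assumed to be an iterable of (object_type, image_record)
    pairs; adapt the accessors to the actual processed-curriculum layout.
    """
    return Counter(
        obj for obj, img in curriculum if img["stroke_extraction_succeeded"]
    )


def format_table(counts):
    """Render the counts as a two-column table, descending by count."""
    rows = sorted(counts.items(), key=lambda kv: -kv[1])
    return "\n".join(f"| {obj} | {n} |" for obj, n in rows)
```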

I am also thinking we may want to do a "relative sample size preserving" variation on this experiment. We know the distribution is not uniform, so this table would tell us how badly we're mangling the distribution. I think these counts (together with the GNN results) would help us decide if a distribution-preserving experiment is worthwhile.

sidharth-sundar commented 2 years ago

@spigo900 Here's the list of objects with their corresponding counts. I worked from the M5 curriculum which, as you've mentioned, has 30 sampled images per object. Further, the preprocessing for the M5 curriculum already handled removing scenes with no stroke graphs, so the following is equivalent to the number of samples for each object in the M5 curriculum:

| object | count |
| --- | --- |
| a apple | 30 |
| a book | 30 |
| a cup | 30 |
| a floor | 30 |
| a mug | 30 |
| a orange | 30 |
| a sphere_block | 30 |
| a toy_truck | 30 |
| a pyramid_block | 30 |
| a banana | 30 |
| a box | 29 |
| a ball | 29 |
| a toy_sedan | 29 |
| a cube_block | 28 |
| a sofa | 28 |
| a table | 27 |
| a chair | 26 |
| a paper | 25 |
| a desk | 20 |
| a window | 10 |

These are sorted in descending order, so the only item we're really short on is window. Barring that, we have 10 good samples for every object. A relative-sample-size variant might work here, but it would leave us with fewer than 10 samples for window, which I don't think is preferable. (Also, the indentation-based formatting got mangled, since markdown collapses whitespace.)

spigo900 commented 2 years ago

@sidharth-sundar Thank you. This was useful.

As I'm reading this it looks like we don't need to mess with relative sample sizes -- most have about 30 samples (18 have at least 25). That is, we're pretty close to a balanced distribution as it is. So I think we probably don't want to run a separate "matched distribution" experiment as it's unlikely to show anything interesting vs. the "perfectly balanced" experiment.

spigo900 commented 2 years ago

Also, no problem / need to reformat the above -- it was more than readable enough for this -- but FWIW you may find GitHub-Flavored Markdown's tables helpful for posting tables.
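For reference, the GFM table syntax looks like this (using a couple of counts from the table above):

```markdown
| object   | count |
| -------- | ----- |
| a apple  | 30    |
| a window | 10    |
```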