isi-vista / adam

Abduction to Demonstrate an Articulate Machine

Downsampling experiment #1156

Closed · spigo900 closed 2 years ago

spigo900 commented 2 years ago

We're interested in checking how well the object learner works when it has fewer samples to learn from. We want to run experiments where it has fewer images or samples per object.

As part of this task, you'll need to:

  1. Run the downsampling script on the M5 objects train curriculum, downsampling to k=2 and k=10 images/samples per object. That would give two curricula, say m6_objects_downsampled_2pertype and m6_objects_downsampled_10pertype. (A sketch of the downsampling logic appears after this list.)
  2. Train one object GNN module (#1110 / #1151) on each of the downsampled curricula. So, two GNNs trained.
  3. Evaluate both GNNs on train and on test. It's worth recording train accuracy for completeness, though it doesn't tell us much; the main reason to run inference on train is to produce the decode files ADAM needs (the feature.yaml files). Also record the test accuracies. I'd expect these to be very similar to ADAM's test accuracy in general.
  4. Run ADAM with the subset learner over the resulting curricula & decodes. Parameters would be similar to the M5 objects curriculum except for the train curriculum.
  5. Collect/compare accuracy results between k=2, k=10, and baseline (m5_objects_v0_with_mugs). It will probably be interesting here to look at and analyze the per-object results as well as the overall accuracy. I think this means:
    1. Plotting the three GNN test accuracy values, probably as a bar chart (see the plotting sketch after this list).
    2. Plotting the three GNN train accuracy values.
    3. Separately plot the ADAM train/test accuracies? Or only test if that is easier. (I don't think we do a final pass through train so evaluating on train may be more annoying than it's worth.)
    4. Check that you don't have red and green bars within the same chart, for colorblindness reasons. I think the default palette goes blue->orange->gray, so this shouldn't be a problem, but it's worth checking.
    5. Plotting the six GNN confusion matrices: one for each of the [(three models) X (two splits, train vs. test)] combinations (see the confusion-matrix sketch after this list).
      1. If you're not familiar with confusion matrices, this would be a table with 20 rows corresponding to "actual object type", and 20 columns to "the GNN's output prediction/what kind of object the GNN thought it was." Each cell is a number saying how many times the GNN thought that, for example, a chair was an apple vs. a chair vs. a desk, ...
      2. It might be worth making confusion matrices for ADAM separately, though because ADAM outputs multiple labels per sample that gets weirder.
    6. A writeup containing these images and some discussion of how sample size seems to affect (or not affect!) learning.
      1. This might also include a table showing the overall accuracy results by sample size and split (see the table sketch after this list).
      2. Generally I expect overall accuracy to suffer with fewer train samples. This is something to look out for: if accuracy goes up with fewer train samples, there is probably a bug somewhere. Assuming accuracy does suffer, the result is probably worth pointing out in one sentence but not worth discussing much.
      3. I think the main topic of interest is likely to be object accuracy for different types of object. For example maybe it turns out sample size affects windows more than balls.
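
For step 1, here is a minimal sketch of the per-type downsampling logic, in case the existing script needs sanity-checking. The (object_type, sample) pairing, the function name, and the seed handling are all hypothetical stand-ins for however the real curriculum format works:

```python
import random
from collections import defaultdict


def downsample_per_type(samples, k, seed=0):
    """Keep at most k samples per object type.

    `samples` is an iterable of (object_type, sample) pairs; the pairing
    is a stand-in for however the real curriculum ties samples to types.
    Fixing the seed keeps the downsampled curricula reproducible.
    """
    by_type = defaultdict(list)
    for object_type, sample in samples:
        by_type[object_type].append(sample)

    rng = random.Random(seed)
    kept = []
    for object_type in sorted(by_type):
        type_samples = by_type[object_type]
        rng.shuffle(type_samples)
        kept.extend((object_type, s) for s in type_samples[:k])
    return kept
```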
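
For the bar charts in 5.1-5.3 (and the palette concern in 5.4), a matplotlib sketch along these lines should work; the accuracy values are placeholders to fill in from the actual runs:

```python
import matplotlib.pyplot as plt

# Placeholder values; fill in the measured accuracies.
conditions = ["k=2", "k=10", "baseline"]
test_accuracy = [0.0, 0.0, 0.0]

fig, ax = plt.subplots()
# Matplotlib's default color cycle starts blue -> orange, so drawing
# colors from it avoids putting red and green bars in the same chart.
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"][: len(conditions)]
ax.bar(conditions, test_accuracy, color=colors)
ax.set_ylabel("GNN test accuracy")
ax.set_ylim(0.0, 1.0)
fig.savefig("gnn_test_accuracy.png")
```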
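
For the confusion matrices in 5.5, scikit-learn can build and plot the 20x20 table directly, assuming the gold and predicted object types are available as parallel lists; the wrapper below is hypothetical and would be called once per (model, split) pair. The per-type breakdown mentioned in 5.6.3 falls out of the same matrix: each type's accuracy is its diagonal cell divided by its row sum.

```python
from sklearn.metrics import ConfusionMatrixDisplay


def plot_confusion(y_true, y_pred, labels, title, out_path):
    """Rows are actual object types, columns are the GNN's predictions."""
    disp = ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred, labels=labels, xticks_rotation="vertical"
    )
    disp.ax_.set_title(title)
    disp.figure_.tight_layout()
    disp.figure_.savefig(out_path)
```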
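
And for the accuracy table in 5.6.1, a small pandas frame renders straight to Markdown for the writeup (numbers are placeholders; to_markdown needs the tabulate package installed):

```python
import pandas as pd

# Placeholder numbers; replace with the measured accuracies.
results = pd.DataFrame(
    {
        "condition": ["k=2", "k=10", "baseline"],
        "train accuracy": [0.0, 0.0, 0.0],
        "test accuracy": [0.0, 0.0, 0.0],
    }
)
print(results.to_markdown(index=False))
```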