bkj closed this issue 6 years ago
It's based on a single random architecture. The averaged performance of multiple random architectures (with early stopping) can be read from Figure 3. You may refer to this paper for more empirical results of random search in a controlled setting.
What do you mean it can be read from Figure 3 exactly? Is there something in there that shows multiple random models trained w/ the full 2 day regimen?
Figure 3 shows the performance of architecture snapshots over time, starting from random architectures. So you can check the leftmost of each plot to get an idea how random architectures would work (when trained for a relatively small number of epochs). We did not conduct controlled experiments for random architectures in the full setup, but you can find them in the Figure 1 of the AmoebaNet paper.
Edit: I did some additional experiments, and here are the accuracies for four independent samples of random architectures under the full evaluation setup (auxiliary towers included when counting #params):
RANDOM1 96.289997 2.926344MB
RANDOM2 96.329997 3.578408MB
RANDOM3 96.549998 3.096984MB
RANDOM4 96.089997 3.545616MB
DARTS 97.159997 3.645840MB
Ok, makes sense. The AmoebaNet paper uses a different search space / training procedure though, right? Yours should be closer to ENAS?
To clarify, our search space for convolutional cells is actually close to that of NASNet & AmoebaNet, except that we concatenated all intermediate nodes to obtain the cell's output whereas they concatenated unused nodes only. We also tried to mimic their training procedure (by using auxiliary towers, cutout and path dropout), but it's possible that there might be some implementation discrepancies.
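To make the concatenation difference concrete, here's a toy sketch in pure Python (not from the repo) using the `(op, predecessor)` genotype convention seen in this thread, where nodes 0 and 1 are the cell inputs and nodes 2..5 are intermediate nodes:

```python
def concat_nodes(gene, num_inputs=2, num_nodes=4, unused_only=False):
    """Return the intermediate-node indices whose outputs form the cell output.

    gene: list of (op, predecessor) pairs, two per intermediate node.
    unused_only=False -> DARTS rule: concatenate all intermediate nodes.
    unused_only=True  -> NASNet/AmoebaNet-style rule: concatenate only nodes
                         that no later node consumes as an input.
    """
    intermediate = list(range(num_inputs, num_inputs + num_nodes))
    if not unused_only:
        return intermediate
    used = {pred for _, pred in gene}  # nodes consumed by some edge
    return [n for n in intermediate if n not in used]

# Example: node 3 takes input from node 2 (predecessor index 2),
# so node 2 is "used" and dropped under the unused-only rule.
gene = [('sep_conv_3x3', 0), ('skip_connect', 1),   # node 2
        ('sep_conv_3x3', 2), ('skip_connect', 0),   # node 3
        ('sep_conv_3x3', 1), ('skip_connect', 0),   # node 4
        ('sep_conv_3x3', 3), ('skip_connect', 1)]   # node 5

print(concat_nodes(gene))                    # [2, 3, 4, 5]
print(concat_nodes(gene, unused_only=True))  # [4, 5]
```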
That being said, you can reproduce the results of NASNet and AmoebaNet using our code (the architecture specifications of NASNet-A and AmoebaNet-A are provided in `genotypes.py`) on CIFAR-10, with up to about 0.2% difference in the error rate. You can find these results in our Table 1.
@quark0 thanks for open sourcing the code! @bkj I am sharing some results of the runs I did. Hopefully you can share your genotypes and associated val acc too.
Results:
NETWORK1 : 96.979997 (3.724896MB)
NETWORK2 : 97.129997 (3.833760MB)
DARTS : 97.229997 (3.645840MB)
Associated genotypes:
NETWORK1 = Genotype(
normal=[
('skip_connect', 0),
('sep_conv_3x3', 1),
('sep_conv_3x3', 1),
('skip_connect', 1),
('sep_conv_3x3', 0),
('skip_connect', 0),
('sep_conv_3x3', 2),
('sep_conv_3x3', 0)
],
normal_concat=[2, 3, 4, 5],
reduce=[
('avg_pool_3x3', 0),
('skip_connect', 1),
('skip_connect', 0),
('max_pool_3x3', 1),
('max_pool_3x3', 1),
('skip_connect', 0),
('max_pool_3x3', 1),
('max_pool_3x3', 2)
],
reduce_concat=[2, 3, 4, 5]
)
NETWORK2 = Genotype(
normal=[
('dil_conv_3x3', 1),
('sep_conv_5x5', 0),
('sep_conv_3x3', 0),
('skip_connect', 1),
('sep_conv_3x3', 0),
('skip_connect', 1),
('dil_conv_3x3', 0),
('sep_conv_3x3', 2)
],
normal_concat=[2, 3, 4, 5],
reduce=[
('sep_conv_3x3', 0),
('skip_connect', 1),
('max_pool_3x3', 2),
('avg_pool_3x3', 0),
('sep_conv_3x3', 0),
('max_pool_3x3', 2),
('max_pool_3x3', 2),
('max_pool_3x3', 0)
],
reduce_concat=[2, 3, 4, 5]
)
From the val acc logs, it looks like the NETWORK1 and NETWORK2 models plateau because of severe overfitting, so it would be interesting to see more such plots and whether more dropout would have helped NETWORK1.
Edit: changed RANDOM to NETWORK because the random sampling is not uniform.
Yes I'll share mine when they're done. Thanks for sharing yours.
Thanks a lot for sharing the results! Could you give more details about how those random architectures are generated? Specifically, I wonder why there's no conv ops in the reduction cell of RANDOM1.
Yes @karandwivedi42 -- how did you generate the random architectures? Can you upload a gist maybe? I can as well once I have my next set of results.
RANDOM2 is uniformly sampled for ~~id and~~ op. `reduce_concat` is unchanged because I got an error when I changed it, so I decided to keep it as it is (also for a fair comparison of choices).
RANDOM1 is uniformly sampled from a subset of the search space (because I wanted to compare architecture search vs. human-intuition-based choices): `sep_conv_3x3` and `skip_connect` for normal cells, and `max_pool_3x3`, `avg_pool_3x3`, and `skip_connect` for reduction cells. I used a simple heuristic: simplest conv for normal cells and only pooling for reduction cells.
Conclusion: RANDOM2 is uniformly sampled and RANDOM1 is uniformly sampled from a subset.
Sorry for no script - I just printed some random numbers in numpy and typed by hand.
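Since no script was shared, here is a minimal sketch of what uniform sampling could look like. The op list and the 4-node / 2-predecessor cell structure are assumptions read off the genotypes in this thread; check `genotypes.py` in the repo for the exact primitive set. Sampling the predecessors with `random.sample` also randomizes the graph topology, not just the ops:

```python
import random
from collections import namedtuple

Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat')

# Op names as they appear in this thread (excluding 'none');
# the authoritative list lives in the repo's genotypes.py.
OPS = ['max_pool_3x3', 'avg_pool_3x3', 'skip_connect',
       'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5']

def random_cell(num_nodes=4, ops=OPS):
    """Sample one cell: intermediate node i picks 2 distinct predecessors
    from the i+2 earlier nodes, each paired with a uniformly random op."""
    gene = []
    for i in range(num_nodes):
        preds = random.sample(range(i + 2), 2)  # random topology
        gene += [(random.choice(ops), p) for p in preds]
    return gene

def random_genotype(seed=None):
    if seed is not None:
        random.seed(seed)
    return Genotype(normal=random_cell(), normal_concat=[2, 3, 4, 5],
                    reduce=random_cell(), reduce_concat=[2, 3, 4, 5])

print(random_genotype(seed=0))
```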
Thanks, @karandwivedi42. It's interesting to notice that more prior knowledge didn't help in your particular case, though the conclusion is possibly subject to variance. It would be informative to actually try multiple random architectures using different seeds (if you have enough resources). You can get a random architecture by printing the genotype at the very beginning of the search process.
BTW, I'm not sure if you have fully randomized the graph topology of RANDOM2 (regardless of the op types on the edges). Currently it's exactly the same as our DARTS cell. Could you double check?
@quark0 I see I didn't do random sampling properly but all genotype and results on this page are accurate. You can interpret them as single examples and not random samples.
Yes, trying with different seeds is important and should ideally be a part of the paper. Sorry, I don't have enough compute to run them.
Also I agree that the models didn't perform well but I am curious if it had been the same for say slightly higher dropout.
The 3 curves are very close before the plateau (where the training signal basically becomes very close to zero).
Thanks. Now your results make more sense to me. FYI, I've added the results of several uniformly random architectures in one of my previous comments. Closing the issue, as all questions have been answered and the ticket is not related to any bugs in the code.
Thanks! Looking forward to @bkj 's results too. Can you also add the Genotypes you used?
RANDOM1 = Genotype(normal=[('max_pool_3x3', 0), ('dil_conv_3x3', 1), ('max_pool_3x3', 1), ('dil_conv_3x3', 2), ('avg_pool_3x3', 0), ('sep_conv_3x3', 3), ('dil_conv_3x3', 4), ('avg_pool_3x3', 2)], normal_concat=[2, 3, 4, 5], reduce=[('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('sep_conv_5x5', 2), ('dil_conv_3x3', 1), ('skip_connect', 3), ('dil_conv_5x5', 0), ('sep_conv_3x3', 1), ('avg_pool_3x3', 4)], reduce_concat=[2, 3, 4, 5])
RANDOM2 = Genotype(normal=[('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('max_pool_3x3', 0), ('skip_connect', 2), ('max_pool_3x3', 3), ('dil_conv_5x5', 0), ('skip_connect', 3), ('sep_conv_3x3', 0)], normal_concat=[2, 3, 4, 5], reduce=[('max_pool_3x3', 0), ('sep_conv_5x5', 1), ('skip_connect', 2), ('avg_pool_3x3', 1), ('dil_conv_3x3', 3), ('dil_conv_5x5', 0), ('dil_conv_3x3', 2), ('skip_connect', 4)], reduce_concat=[2, 3, 4, 5])
RANDOM3 = Genotype(normal=[('dil_conv_5x5', 0), ('avg_pool_3x3', 1), ('skip_connect', 0), ('skip_connect', 1), ('sep_conv_3x3', 3), ('sep_conv_3x3', 2), ('skip_connect', 4), ('skip_connect', 0)], normal_concat=[2, 3, 4, 5], reduce=[('dil_conv_5x5', 1), ('dil_conv_3x3', 0), ('sep_conv_5x5', 0), ('sep_conv_3x3', 1), ('skip_connect', 1), ('sep_conv_3x3', 3), ('sep_conv_5x5', 1), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
RANDOM4 = Genotype(normal=[('avg_pool_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_5x5', 2), ('dil_conv_3x3', 1), ('avg_pool_3x3', 2), ('dil_conv_3x3', 1), ('dil_conv_3x3', 3), ('dil_conv_5x5', 1)], normal_concat=[2, 3, 4, 5], reduce=[('dil_conv_3x3', 0), ('max_pool_3x3', 1), ('dil_conv_5x5', 0), ('max_pool_3x3', 2), ('skip_connect', 3), ('sep_conv_5x5', 0), ('dil_conv_5x5', 0), ('dil_conv_5x5', 4)], reduce_concat=[2, 3, 4, 5])
Results for various searches and training:
DARTS (run1):
Genotype(normal=[('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 0), ('dil_conv_3x3', 2), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('skip_connect', 0), ('skip_connect', 1)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('sep_conv_3x3', 1), ('avg_pool_3x3', 0), ('skip_connect', 2), ('avg_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 2), ('avg_pool_3x3', 0)], reduce_concat=range(2, 6))
2018-07-06 17:39:47,075 train_acc 99.289998
2018-07-06 17:39:47,331 valid 000 5.295296e-02 96.875000 100.000000
2018-07-06 17:39:50,579 valid 050 1.402304e-01 96.589050 99.918300
2018-07-06 17:39:53,824 valid 100 1.356149e-01 96.833743 99.896864
2018-07-06 17:39:54,049 valid_acc 96.829997
DARTS (run2):
Genotype(normal=[('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('dil_conv_5x5', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('skip_connect', 0)], normal_concat=[2, 3, 4, 5], reduce=[('avg_pool_3x3', 0), ('max_pool_3x3', 1), ('dil_conv_5x5', 2), ('avg_pool_3x3', 0), ('skip_connect', 2), ('dil_conv_5x5', 3), ('skip_connect', 2), ('avg_pool_3x3', 0)], reduce_concat=[2, 3, 4, 5])
2018-07-07 01:39:54,719 train_acc 98.909998
2018-07-07 01:39:54,963 valid 000 8.626432e-02 97.916664 100.000000
2018-07-07 01:39:58,933 valid 050 1.333493e-01 97.058821 99.897875
2018-07-07 01:40:02,901 valid 100 1.194686e-01 97.256598 99.927805
2018-07-07 01:40:03,180 valid_acc 97.279998
DARTS (run3 -- original architecture):
Genotype(normal=[('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('skip_connect', 2)], normal_concat=[2, 3, 4, 5], reduce=[('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('max_pool_3x3', 0), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 2), ('avg_pool_3x3', 0)], reduce_concat=[2, 3, 4, 5])
2018-06-30 20:09:24,087 train_acc 99.147998
2018-06-30 20:09:24,338 valid 000 7.525486e-02 97.916664 100.000000
2018-06-30 20:09:28,351 valid 050 1.391192e-01 96.813723 99.959150
2018-06-30 20:09:32,364 valid 100 1.296854e-01 97.184403 99.958746
2018-06-30 20:09:32,644 valid_acc 97.189997
DARTS (run4):
Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 0), ('skip_connect', 1)], normal_concat=[2, 3, 4, 5], reduce=[('max_pool_3x3', 0), ('max_pool_3x3', 1), ('max_pool_3x3', 0), ('skip_connect', 2), ('max_pool_3x3', 0), ('dil_conv_5x5', 3), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
2018-07-02 21:15:51,864 train_acc 99.275998
2018-07-02 21:15:52,106 valid 000 7.127011e-02 96.875000 100.000000
2018-07-02 21:15:56,139 valid 050 1.245886e-01 96.568625 99.918300
2018-07-02 21:16:00,172 valid 100 1.191070e-01 96.947192 99.938119
2018-07-02 21:16:00,453 valid_acc 96.939997
So, to summarize, I got:
96.829997
96.939997
97.189997
97.279998
So that seems like more variance than reported in the paper -- certainly on the low side, but also maybe on the high side.
@bkj thanks for the results! Very informative.
We computed the variance by training the same architecture four times, but the type of variance that you're investigating (i.e. variance across multiple genotypes) is certainly interesting. This kind of variance could potentially be reduced by some architecture selection mechanism (e.g. always repeating the search several times and picking the best genotype), but I agree that even so the variance could still be quite high, and 4 samples may not be sufficient for a very accurate estimation of the true mean & variance.
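As an illustration of such a selection mechanism, here's a hypothetical wrapper (not part of the DARTS repo): run the search several times and keep the genotype whose search-time validation accuracy is highest. The `fake_search` stand-in below is invented purely for the demo:

```python
import random

def select_best_genotype(run_search, k=4):
    """Hypothetical selection wrapper: repeat the search k times and keep
    the genotype with the best search-time validation accuracy.

    run_search: callable returning (genotype, valid_acc) for one search run.
    """
    results = [run_search() for _ in range(k)]
    return max(results, key=lambda r: r[1])[0]

# Toy stand-in for a real search run; a genotype is just a label here.
random.seed(0)
def fake_search():
    return 'arch_%d' % random.randrange(1000), random.gauss(97.0, 0.2)

print(select_best_genotype(fake_search, k=4))
```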
Ah got it -- the difference in the setup explains the variance I think.
BTW -- I ported/reimplemented the DARTS algorithm here. Mostly for learning purposes, and also to integrate w/ some existing wrappers I maintain. I'll be doing a number of further experiments, will post here if they're relevant.
Yes, please just post them here or shoot me an email if you prefer. I will always be happy to learn about new findings and answer questions.
Hi @quark0, may I ask: when you report the valid_acc of one run, do you use the valid_acc of the last epoch, or the best valid_acc across all 600 epochs? I ran the experiment two times and got these results on the CIFAR-10 test set:
96.86
96.91
We always use the valid_acc in the last epoch. Are you running the latest version of the code? You can find the expected learning curves in the README.
Hi @quark0, I modified the code to adapt it to PyTorch 0.4.0.
@quark0 sorry, I found that I used a different weight decay (0.0001). I'm currently re-training DARTS with a weight decay of 0.0003 and hope to get a higher result.
Hi --
In the paper, you describe fairly strong baseline performance from random architectures. Are you able to give a little more information about how those random baselines were done? Specifically, is that the average of a number of random runs, or just a single random run?
Thanks ~ Ben