glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Prepare pipeline for data retraining #247

Closed Error323 closed 6 years ago

Error323 commented 6 years ago

After all existing training data has been converted to V3 to remedy bugs #236 and #231, a full retraining will be performed. This will have several benefits:


  1. restore full backwards compatibility for all nets (balanced elo strength increments for players of all levels)
  2. remove unused neural network input planes (improve efficiency slightly)
  3. higher elo strength (expected)

As all simulated data is already present, this will be much faster than before; expected training time is ~24 hours. The next step would be a forced version upgrade (V0.5) and uploading the new net. I think that only after this undertaking should we start looking at bigger networks (128x8, 128x10, ...).

CMCanavessi commented 6 years ago

Awesome. What do you plan to do with the current networks and the current website layout? Will all the current networks be removed from the website? Maybe move them to some new "old/deprecated networks" area?

Will you also automatically upload the newly trained networks as they become available (like it works now), or will you do the full training and then manually select 10-20-50 networks in ascending strength and upload them?

Will this retraining be done in private, or will we be able to follow progress and matches like we currently can every time a new network is uploaded and tested by all the clients?

evalon32 commented 6 years ago

You could also remove unused output channels. What I mean: NUM_OUTPUT_POLICY = 1924 includes 66 promotions for white and 66 for black. After fixing #236, we won't need the last 66.
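The arithmetic behind this suggestion can be sketched in a couple of lines (the names here are illustrative; only the 1924 and 66 figures come from the comment above):

```python
# Sketch of the proposed policy-head trim. The current head has 1924
# outputs, which include two blocks of 66 promotion moves (one per
# colour). Once #236 makes the board side-to-move-relative, one of the
# two promotion blocks can never fire, so it can be dropped.
NUM_OUTPUT_POLICY = 1924
UNUSED_PROMOTIONS = 66  # the redundant promotion block

trimmed = NUM_OUTPUT_POLICY - UNUSED_PROMOTIONS
print(trimmed)
```

This would shrink the policy output from 1924 to 1858 entries.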

Error323 commented 6 years ago

> Awesome. What do you plan to do with the current networks and the current website layout? Will all the current networks be removed from the website? Maybe move them to some new "old/deprecated networks" area?

We'd need to discuss that. But I guess it would make sense to initiate a second run, run2: upload each net as it becomes available so that matches are played, and build a fresh Elo curve from that. @glinscott @killerducky, what's your opinion on this?

> Will you also automatically upload the newly trained networks as they become available (like it works now), or will you do the full training and then manually select 10-20-50 networks in ascending strength and upload them?

It will be fully automated. The groundwork is already there; we need data conversion and a new ChunkSource that feeds new data into the pipeline as it's added to the directory.
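A minimal sketch of what such a ChunkSource could look like (assuming gzip-compressed chunk files; the class name matches the comment above, but the layout and polling logic are illustrative, not the actual pipeline code):

```python
import glob
import os
import random
import time


class ChunkSource:
    """Feeds training chunks, picking up new files as they appear
    in the watched directory (sketch; details are illustrative)."""

    def __init__(self, directory, poll_seconds=30):
        self.directory = directory
        self.poll_seconds = poll_seconds
        self.chunks = []

    def refresh(self):
        """Rescan the directory and append any newly added chunk files."""
        seen = set(self.chunks)
        for path in sorted(glob.glob(os.path.join(self.directory, '*.gz'))):
            if path not in seen:
                self.chunks.append(path)

    def sample(self):
        """Yield chunk paths forever, rescanning between draws so that
        freshly uploaded data enters the shuffle pool automatically."""
        while True:
            self.refresh()
            if not self.chunks:
                time.sleep(self.poll_seconds)
                continue
            yield random.choice(self.chunks)
```

The key property is that `refresh()` is cheap enough to call between draws, so the trainer never needs a restart when new self-play data lands.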

> Will this retraining be done in private, or will we be able to follow progress and matches like we currently can every time a new network is uploaded and tested by all the clients?

We should do everything as publicly as possible (obviously). The only reason it isn't all public yet is a lack of time to get to know (and manage/configure) the new hardware. I'll make sure we have full visibility into the TensorBoard and the learning-rate schemes. Neural-net uploads are performed at learning-rate boundaries as training progresses. My suggestion would be to try something like the following scheme:

```yaml
%YAML 1.2
---
name: 'run2-64x6'
gpu: 0

dataset:
    num_chunks: 250000
    train_ratio: 0.90
    input: '/run2/'

training:
    batch_size: 2048
    total_steps: 600000
    shuffle_size: 1048576
    lr_values:
        - 0.2
        - 0.1
        - 0.02
        - 0.01
        - 0.002
        - 0.001
        - 0.0002
        - 0.0001
    lr_boundaries:
        - 75000
        - 150000
        - 225000
        - 300000
        - 375000
        - 450000
        - 525000
    policy_loss_weight: 1.0
    value_loss_weight: 1.0
    path: '/networks/run2'

model:
    filters: 64
    residual_blocks: 6
...
```
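Since uploads happen at learning-rate boundaries, the schedule above implies the upload points directly. A quick pure-Python sketch (the `schedule` helper is illustrative, not pipeline code) that pairs each lr value with its step range:

```python
# Values copied from the run2-64x6 scheme above.
lr_values = [0.2, 0.1, 0.02, 0.01, 0.002, 0.001, 0.0002, 0.0001]
lr_boundaries = [75000, 150000, 225000, 300000, 375000, 450000, 525000]
total_steps = 600000


def schedule(values, boundaries, total):
    """Return (start_step, end_step, lr) segments of the piecewise
    learning-rate schedule defined by the config."""
    edges = [0] + boundaries + [total]
    return [(edges[i], edges[i + 1], values[i]) for i in range(len(values))]


for start, end, lr in schedule(lr_values, lr_boundaries, total_steps):
    print(f"steps {start:>6}-{end:>6}: lr={lr}")
```

That gives eight segments of 75k steps each, i.e. eight natural points at which a net would be uploaded and matched during the run.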

We have 2 computer systems that can perform training in parallel, so another scheme can also be created for a run3 to try new stuff. Suggestions and comments are most welcome.

mooskagh commented 6 years ago

I'm for keeping the network size and doing the run only once the two bugs are fixed. But I'm not sure about the changes to the input/output sizes: they will break things for the existing lczero.exe, and it's not clear how much of a performance gain there would be. On the other hand, because of the bugs the old lczero.exe won't play well with the new network anyway, so maybe that's fine.

But I'd expect bugs because we'll forget to change the input/output size somewhere. So I'd still keep the unused planes in the input and the unused entries in the policy output for now, just to make the change easier.
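To illustrate the "keep the unused planes" option: a hedged sketch in which the encoder keeps the input tensor shape fixed and simply leaves retired planes zeroed, so no size constants change anywhere. The plane count and indices here are hypothetical, not the real V3 layout:

```python
import numpy as np

# Hypothetical input layout: a fixed stack of 8x8 feature planes.
NUM_INPUT_PLANES = 120          # illustrative, not the actual count
UNUSED_PLANES = {118, 119}      # hypothetical indices of retired planes


def encode_position(features):
    """Pack per-plane 8x8 feature maps (dict: plane index -> array)
    into a fixed-size input tensor, zero-filling retired planes so the
    tensor shape, and every size constant downstream, stays the same."""
    planes = np.zeros((NUM_INPUT_PLANES, 8, 8), dtype=np.float32)
    for idx, plane in features.items():
        if idx in UNUSED_PLANES:
            continue  # retired planes stay all-zero
        planes[idx] = plane
    return planes
```

The efficiency win of actually removing the planes is deferred, but the diff stays tiny, which is the trade-off being argued for here.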

Error323 commented 6 years ago

I agree. I think we should become more conservative given all the YouTube attention now. Let's make sure we only fix bugs #236 and #231 and convert the data, keeping the code diff small so that we minimize the probability of introducing new bugs.