Failed training for test_data: macro_tiles_10x10 and sample_clustered

google-research / circuit_training

Apache License 2.0


Failed training for test_data: macro_tiles_10x10 and sample_clustered #32

Open samli50801 opened 2 years ago

samli50801 commented 2 years ago

We have trouble running two of the test data sets, macro_tiles_10x10 and sample_clustered, although we succeeded in running the training job for Ariane. The tmux session for collect_job is stuck at model_id 0/step 0 and does not progress, as shown in the following figure.

[Screenshot: collect_job tmux session stuck at model_id 0/step 0]

For these two test cases, we used the same commands as for Ariane, except for the following changes:

$ export NETLIST_FILE=./circuit_training/environment/test_data/sample_clustered/netlist.pb.txt
$ export INIT_PLACEMENT=./circuit_training/environment/test_data/sample_clustered/initial.plc

Are we required to make other changes to the commands?
Should we modify hyperparameters such as learning_rate and batch_size to fit each test case?

Maria-UET commented 2 years ago

I also got stuck at model_id 0/step 0 when I trained the Ariane example AFTER randomly changing the macro sizes to bigger numbers. Can you compare the dimensions of the macros in your netlist with those in Ariane? Maybe there is a limit on macro size relative to grid cell size.

samli50801 commented 2 years ago

We haven't found a connection between macro size and grid cell size yet. However, we found that model_id/step also gets stuck at 0 when the only change we make to Ariane is the grid size, from 35x33 to 35x32. We're still looking for the reason.

ZFTurbo commented 2 years ago

I also wasn't able to run these two tests. Training gets stuck at iteration 0 and nothing happens.

sakundu commented 2 years ago

macro_tiles_10x10 has 100 macros. So you have to update the "sequence_length" to 101 in train_ppo.py and "max_sequence_length" to 101 in ppo_collect.py before starting the training.

"sequence_length" and "max_sequence_length" should be set to (TOTAL_NUMBER_OF_MACRO + 1) for training. The default value is 134 because the ariane test case has 133 macros.
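As a quick sanity check of the rule above, here is a minimal sketch of the arithmetic. The helper name `required_sequence_length` is hypothetical and not part of circuit_training; it only encodes the (TOTAL_NUMBER_OF_MACRO + 1) rule, and you still have to set "sequence_length" and "max_sequence_length" yourself.

```python
def required_sequence_length(num_macros: int) -> int:
    """Return num_macros + 1, the value suggested above for both
    "sequence_length" and "max_sequence_length".

    Hypothetical helper for illustration; not part of circuit_training.
    """
    if num_macros < 0:
        raise ValueError("num_macros must be non-negative")
    return num_macros + 1


if __name__ == "__main__":
    # Macro counts as stated in this thread.
    for name, macros in [("macro_tiles_10x10", 100), ("ariane", 133)]:
        print(name, required_sequence_length(macros))
```

For macro_tiles_10x10 this gives 101, and for ariane it gives 134, which matches the default mentioned above.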

You may check out our MacroPlacement repo.

We provide a script to write out the protobuf netlist from the Innovus tool which can be used as input to the CircuitTraining grouping code to generate the clustered netlist.

Also, you can train circuit training using the testcases available in MacroPlacement/Flows/\<Enablement>/\<Design>/netlist/output_CT_Grouping/.

If you are interested in learning more about proxy cost computation, you can visit this link.

You can find the progress we have made through August 26 here.