How to resume training?

bcli4d commented 5 years ago

Hello again, I've been working on reproducing your mutation results, running on a single puny Google VM with one GPU, fully training the model. However, after 300K batches, my process halted. I think Google had some kind of networking problem yesterday. So... I tried to resume training by setting the pretrained_model_checkpoint_path parameter to the model.ckpt-300000.data-00000-of-00001 intermediate checkpoint file, but without success. Do you know if it is possible to resume training from such a checkpoint?

BTW, even at this point I'm seeing an AUC of 0.76 for EGFR, though the other mutations look like a coin toss. I refined the mutation calling by excluding silent mutations. Here at ISB-CGC we have all the TCGA genomics data as well as pathology image metadata in BigQuery, so it was easy to write an SQL query to create a mutation calling manifest. We also have TCGA pathology images available in GCS, so I am pulling pathology images from GCS rather than from TCIA.

Did you find it necessary to fully train the model for mutation classification or was transfer learning sufficient?

Looking forward to your next paper.

Thanks as always, Bill

ncoudray commented 5 years ago

Hi Bill,

For the mutations, we did not do transfer learning. We retrained from scratch, but using only tiles that were previously classified as LUAD.

To resume training, have you checked the "pretrained_model_checkpoint_path" option?

HTH, Nicolas

bcli4d commented 5 years ago

Nicolas,

Got it working. In the current TensorFlow (I'm running 1.13), a checkpoint is composed of three files. For the checkpoint that I wished to restore, these are model.ckpt-300000.data-00000-of-00001, model.ckpt-300000.index and model.ckpt-300000.meta. It turns out that the --pretrained_model_checkpoint_path value should be the prefix shared by all three files, /full/path/to/model.ckpt-300000, rather than the name of any single file.

Regards, Bill
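
For anyone hitting the same problem, here is a minimal sketch of how to derive the checkpoint prefix from any of the three file names. The helper name checkpoint_prefix is hypothetical and is not part of DeepPATH or TensorFlow; it only strips the .data-XXXXX-of-XXXXX, .index and .meta suffixes mentioned above, on the assumption that those are the only suffixes present.

```python
import re

# Suffixes that TensorFlow 1.x appends to a checkpoint prefix
# (the three file types described in the comment above).
_CKPT_SUFFIX = re.compile(r"\.(data-\d{5}-of-\d{5}|index|meta)$")


def checkpoint_prefix(path):
    """Hypothetical helper: map any file of a checkpoint to its shared prefix.

    A path that is already a prefix is returned unchanged.
    """
    return _CKPT_SUFFIX.sub("", path)


if __name__ == "__main__":
    for name in (
        "model.ckpt-300000.data-00000-of-00001",
        "model.ckpt-300000.index",
        "model.ckpt-300000.meta",
    ):
        # Each of the three files resolves to the same prefix.
        print(checkpoint_prefix("/full/path/to/" + name))
```

The resulting string is what would be passed as --pretrained_model_checkpoint_path. If the training directory still contains its "checkpoint" state file, TensorFlow's own tf.train.latest_checkpoint(directory) should return the same kind of prefix for the most recent checkpoint.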

ncoudray commented 5 years ago

Great! N.

ncoudray / DeepPATH

How to resume training? #36