Make trained model available for prediction

jhammelman commented 4 years ago

I've been trying to train the resnet.py model to predict accessibility from DNA sequence (on the provided DNase-seq data). I've run into some issues (listed below). Is it possible for you to provide the trained model weights and a script to initialize the trained model and predict on new DNA?

1) In the training script, the model seems to expect held-out cell types instead of held-out chromosomes. I managed to work around this by changing line 31 in runner.py, which assumes the held-out data is cell types, but I'm still not sure it's training correctly (see point 2 below).

Train command: python resnet.py --dnase /data/packbited --rna_quants /data/rna_quants_1630tf.joblib -bs 32 -cp /data/checkpoint-resnet -ho chromosomes -tl 18 -vl 19 > resnet-train-report.txt 2>&1

2) Once the model is trained, performance on the held-out chromosome is poor (overall AUPRC = 0.068, AUC = 0.498).

Test command: python simple.py --dnase /data/packbited --rna_quants /data/rna_quants_1630tf.joblib -bs 32 -s1f ../stage1/resnet.py -s1m /data/checkpoint-resnet/ -ho chromosomes -tl 18 -vl 19 -rfn resnet-report.txt -evmode report -ev 1 --with_mean 0 -cp /data/stage2-resnet

suragnair commented 4 years ago

Hi Jennifer. If I understand correctly, you are interested in training a sequence-only ResNet model and then running the model on held out sequences? And thus you do not require a model that can make predictions in new cell types?

jhammelman commented 4 years ago

Hi Surag,

Sorry for the late reply. You are correct. I was hoping to get a sequence-only ResNet model, which I thought I could train by holding out chromosomes in stage 1, based on the description: "The stage 1 models predict accessibility across all training cell types from only sequence, and does not utilise RNA-seq profiles"

From reading your paper, I realize that may not be what the stage 1 model of ChromDragoNN does? Based on the description, I assumed the output of the stage 1 model would be 123 probabilities, one per cell type, each representing open chromatin in that cell type.

-Jen

On Fri, Dec 20, 2019 at 1:38 AM Surag Nair notifications@github.com wrote:

Hi Jennifer. If I understand correctly, you are interested in training a sequence-only ResNet model and then running the model on held out sequences? And thus you do not require a model that can make predictions in new cell types?


suragnair commented 4 years ago

The training command is most likely correct. I believe you should not need to change runner.py for the training step; correct me if I'm wrong.

The test command won't work since it'll load the weights into an untrained stage 2 model and try to make predictions from sequence + RNA-seq (which it hasn't been trained on).

I actually don't have code to evaluate the stage 1 model on the test set since it is not something we require for ChromDragoNN (final test evaluation happens on held out cell types).

To evaluate just the stage 1 model on your chromosome of choice, I'd suggest the following steps:

1) Instantiate the model and data iterator as here: https://github.com/kundajelab/ChromDragoNN/blob/4ed1bd9e09f62a86ed549c6b79a5335a72054a5a/model_zoo/stage1/resnet.py#L214-L218 Pass in the same arguments as for the train command and add the flag -rb 1; this should load the model from the checkpoint and set up the data iterator.
2) The test function here: https://github.com/kundajelab/ChromDragoNN/blob/4ed1bd9e09f62a86ed549c6b79a5335a72054a5a/utils/model_pipeline_basset.py#L94 currently runs during training to compute running metrics on validation data. You'll need to modify it to pass in test data sequentially (instead of random validation batches).
3) To do so, you'll have to modify https://github.com/kundajelab/ChromDragoNN/blob/4ed1bd9e09f62a86ed549c6b79a5335a72054a5a/utils/data_iterator.py#L325 to fetch the ith batch from the (test) data instead.
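The "fetch the ith batch" idea can be sketched roughly as below. This is a minimal, framework-free illustration and not the repository's code: `iter_sequential_batches`, `collect_predictions`, and `toy_predict` are hypothetical names, and the real data iterator and test function will have different signatures. The point is only that every test example is visited exactly once, in order, including a smaller final batch.

```python
def iter_sequential_batches(inputs, labels, batch_size):
    """Yield (i, input_batch, label_batch) in order, covering each example once.

    Hypothetical helper: works on any sliceable sequence (lists, NumPy arrays).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    n_batches = (len(inputs) + batch_size - 1) // batch_size  # ceiling division
    for i in range(n_batches):
        start = i * batch_size
        end = min(start + batch_size, len(inputs))  # last batch may be smaller
        yield i, inputs[start:end], labels[start:end]


def collect_predictions(predict_fn, inputs, labels, batch_size):
    """Run predict_fn on each batch in order; return all predictions and labels."""
    all_preds, all_labels = [], []
    for _, x, y in iter_sequential_batches(inputs, labels, batch_size):
        all_preds.extend(predict_fn(x))
        all_labels.extend(y)
    return all_preds, all_labels


# Toy stand-ins for the held-out chromosome data and the trained model.
toy_inputs = ["ACGT", "TTGA", "CCGA", "GATC", "AACC"]
toy_labels = [1, 0, 0, 1, 0]


def toy_predict(batch):
    # Placeholder "model": fraction of G bases per sequence (assumption, not a real model).
    return [seq.count("G") / len(seq) for seq in batch]


preds, labels = collect_predictions(toy_predict, toy_inputs, toy_labels, batch_size=2)
```

In the actual pipeline, `predict_fn` would wrap the stage 1 model's forward pass in evaluation mode, and the collected predictions would then be passed to whatever AUPRC/AUC routine the test function already uses.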

Would that be possible? Do let me know if you have any questions!

Also, during training, was the validation AUPRC reasonable? I would say you can trust that number at the very least.
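As a rough sanity check on the test numbers reported earlier in the thread: an AUC close to 0.5 is what uninformative (random) scores produce, and for such scores the AUPRC tends to sit near the fraction of positive labels. The sketch below illustrates this with synthetic data only. The 7% positive rate is an arbitrary assumption chosen for illustration, not a figure from the dataset, and `auroc` and `positive_fraction` are hypothetical helpers rather than functions from the repository.

```python
import random


def auroc(labels, scores):
    """AUROC as P(score_pos > score_neg) + 0.5 * P(tie), by pairwise comparison."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def positive_fraction(labels):
    """Fraction of positives: the expected AUPRC of uninformative scores."""
    return sum(labels) / len(labels)


# Synthetic example: labels and scores are independent, so scores carry no signal.
rng = random.Random(0)
demo_labels = [1 if rng.random() < 0.07 else 0 for _ in range(1000)]  # assumed 7% positives
demo_scores = [rng.random() for _ in demo_labels]

chance_auc = auroc(demo_labels, demo_scores)      # expected to be close to 0.5
baseline_auprc = positive_fraction(demo_labels)   # chance-level AUPRC reference
```

Comparing the validation AUPRC logged during training against the positive fraction of the validation labels would show whether the stage 1 model learned anything, independently of the test command.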

Make trained model available for prediction #3