kundajelab / ChromDragoNN

Code for the paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts"
MIT License

Make trained model available for prediction #3

Closed jhammelman closed 1 year ago

jhammelman commented 4 years ago

I've been trying to train the resnet.py model to predict accessibility from DNA sequence (on the provided DNase-seq data). I've run into some issues (listed below). Could you provide the trained model weights and a script to initialize the trained model and predict on new DNA?

1) In the training script, the model seems to expect held out cell types instead of held out chromosomes. I managed to work around this by changing line 31 in runner.py, which assumes the held out data is cell types, but I'm still not sure it's training correctly (see issue #2).

Train command: python resnet.py --dnase /data/packbited --rna_quants /data/rna_quants_1630tf.joblib -bs 32 -cp /data/checkpoint-resnet -ho chromosomes -tl 18 -vl 19 > resnet-train-report.txt 2>&1

2) Once I get the model trained, the performance is poor on the held out chromosome (OVERALL AUPRC = 0.068, AUC = 0.498).

Test command: python simple.py --dnase /data/packbited --rna_quants /data/rna_quants_1630tf.joblib -bs 32 -s1f ../stage1/resnet.py -s1m /data/checkpoint-resnet/ -ho chromosomes -tl 18 -vl 19 -rfn resnet-report.txt -evmode report -ev 1 --with_mean 0 -cp /data/stage2-resnet

suragnair commented 4 years ago

Hi Jennifer. If I understand correctly, you are interested in training a sequence-only ResNet model and then running the model on held out sequences? And thus you do not require a model that can make predictions in new cell types?

jhammelman commented 4 years ago

Hi Surag,

Sorry for the late reply. You are correct. I was hoping to get a sequence-only ResNet model which I thought I could train if I held out chromosomes in stage 1, based on the description: "The stage 1 models predict accessibility across all training cell types from only sequence, and does not utilise RNA-seq profiles"

Having read your paper, I realize that may not be what the stage 1 model of ChromDragoNN does? I had assumed from the description that the stage 1 model would output 123 probabilities, one per cell type, each representing open chromatin in that cell type.

-Jen


suragnair commented 4 years ago

The training command is most likely correct. I don't believe you need to change runner.py for the training step, but correct me if I'm wrong.

The test command won't work since it'll load the weights into an untrained stage 2 model and try to make predictions from sequence + RNA-seq (which it hasn't been trained on).

I actually don't have code to evaluate the stage 1 model on the test set since it is not something we require for ChromDragoNN (final test evaluation happens on held out cell types).

To evaluate just the stage 1 model on your chromosome of choice, I'd suggest the following steps:

Would that be possible? Do let me know if you have any questions!

Also, during training, was the validation AUPRC reasonable? I would say you can trust that number at the very least.
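For anyone sanity-checking numbers like the AUPRC and AUC quoted in this thread, here is a minimal sketch of how the two metrics can be computed per cell type from a matrix of labels and a matrix of predicted probabilities. It is not the evaluation code from this repository: the function names (`auroc`, `average_precision`, `per_task_metrics`) are made up for illustration, the inputs are assumed to be plain row-major lists with one column per cell type, and the "OVERALL" figures reported by the scripts may be aggregated differently. As a reference point, an AUC close to 0.5 means the scores rank positives no better than random.

```python
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Rank scores in ascending order; tied scores share their average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision at each positive, ranked by descending score.

    Ties are broken by input order here, which is a simplification.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def per_task_metrics(
    label_matrix: List[List[int]], score_matrix: List[List[float]]
) -> List[Tuple[float, float]]:
    """Return (AUROC, AUPRC) for each column, i.e. each cell type."""
    n_tasks = len(label_matrix[0])
    results = []
    for t in range(n_tasks):
        y = [row[t] for row in label_matrix]
        s = [row[t] for row in score_matrix]
        results.append((auroc(y, s), average_precision(y, s)))
    return results
```

Calling `per_task_metrics(labels, probs)` on the held out chromosome's sequences would give one (AUROC, AUPRC) pair per cell type, which makes it easier to see whether every output is at chance level or only some of them.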