jzhoulab / puffin

deep learning-inspired explainable sequence model for transcription initiation
https://puffin.zhoulab.io

Running Puffin for mouse genome #7

Closed AndreaMariani-AM closed 4 months ago

AndreaMariani-AM commented 5 months ago

Hi,

Thanks for the great tool. I have two questions: 1) I've seen that you have trained Puffin on both the hg38 and mm10 genomes. I was wondering whether the model weights provided on Zenodo (.pth) are hg38 only, and if so, whether there is a way to retrieve the mm10 weights? Or, given that you state in the manuscript "Moreover, applying the human Puffin model to mouse sequences achieves almost identical performance as the mouse model (Fig. 6B). Thus the transcription initiation sequence dependencies learned by Puffin between human and mouse are nearly interchangeable.", did you decide to release only the human one? I'd like to make predictions on sets of mm10 sequences and am wondering how best to approach this.

2) Right now, Puffin and Puffin-D accept a single string as input, is that correct? The input cannot be a list of sequences? (In that case I'll write my own routine to get multi-sequence predictions :) )

Thank you very much, Andrea

KseniiaDundyk commented 4 months ago

Dear Andrea,

Thank you for your interest in our work!

  1. Yes, since predictions made on mouse sequences using the human Puffin model and the mouse Puffin model are almost identical, we believe it is mostly safe to apply the human model to the mouse genome. We do have a stage 1 mouse Puffin model (as you may recall from the manuscript, we trained Puffin in three stages). The main difference between the stage 1 and stage 3 models is that the stage 1 model is less interpretable. Therefore, if interpretability is a priority for you, it would be easier to use the human Puffin model on the mouse genome. If you are still interested in the model trained on mouse data, we can provide the mouse .pth file.

  2. If you want to submit multiple sequences, you can use a FASTA file as input for Puffin and Puffin-D. However, if you want to make predictions for a batch of sequences simultaneously to speed up prediction time, you can do so by modifying the code: convert the sequences into one-hot encodings and stack them into a numpy array of shape batch size × 4 × sequence length. Then run puffin.forward() or puffin_D.forward() with this array as input.
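To illustrate the batching step, here is a minimal sketch of one-hot encoding a list of equal-length sequences into the batch × 4 × length array described above. Note the base-to-channel ordering (A, C, G, T here) is an assumption; check Puffin's own encoding utilities for the ordering the model actually expects, and the final `puffin.forward()` call is shown only as a hypothetical usage.

```python
import numpy as np

# Assumed channel ordering A, C, G, T -- verify against Puffin's own
# encoding code before using this with the real model.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_batch(sequences):
    """Encode equal-length DNA sequences as a (batch, 4, seq_len) float32 array."""
    seq_len = len(sequences[0])
    batch = np.zeros((len(sequences), 4, seq_len), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for j, base in enumerate(seq.upper()):
            idx = BASE_INDEX.get(base)  # N and other ambiguity codes stay all-zero
            if idx is not None:
                batch[i, idx, j] = 1.0
    return batch

seqs = ["ACGTN" * 4, "TTTTA" * 4]
x = one_hot_batch(seqs)
print(x.shape)  # (2, 4, 20), i.e. batch size x 4 x sequence length

# Hypothetical use with a loaded Puffin model (not run here):
# import torch
# preds = puffin.forward(torch.from_numpy(x))
```

Each column of the encoded array sums to 1 for an unambiguous base and to 0 for an N, so ambiguous positions contribute nothing to the prediction.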

Best, Kseniia Dudnyk

AndreaMariani-AM commented 4 months ago

Dear Kseniia,

This answers both my questions perfectly! Thank you very much! I'll close the issue :).

Have a great day,

Andrea