jzhoulab / puffin

deep learning-inspired explainable sequence model for transcription initiation
https://puffin.zhoulab.io

Use Puffin to annotate variant? #5

Closed fransilvionGenomica closed 2 months ago

fransilvionGenomica commented 3 months ago

Hello,

Thank you for an amazing tool! Your README says Puffin can be used to "Predict the effect of mutations on transcription initiation and study its mechanisms." If I have a VCF file with variants in TSS regions, what is the best way to use Puffin to check their effect on transcription initiation?

KseniiaDundyk commented 3 months ago

I'm glad you like our tool! To estimate a mutation's effect with the Puffin model, you can reconstruct the wild-type and mutant genomic sequences from the VCF file and compare Puffin's predictions for the two sequences. This shows how the mutation changes the transcription initiation signal. For example, you can calculate the difference in the transcription initiation signal within a specific window in the TSS region.
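A minimal sketch of that comparison, in plain Python. It assumes you have already extracted the forward-strand reference sequence for a window around the variant and that the model's output for each sequence is available as a per-position list of numbers. The helper names `apply_variant` and `window_effect` are hypothetical and are not part of Puffin.

```python
def apply_variant(ref_seq, window_start, pos, ref, alt):
    """Return the mutant sequence for one VCF record.

    ref_seq      -- forward-strand reference sequence of the window
    window_start -- 1-based genomic coordinate of ref_seq[0]
    pos, ref, alt -- POS, REF and ALT fields of the VCF record (POS is 1-based)
    """
    offset = pos - window_start
    if offset < 0 or offset + len(ref) > len(ref_seq):
        raise ValueError("variant falls outside the sequence window")
    observed = ref_seq[offset:offset + len(ref)]
    if observed.upper() != ref.upper():
        # Usually a sign of a wrong genome build or an off-by-one coordinate.
        raise ValueError(f"REF allele {ref!r} does not match reference {observed!r}")
    return ref_seq[:offset] + alt + ref_seq[offset + len(ref):]


def window_effect(wt_signal, mut_signal, center, half_width):
    """Summed mutant-minus-wild-type signal in [center - half_width, center + half_width]."""
    lo = max(0, center - half_width)
    hi = min(len(wt_signal), center + half_width + 1)
    return sum(mut_signal[lo:hi]) - sum(wt_signal[lo:hi])


if __name__ == "__main__":
    wild_type = "ACGTACGTAC"  # toy window starting at genomic position 101
    mutant = apply_variant(wild_type, window_start=101, pos=104, ref="T", alt="G")
    print(mutant)  # ACGGACGTAC

    # Toy per-position signals standing in for model predictions.
    print(window_effect([0, 1, 2, 3, 4], [0, 1, 5, 3, 4], center=2, half_width=1))  # 3
```

Note that insertions and deletions change the length of the mutant sequence, so positions downstream of the variant no longer line up with the wild-type prediction; a position-wise comparison like the one above is only straightforward for substitutions.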

  1. If you want to check a handful of mutations, you can use our interactive website, puffin.zhoulab.io. You can paste the sequence of interest and then download the predicted tracks. Alternatively, you can enter the genomic coordinates and manually alter the sequence. If you want to visualize predictions for the wild-type and mutant sequences side by side, you can use the 'Compare' tool located next to the 'Home' button. The Puffin web server is also useful if you want to investigate why a specific mutation affects the transcription initiation signal (for example, does the mutation cause a change in motif activation, a motif deletion, or a shift?).
  2. If you have many mutations that you want to study, you can run Puffin via the command line using sequences saved in a FASTA file, or you can use the Puffin API.
  3. If you prioritize prediction accuracy over interpretability, you can use the Puffin-D model instead of Puffin. Puffin-D is a prediction-focused model, but it is not as interpretable. Similarly, you can run Puffin-D via the command line, use the Puffin-D API, or utilize the Puffin-D web server.
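For the command-line route, the wild-type and mutant sequences first need to be saved in a FASTA file. The sketch below writes one record per sequence using only the standard library. The helper name `write_fasta`, the record names, and the file name are hypothetical, and writing each sequence on a single line is an assumption; check the repository README for the exact input the scripts expect (for example, the required sequence length).

```python
def write_fasta(records, path):
    """Write (name, sequence) pairs to a FASTA file, one sequence per line."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")


if __name__ == "__main__":
    # Toy sequences; real inputs would be the reconstructed genomic windows.
    records = [
        ("variant1_wt", "ACGTACGTAC"),
        ("variant1_mut", "ACGGACGTAC"),
    ]
    write_fasta(records, "variants.fa")
```

The resulting file can then be passed to the script in the form quoted in this thread, `puffin_D.py sequence fasta_file`.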

I hope this is helpful!

fransilvionGenomica commented 2 months ago

@KseniiaDundyk Thank you. When I use the "puffin_D.py sequence fasta_file" command and my variants are on the forward strand, do I need to provide forward sequences to the model and collect only the first 5 output channels? Conversely, if the variants are on the reverse strand, should I give the model reverse-complement sequences and collect the last 5 output channels? Or is the model strand-agnostic, so that I can always give it forward-strand sequences?

KseniiaDundyk commented 2 months ago

Either way should work. You can always input the forward-strand sequence and use the first five output channels when the gene of interest is on the ‘+’ strand and the last five output channels when it is on the ‘-’ strand. Alternatively, when the gene is on the ‘-’ strand, you can give the model the reverse-complement sequence, but in that case you need to use the first five channels.
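The channel rule above can be sketched as a small helper. It assumes the prediction is stored channel-first as a list of ten per-position tracks (five per strand); the actual output layout of the scripts may differ, and the function names `reverse_complement` and `select_strand_channels` are hypothetical, not part of Puffin. Only the cases described in this thread are handled.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def select_strand_channels(prediction, gene_strand,
                           input_is_reverse_complement=False, n_per_strand=5):
    """Pick the output channels that correspond to the gene's strand.

    prediction -- list of 2 * n_per_strand tracks (assumed channel-first layout)
    gene_strand -- '+' or '-'
    input_is_reverse_complement -- True if the model was given the
        reverse-complement sequence (only described for '-' strand genes)
    """
    if len(prediction) != 2 * n_per_strand:
        raise ValueError(f"expected {2 * n_per_strand} channels, got {len(prediction)}")
    if gene_strand not in ("+", "-"):
        raise ValueError("gene_strand must be '+' or '-'")
    if input_is_reverse_complement:
        if gene_strand != "-":
            raise ValueError("reverse-complement input is only described for '-' strand genes")
        return prediction[:n_per_strand]   # RC input: first five channels
    if gene_strand == "+":
        return prediction[:n_per_strand]   # forward input, '+' gene: first five
    return prediction[n_per_strand:]       # forward input, '-' gene: last five


if __name__ == "__main__":
    toy_prediction = [[float(i)] for i in range(10)]  # ten one-position tracks
    print(select_strand_channels(toy_prediction, "-"))
    print(reverse_complement("AACGT"))  # ACGTT
```

When the reverse-complement sequence is used as input, the positions of the predicted tracks follow the orientation of that input, so they need to be mapped back to genomic coordinates before comparing against forward-strand annotations.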