Closed francicco closed 1 year ago
I actually haven't run an analysis starting from the phastCons step yet, but I'll try to help as much as I can. Maybe @tsackton or @xyz111131 can chime in and help catch anything I miss.
Hi @gwct ,
Thanks a lot for your reply. Yes, when I say the entire alignment I mean the whole genome alignment (hal file converted to MAF/FASTA).
Ok, so you say to use 4-fold degenerate sites. Did you use msa_view
from the phast package? If so I generate a 4d-codons.ss
file and I use it with phyloFit
to generate the neurtal model, is that correct?
Thanks a lot F
Yes using msa_view to get 4d-codons.ss and then phyloFit is a good option to generate the neutral model. Then, you will need to run phastCons to generate conserved regions, and exact the alignments of the conserved regions from the MAF, which can then be the input to phyloAcc (with the .mod file from phyloFit).
Hi @tsackton,
Thanks a lot! Do I need to use multiple annotations and combine the 4d-codons.ss
or using a single reference is enough?
I have so many questions! :D
F
You should use the same reference annotation as the reference genome in your MAF.
On Tue, Feb 28, 2023 at 4:09 PM Francesco Cicconardi < @.***> wrote:
Hi @tsackton https://github.com/tsackton,
Thanks a lot! Do I need to use multiple annotations and combine the 4d-codons.ss or using a single reference is enough? I have so many questions! :D F
— Reply to this email directly, view it on GitHub https://github.com/phyloacc/PhyloAcc/issues/47#issuecomment-1448919978, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3JU7CMCIFTFKXOWYKG6FTWZZSQ5ANCNFSM6AAAAAAVKQZIF4 . You are receiving this because you were mentioned.Message ID: @.***>
-- Tim Sackton, PhD Director of Bioinformatics Informatics Group Faculty of Arts and Sciences Harvard University
Ok, and do I have to run it chromosome per chromosome? because msa_view looks at the species tag instead of chr name. F
I guess I have to take care of the reverse complement, right? Would it be possible to have your scripts? What would be very useful! Cheers F
Hi @tsackton,
I've found your github with the script. At the moment I'm looking at this script, is that the correct one? Cheers F
That is fairly old code and wasn't designed to be rerun, but it should outline the general steps you'll need to implement and should mostly work.
On Wed, Mar 1, 2023 at 5:48 AM Francesco Cicconardi < @.***> wrote:
Hi @tsackton https://github.com/tsackton,
I've found your github with the script. At the moment I'm looking at this script https://github.com/tsackton/ratite-genomics/blob/b1f0f6bace97a67ade9c4264a9bf86fb88f699e5/04_wga/02_ce_id/run_phast.sh, is that the correct one? Cheers F
— Reply to this email directly, view it on GitHub https://github.com/phyloacc/PhyloAcc/issues/47#issuecomment-1449857060, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3JU7HZ7UAQBAMNGZ5VXXLWZ4SPJANCNFSM6AAAAAAVKQZIF4 . You are receiving this because you were mentioned.Message ID: @.***>
-- Tim Sackton, PhD Director of Bioinformatics Informatics Group Faculty of Arts and Sciences Harvard University
I have doubts concerning hal2mafMP.py, which keeps failing, is there a better way to parallelize the job? I could split it myself and then maybe concatenate it with msa_view
, is that an alternative?
Do I still have to do sed -i -e 2d $MAF
, which I don't quite sure what it is for.
If you have new versions of these scripts? I don't have much time to run these analyses before the resubmission of the paper, that would really help!
Thanks a lot! F
Also, hal2mafMP.py --refTargets galGal4_4d.bed ratiteAlign.hal neut4d_input_galGal_ref_ver09222015.maf
or hal2maf
should produce a single maf file, I don't understand the following msa_view --aggregate
, am I missing something?
neut4d_input_galGal_ref_ver09222015.maf
this is the input for phyloFit
, am I correct?
Sorry for bothering you with so many questions! F
I haven't really done this stuff for several years now, so unfortunately I don't have updated versions of any scripts. We ultimately hope to provide some more packaged options for preparing inputs to phyloAcc but it is unlikely we'll be able to do this soon.
Ok, thanks a lot! I'll see what I can do by myself. I'm so sorry if I've bothered you.
Best, F
I'm happy to see if we can help debug anything that isn't working, so feel free to post error messages you get if you run into something you can't debug and we'll see if we can help. But I don't think I'll be able to write new stuff in the near future to do this, although it is on the long-term roadmap for PhyloAcc.
Dear @gwct
I have a theoretical question on how to generate conserved elements. I have a WGA of 63 species. With phyloacc I wanted to look for accelerations of non-coding regions on a specific branch. This involves the majority of the species (44 species). If I generate my conserved elements using the entire dataset, wouldn't I mask the acceleration on my target branch? Would it be better to generate the conserved elements on the outgroup only, the 19 species not involved in my target branch?
Thanks a lot Francesco
Dear @tsackton,
Thanks a lot, but my problem isn't necessarily in the debugging, but rather in how to execute things correctly, without violating the procedures. For example, since hal2maf is now changed I can't reproduce your pipeline, and I'm not sure if I'm introducing any error in converting files.
Thank you anyways. F
That's a good question. Since the branch you're looking at is relatively deep in your sampling I can see what you're concerned about. But, just because an element is accelerated on that single internal branch does not mean it will also be accelerated in all of the descending branches. Even though that actually is an assumption of the current implementation of PhyloAcc, this shouldn't affect the prediction of conserved elements
So I don't think its necessary to subset your alignment for conserved element prediction, but since I've actually never run this part of the pipeline I would wonder if @tsackton has any thoughts.
Generally speaking, you will fail to detect conserved elements that are accelerated on a substantial subset of your lineages in your alignment, but as far as I know there is no general consensus on when this is a sufficiently large problem to be worth addressing by e.g. supplementing your main CE detection with additional CEs detected in an alignment with the target species removed.
I would consider it primarily if the target species, where you expect acceleration, are more than say 30% of your total species. In that case it might be worth making a second alignment with your target species removed and also running phastCons on that alignment.
But it is probably not crucial.
Basically what I was doing. I took all the outgroup species + 4 out of 44 of my target species, and I used one of the outgroup species as a reference. What is actually interesting is that with the same model but using one ingroup as a reference I get a lot more CE, 10 times less. Make sense right?
Hi,
I'm trying to understand how to prepare the input files for PhyloAcc. It's the first time I try to run these types of analyses and I'd be very grateful if you could help me with that.
From what I understood PhyloAcc requires concatenated CNEE regions in a fasta file followed by the coordinates in bed format for each. In order to do that I need to generate a PhastCons output, and for that, I need the model generated from PhyloFit.
simu_500_200_diffr_2-1.bed
inPhyloAcc-test-data
, what are the different fields? The first 3 columns I guess are the coordinates, what about the others?Really really thank you. Best, F