Preparing input files - Githubissues

francicco commented 1 year ago

Hi,

I'm trying to understand how to prepare the input files for PhyloAcc. It's the first time I try to run these types of analyses and I'd be very grateful if you could help me with that.

From what I understood PhyloAcc requires concatenated CNEE regions in a fasta file followed by the coordinates in bed format for each. In order to do that I need to generate a PhastCons output, and for that, I need the model generated from PhyloFit.

The first question would be do I have to specify particular regions (CDS/Exons/UTRs/Intergenic) or use the entire alignment instead?
Once I have the model I can run PhastCons giving intergenic regions, I guess. Can I also give it UTRs or CREs if I know them?
I'm looking at the bed file simu_500_200_diffr_2-1.bed in PhyloAcc-test-data, what are the different fields? The first 3 columns I guess are the coordinates, what about the others?

Really really thank you. Best, F

gwct commented 1 year ago

I actually haven't run an analysis starting from the phastCons step yet, but I'll try to help as much as I can. Maybe @tsackton or @xyz111131 can chime in and help catch anything I miss.

I'm not sure what you mean in this question by "the entire alignment". Do you have a whole genome alignment?
It is my understanding that you predict the neutral rates with phyloFit using regions of the genome assumed to have experienced only (or predominantly) neutral substitutions. The PhyloAcc papers used 4-fold degenerate sites in coding sequences. Then this model is used as input to phastCons along with whatever region of the genome you want to predict conserved regions in. If your region of interest is all intergenic sequences then you can use that. If you are interested in UTRs and CREs then you can use those as well. I think in the past for PhyloAcc this has been done on a whole genome alignment.
The first three are the only required columns and are like any other bed file (region, start coordinate, end coordinate). The 4th column is an optional ID column. The extra columns are information about the simulation that you can ignore.

francicco commented 1 year ago

Hi @gwct ,

Thanks a lot for your reply. Yes, when I say the entire alignment I mean the whole genome alignment (hal file converted to MAF/FASTA). Ok, so you say to use 4-fold degenerate sites. Did you use msa_view from the phast package? If so I generate a 4d-codons.ss file and I use it with phyloFit to generate the neurtal model, is that correct?

Thanks a lot F

tsackton commented 1 year ago

Yes using msa_view to get 4d-codons.ss and then phyloFit is a good option to generate the neutral model. Then, you will need to run phastCons to generate conserved regions, and exact the alignments of the conserved regions from the MAF, which can then be the input to phyloAcc (with the .mod file from phyloFit).

francicco commented 1 year ago

Hi @tsackton,

Thanks a lot! Do I need to use multiple annotations and combine the 4d-codons.ss or using a single reference is enough? I have so many questions! :D F

tsackton commented 1 year ago

You should use the same reference annotation as the reference genome in your MAF.

On Tue, Feb 28, 2023 at 4:09 PM Francesco Cicconardi < @.***> wrote:

Hi @tsackton https://github.com/tsackton,

Thanks a lot! Do I need to use multiple annotations and combine the 4d-codons.ss or using a single reference is enough? I have so many questions! :D F

— Reply to this email directly, view it on GitHub https://github.com/phyloacc/PhyloAcc/issues/47#issuecomment-1448919978, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3JU7CMCIFTFKXOWYKG6FTWZZSQ5ANCNFSM6AAAAAAVKQZIF4 . You are receiving this because you were mentioned.Message ID: @.***>

-- Tim Sackton, PhD Director of Bioinformatics Informatics Group Faculty of Arts and Sciences Harvard University

francicco commented 1 year ago

Ok, and do I have to run it chromosome per chromosome? because msa_view looks at the species tag instead of chr name. F

francicco commented 1 year ago

I guess I have to take care of the reverse complement, right? Would it be possible to have your scripts? What would be very useful! Cheers F

francicco commented 1 year ago

Hi @tsackton,

I've found your github with the script. At the moment I'm looking at this script, is that the correct one? Cheers F

tsackton commented 1 year ago

That is fairly old code and wasn't designed to be rerun, but it should outline the general steps you'll need to implement and should mostly work.

On Wed, Mar 1, 2023 at 5:48 AM Francesco Cicconardi < @.***> wrote:

Hi @tsackton https://github.com/tsackton,

I've found your github with the script. At the moment I'm looking at this script https://github.com/tsackton/ratite-genomics/blob/b1f0f6bace97a67ade9c4264a9bf86fb88f699e5/04_wga/02_ce_id/run_phast.sh, is that the correct one? Cheers F

— Reply to this email directly, view it on GitHub https://github.com/phyloacc/PhyloAcc/issues/47#issuecomment-1449857060, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3JU7HZ7UAQBAMNGZ5VXXLWZ4SPJANCNFSM6AAAAAAVKQZIF4 . You are receiving this because you were mentioned.Message ID: @.***>

-- Tim Sackton, PhD Director of Bioinformatics Informatics Group Faculty of Arts and Sciences Harvard University

francicco commented 1 year ago

I have doubts concerning hal2mafMP.py, which keeps failing, is there a better way to parallelize the job? I could split it myself and then maybe concatenate it with msa_view, is that an alternative?

Do I still have to do sed -i -e 2d $MAF, which I don't quite sure what it is for.

If you have new versions of these scripts? I don't have much time to run these analyses before the resubmission of the paper, that would really help!

Thanks a lot! F

francicco commented 1 year ago

Also, hal2mafMP.py --refTargets galGal4_4d.bed ratiteAlign.hal neut4d_input_galGal_ref_ver09222015.maf or hal2maf should produce a single maf file, I don't understand the following msa_view --aggregate, am I missing something?

neut4d_input_galGal_ref_ver09222015.maf this is the input for phyloFit, am I correct?

Sorry for bothering you with so many questions! F

tsackton commented 1 year ago

I haven't really done this stuff for several years now, so unfortunately I don't have updated versions of any scripts. We ultimately hope to provide some more packaged options for preparing inputs to phyloAcc but it is unlikely we'll be able to do this soon.

francicco commented 1 year ago

Ok, thanks a lot! I'll see what I can do by myself. I'm so sorry if I've bothered you.

Best, F

tsackton commented 1 year ago

I'm happy to see if we can help debug anything that isn't working, so feel free to post error messages you get if you run into something you can't debug and we'll see if we can help. But I don't think I'll be able to write new stuff in the near future to do this, although it is on the long-term roadmap for PhyloAcc.

francicco commented 1 year ago

Dear @gwct

I have a theoretical question on how to generate conserved elements. I have a WGA of 63 species. With phyloacc I wanted to look for accelerations of non-coding regions on a specific branch. This involves the majority of the species (44 species). If I generate my conserved elements using the entire dataset, wouldn't I mask the acceleration on my target branch? Would it be better to generate the conserved elements on the outgroup only, the 19 species not involved in my target branch?

Thanks a lot Francesco

francicco commented 1 year ago

Dear @tsackton,

Thanks a lot, but my problem isn't necessarily in the debugging, but rather in how to execute things correctly, without violating the procedures. For example, since hal2maf is now changed I can't reproduce your pipeline, and I'm not sure if I'm introducing any error in converting files.

Thank you anyways. F

gwct commented 1 year ago

That's a good question. Since the branch you're looking at is relatively deep in your sampling I can see what you're concerned about. But, just because an element is accelerated on that single internal branch does not mean it will also be accelerated in all of the descending branches. Even though that actually is an assumption of the current implementation of PhyloAcc, this shouldn't affect the prediction of conserved elements

So I don't think its necessary to subset your alignment for conserved element prediction, but since I've actually never run this part of the pipeline I would wonder if @tsackton has any thoughts.

tsackton commented 1 year ago

Generally speaking, you will fail to detect conserved elements that are accelerated on a substantial subset of your lineages in your alignment, but as far as I know there is no general consensus on when this is a sufficiently large problem to be worth addressing by e.g. supplementing your main CE detection with additional CEs detected in an alignment with the target species removed.

I would consider it primarily if the target species, where you expect acceleration, are more than say 30% of your total species. In that case it might be worth making a second alignment with your target species removed and also running phastCons on that alignment.

But it is probably not crucial.

francicco commented 1 year ago

Basically what I was doing. I took all the outgroup species + 4 out of 44 of my target species, and I used one of the outgroup species as a reference. What is actually interesting is that with the same model but using one ingroup as a reference I get a lot more CE, 10 times less. Make sense right?

phyloacc / PhyloAcc

Preparing input files #47