Closed francicco closed 1 year ago
Soved that, but now I've got this:
# 03.21.2023 22:48:15 Reading species/branch groups Success 0.20233 0.00259 54.05859 8725.70703
# 03.21.2023 22:48:15 Detecting compression of seq file Success: No compression detected 0.20281 0.00025 54.05859 8725.70703
# 03.21.2023 22:48:15 Reading input FASTA Success: 63 seqs read 0.20372 0.00072 54.19141 8725.83984
# 03.21.2023 22:48:15 Detecting compression of bed file Success: No compression detected 0.20407 0.00011 54.19141 8725.83984
# 03.21.2023 22:48:15 Reading input bed file Success: 6 loci read 0.20439 0.00014 54.19141 8725.83984
# 03.21.2023 22:48:15 Partitioning alignments by locus Success: 6 alignments partitioned 0.20482 0.00025 54.39844 8725.83984
# 03.21.2023 22:48:15 Calculating alignment stats Success: 6 alignments processed 0.23773 0.0327 54.6875 8733.91406
# 03.21.2023 22:48:15 Writing: phyloacc-aln-stats.csv Success: align stats written 0.23841 0.00041 54.6875 8733.91406
# 03.21.2023 22:48:15 Writing PhyloAcc job files Success: 6 jobs written 0.25254 0.01392 54.74609 8733.91406
# 03.21.2023 22:48:15 Writing Snakemake file Success: Snakemake file written 0.25468 0.00192 54.74609 8733.91406
# 03.21.2023 22:48:15 Writing Snakemake config file Success: Snakemake config written 0.25513 0.00024 54.74609 8733.91406
# 03.21.2023 22:48:15 Writing Snakemake cluster profile Success: Snakemake profile written 0.25551 0.00021 54.74609 8733.91406
# 03.21.2023 22:48:15 Generating summary plots In progress... Traceback (most recent call last):
File "/user/home/tk19812/.conda/envs/phyloacc-env/bin/phyloacc.py", line 175, in <module>
PLOT.genPlots(globs);
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/plot.py", line 69, in genPlots
targets, conserved, outgroups = TREE.categorizeBranches(globs, globs['st']);
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/tree.py", line 832, in categorizeBranches
d1, d2, = tree.desc[node];
ValueError: not enough values to unpack (expected 2, got 1)
Oh I know about this error and I keep meaning to fix it. But it happens after the job files have all been created, so you should be able to just ignore it. You just won't get the nice page with summary plots. I'll see if I can get to this soon, but if you look in the phyloacc-job-files/cfgs/
folder, there should be one config file for every batch you wanted, and the snakemake file should be there as well (phyloacc-job-files/snakemake/run_phyloacc.smk
). You should be able to execute that file to run all the phyloacc batches. See here if you need help with that: https://phyloacc.github.io/readme.html#snakemake
Hi @gwct,
Thanks for your reply. Yes the files are there, I realised that the other day, but for some reason, I get a segmentation fault when I execute them.
Loading input data and running parameters......
Loading program configurations from /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/phyloacc/phyloacc-job-files/cfgs/1-st.cfg......
Unknown parameter: THIN
Loading phylogenetic tree from /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/NeutralModelPhyloACC.mod......
./PhyloAcc.sh: line 28: 276842 Segmentation fault PhyloAcc-ST /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/phyloacc/phyloacc-job-files/cfgs/1-st.cfg
I changed the tree in the model as the output from astral from single copy genes, maybe there's something wrong with that?
ALPHABET: A C G T
ORDER: 0
SUBST_MOD: SSREV
BACKGROUND: 0.011709 0.488291 0.488291 0.011709
RATE_MAT:
-2.355880 0.581503 1.754731 0.019646
0.013945 -0.967485 0.911461 0.042079
0.042079 0.911461 -0.967485 0.013945
0.019646 1.754731 0.581503 -2.355880
TREE: (((((((((((((((((Hcyd:1.0,Hpac:1.0):0.6804018036202207,(Hheu:1.0,Htim:1.0):2.045115939943643):1.0775727658119176,Hmel:1.0):2.5845526956832514,(((Helv:1.0,Hpar:1.0):2.520043184039743,(Hhel:1.0,Hatt:1.0):1.6674287241514767):0.5943716240945114,Heth:1.0):1.2579577656795362):1.5826137755600203,(((Hism:1.0,Hnum:1.0):0.853116661526068,Hbes:1.0):1.6616901002440279,Hnat:1.0):0.8026419610317491):5.907942637561168,((Hbur:1.0,Hege:1.0):0.5681373584773383,Hwal:1.0):6.010212433245333):1.9778295292364656,(((Hxan:1.0,Hdor:1.0):3.166826849161951,Hhie:1.0):1.3420656396541348,Hheb:1.0):5.4781616843381205):1.7937053301035975,Haoe:1.0):1.6545372896321975,((((((((Hcon:1.0,Hele:1.0):1.9056103408781728,(Hsap:1.0,Hhew:1.0):0.0723574664456409):3.375750647347987,Hant:1.0):3.5979134179031544,(Hsar:1.0,Hleu:1.0):3.3139879260346636):6.661203729637612,((Hdem:1.0,Hert:1.0):6.733401891837363,Hric:1.0):0.051381042893976936):0.7606409147268492,(Hper:1.0,Hcha:1.0):6.734195227856759):5.292321714137381,((Hhor:1.0,Hcly:1.0):4.025034532470522,Htel:1.0):4.687512705547947):0.10896937344285544,(((((Heet:1.0,Hhim:1.0):0.024118267198463855,Hlat:1.0):0.1336354728775984,(Herd:1.0,Hpet:1.0):0.6285960561543454):3.400604306532743,Hher:1.0):2.414880217442409,Hhec:1.0):3.087619891558481):5.982172686827108):5.014876114295957,(((Elyb:1.0,Etal:1.0):2.442904994703678,Eali:1.0):1.4989967616218582,((Evib:1.0,Elam:1.0):4.241425963841232,Eisa:1.0):5.257831552909168):5.784125852827229):4.570218310838701,(((Avcr:1.0,Avfl:1.0):0.3414240414621923,Avpe:1.0):4.574779159236899,Djun:1.0):5.923550662892739):3.107797333591328,(((Ptel:1.0,Diul:1.0):2.146008485910313,Dpha:1.0):1.7448773782809164,Pdid:1.0):1.0411040264749596):2.3561332669746173,Smor:1.0):3.4206447730944882,(Mcin:1.0,Jcoe:1.0):1.4339057396001638):0.8566891380633856,Bany:1.0):1.0),Dple:1.0);
Any idea why? Do I need to include the ancestral node in both the tree and the alignment? Thanks a lot F
I tested another tree, a ML one and it returns the same error. F
The tree in the .mod
file should be the tree with branch lengths in terms of neutral substitutions that you got from phyloFit. The coalescent/ASTRAL tree should be provided as another option in the PhyloAcc config files. If you had the interface estimate that tree, then it should be added automatically. If you had this tree beforehand, you should specify the path to it when you run the interface with the -l
option. But based on your call to PhyloAcc it looks like you're just running the species tree model (no -r
option specified, default mode is the species tree model), which doesn't need a coalescent tree.
The error you're seeing though may be related to #45 and #48: are your internal nodes in the tree in the mod file labeled? This is required but not well documented currently, unfortunately. There's a method for labeling the tree here: https://github.com/xyz111131/PhyloAcc_v1#trouble-shooting The ancestral nodes only need to be in the tree, not the alignment.
OOOh, I see. I'll fix these problems and I'll let you know! Thanks a lot! F
It's gonna be a problem. I generate the phyfit model with a subset of species to speed it up. with the whole data It was running for 3-weeks. I'm trying to see if I can update the model with --init-model
. Any other ideas or suggestions about that?
Cheers
F
Hmm, I'm not sure what you mean by its gonna be a problem. If you've already run phyloFit you should be able to just label that tree that you already generated. Does that make sense?
In theory, but I generated the phylofit model using a subset of the species. I have a lot of genomes that are 5mya old. So I didn't consider them in computing the phylofit model. Now the tree in my final model has only the species I used, and not the 63 of them. Makes sense?
Ah yes, I see. Yea, your tree has to be from the full set (sorry I didn't answer that question in #47 earlier). I don't think it should run for 3 weeks. @tsackton or @xyz111131 any ideas why phyloFit would take that long?
phyloFit should be quite fast as all it is doing is estimating substitution rates on a fixed tree. What is the phyloFit command you are running that is taking a long time?
Also, what is your input alignment - is this for example an SS file of 4d sites? Or something else?
it's a maf 4d sites. And this is the command:
phyloFit --tree $TREE --subst-mod SSREV --out-root neut_ --msa-format MAF --sym-freqs --log phyloFit_${SP}_${ITER}.log neut4d_input_$SP.maf
Is there anything in the log file? I've never seen phyloFit take 3 weeks to run but without an error message it is tricky to figure out why it might be stalling.
The one thing I might try is converting the maf to SS using msa_view and then trying again and seeing if that helps.
I don't have the logs anymore, I'll try it again also converting to SS. At the moment I'm reconverting the hal into maf. I hope it will finish soon, and I'll let you know Thanks a lot F
If I try to convert it with msa_view
it says that the maf needs to be sorted. If I use mafSorter it asks me a reference sequence, but in my maf I have multiple ref sequence, and if I five it one it returns segmentation fault
.
I'm kind of stuck...
Hi @tsackton
would it be ok if I run independently for each chromosome and at the end I average across the models with phyloBoot. You did it for the conserved elements, might work for the neutral model as well, right? F
Yes, that should be fine
I set a temporary model for phyloacc, the plotting now works.
Still returns me a segmentation fault. I think I didn't quite get how to give it the coalescent tree.
I tried with -l Astral.tree
+ --theta
, only -l Astral.tree
or only --theta
, but still get seg fault
What am I doing wrong?
Yes, that should be fine
Great @tsackton, thanks a lot! F
You only need a coalescent tree if you run the gene tree model (-r gt
or -r adaptive
). In the log file you sent at the beginning of the thread, you didn't specify -r
so it will run in the default species tree mode (st
). If that's what you want, then you don't need to worry about the coalescent tree.
If you DO want to run the gene tree model, you will have to set -r
to gt
or adaptive
. Note that gt
will take a very long time depending on how many elements and computational resources you have. With adaptive
, it will pick only the elements with the lowest average site concordance factors (sCF) to run with the gene tree model. sCF cutoffs can be controlled by -scf
and -s
. And in this case, you could either give the tree with branch lengths in coalescent units directly as input with -l
, or if none is given you will have to select --theta
and the snakemake workflow will select up to 5000 of your longest elements to generate it for you (via IQ-tree gene trees and ASTRAL). This would also add to the runtime of the pipeline. Either -l
OR --theta
, not both. Does that make sense?
Regarding the segmentation fault: did you add labels to the internal nodes of your tree? Sorry if you said you did already and I missed it.
Thanks a lot! I a bit slow :P
Yes, I thought the tree was fully labeled, but it's actually not in some of them. I'll correct it!
Thanks really a lot! F
Does also the coalescent tree need to be annotated? I guess so, right? F
I don't recall if that is a requirement, but it probably won't hurt.
Hi @gwct,
The first batch of a test run has finished. I've been reading the manual for some insight on how to interpret the results, which seem quite complicated. Lots of information, and most of them are quite interesting but I'm still confused about which one to look at. Can you clarify a bit how to extract the info I most need? Thanks a lot F
Hi @tsackton,
So, After I averaged on chrs I rerun phyloFit
with --subst-mod SSREV --sym-freqs
using --init-model ave_model
on the full dataset. That run for 2 days. then I corrected the resulting model with the base composition.
I guess it should be correct now, right?
F
So the output files listed on the website are those for each individual PhyloAcc run (https://phyloacc.github.io/readme.html#output). However, if you run the phyloacc_post.py
script, it should compile almost everything for you in a single table. Basically, you want to look for loci in which M1 fits better than M0 or M2, which is done when the Bayes factors are large enough. Those loci are accelerated specifically in your target branches (though not necessarily exclusively, but a model that restricts acceleration to these lineages fits better). Then you can look at which lineages specifically are accelerated in a given locus, and the rates for those loci as well.
The output file from phyloacc_post.py
contains a lot, but also has summaries and explanations of each column in the file header. So take a look at that and then let me know if you have any more specific questions!
Hi @gwct,
So, I analysed 13130. Of them, 2994 have M1 in the best.fit.model. If I understood correctly within the accelerated loci there could be taxa that do not belong to my target species, correct? Therefore I have to check in the column "accel.lineages.m1" and exclude loci that have also non-target taxa in it if I want taxa that exclusively are accelerated in my target. Is my interpretation correct?
Also, my input was made of 227,778 loci, and only 13,130 were analysed. I guess phyloacc tests only the informative sites, correct? Thanks a lot!
F
Yes, it doesn't really make sense to run PhyloAcc (or any rate estimation program) on loci without any informative sites since there will inevitably be no accelerations and M0 will be favored. And since we're working on conserved elements, it is definitely possible for many loci to not have any variation at all. In another analysis I've talked with someone about, there are no informative sites in any of their loci, though I don't know how many loci they had.
Though this also depends on the number of species and their divergence times from one another -- as each increases I would expect the number of loci with informative sites to increase as well. I remember you saying you had 60 some species? I would probably expect more loci with informative sites in this case. As a comparison, with the bird data (284001 loci with 43 species), there are only 252 loci without informative sites. What's the approximate divergence time of oldest split in your tree? If it is more than a few million years (just guessing here), then this may be something I would want to look at in the code to confirm informative sites are calculated correctly. The other possibility is that you over-trimmed your alignments. If you use something like G-Blocks, for instance, the default trimming settings oftentimes remove phylogenetic informative sites.
I'm sorry, my bad! the script that should concatenate the data had a but and it returned the wrong CEs, I fixed it. Now:
# 04.03.2023 14:55:31 Reading input bed file Success: 227778 loci read 3.82391 0.3684 1774.16406 8392.47656
# 04.03.2023 14:55:36 Partitioning alignments by locus Success: 227778 alignments partitioned 9.04601 5.22179 4549.83203 11171.12891
# 04.03.2023 14:57:31 Calculating alignment stats Success: 227778 alignments processed 124.17412 115.12702 4921.36719 11538.97266
# INFO: 213 loci have 0 informative sites and will be removed from the analysis.
# 04.03.2023 14:57:32 Writing: phyloacc-aln-stats.csv Success: align stats written 125.48634 1.30807 4921.40625 11546.97266
It will run for a week or so depending on how the cluster is free.
Interestingly enough there is a similar amount of CNEEs 227778 vs 284001 with similarly discarded loci 213 vs 252.
I didn't trim the alignments. They are as they come directly from the alignment. To give you some context, the branch I'm interested in is ~12 mya and most of the taxa are in ~28 mya with the root at ~90.
I'll let you know as soon as the run is finished.
Thanks a lot a sorry again for the confusion. F
Hi @gwct,
The analysis has finished! And it looked faster than I thought. You probably need to guide me through the results so I can understand better what's going on, if you don't mind. I'd really appreciate it.
So phyloacc analysed 227,565 intergenic CEs (I put on the side in intronic ones atm). If I look at the elem_lik.txt
113,008 elements evolved under M0 (~50%), 65,216 under M1 (~29%), and 49,341 under M2 (~22%). Now I'd be interested in the CEs evolving under M1, but I'm confused by the fields conserved.lineages.m1
, accel.lineages.m1
, conserved.lineages.m2
, accel.lineages.m2
.
Specifically, I don't if M1 CEs should not contain any other species not in my target group.
Also, if i'd want to reproduce this figure how would I generate the branch length? Did you guys used iqtree on those elements using the species tree phylogeny?
Is there anything else that could be interesting to look at? Cheers F
Oh, just remembered.
I'm also getting this error
# 04.05.2023 14:12:08 Generating summary plots In progress... Traceback (most recent call last):
File "/user/home/tk19812/.conda/envs/phyloacc-env/bin/phyloacc_post.py", line 494, in <module>
globs = PLOT.genPlotsPost(globs);
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/plot.py", line 394, in genPlotsPost
plt.hist(bf2s, color=PC.coreCol(pal="wilke", numcol=1, offset=2)[0], bins="sturges", edgecolor="#999999");
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/matplotlib/pyplot.py", line 2573, in hist
return gca().hist(
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/matplotlib/__init__.py", line 1423, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/matplotlib/axes/_axes.py", line 6737, in hist
m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
File "<__array_function__ internals>", line 180, in histogram
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/numpy/lib/histograms.py", line 793, in histogram
bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/numpy/lib/histograms.py", line 396, in _get_bin_edges
first_edge, last_edge = _get_outer_edges(a, range)
File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/numpy/lib/histograms.py", line 315, in _get_outer_edges
raise ValueError(
ValueError: supplied range of [-682.54444, inf] is not finite
Correct, loci that fit M1 best should not have non-target lineages listed as accelerated in the accel.lineages.m1
column. M1 specifically restricts other lineages from being in the accelerated state.
Which model results you want to use depend on the question you are asking: do you want to identify loci in which the best fitting model is one that restricts accelerations to your target lineages? If so, then use the loci that fit M1. Or do you want to know the loci for which only your target branches had acceleration given a model that allows acceleration everywhere? If that's the case, then you could take the loci that fit M2 and look for ones that only have your target lineages in the accel.lineages.m2
column. We think that if you have a specific hypothesis to test with target lineages, then it is better to use the M1 results (see the section on page 32-33 of the pre-print for PhyloAcc-GT).
For the figures, I'm not sure exactly how those were created. I think the code is similar to what is here, but that is likely based on old output formats. I usually make figures like that in R with something like ggtree or phytools .
I hope that is helpful.
For the error, it looks like matplotlib
or numpy
can't handle the inf
value. I'm not sure why an inf
would be there so I'll have to look into it more.
Cool, understood. I took M1, so accelerated only in my target, & where the stem of my target species IS present in accel.lineages.m1
. That should indicate the loci that are accelerated only at the stem of my target species. Correct?
I'll look at those scripts. I was just curious how those topologies were generated, because, as conserved elements, they should have limited phylogenetic signal.
Cheers F
Oh, right you mentioned the branch lengths. That is definitely confusing. So as I understand it, you would multiply the branch length in your input tree, presumably in units of relative number of neutral substitutions, by the posterior median of the conserved or accelerated substitution rates for the model you're plotting (e.g. conserved.rate.m1
and accel.rate.m1
for M1).
So if I was plotting a tree for a locus under M1, I would multiply all the branch lengths in my input tree that are in the accel.lineages.m1
column by the number in the accel.rate.m1
column and all the branches in conserved.lineages.m1
by conserved.rate.m1
. And then repeat for the next locus using the values in those columns for that locus. Does that make sense?
Oh, ok! I'd never have thought about that! Thanks a lot! F
Hi @gwct,
I'm summing up the results, which I think look pretty good. There are over 2500 elements that seem to be accelerated, with several GO term enriched. I was wondering if with the current analysis would be possible to do some comparisons between the target branch and other internal branches, for example, would it be correct if I take the elements that have M2
as the best model, and take the lineages in accel.lineages.m2
field?
I'd like to establish if the number of accelerated elements, at my target branch, is more to be expected than by chance. In the ratite paper, you guys do a permutation test, but I think my case is a bit different since I'm not looking for convergence.
Also, is there a quick way to spot ultraconserved elements from elem_lik.txt
, I guess they at least needs to be under M0 model.
Thanks a lot F
Sorry for the delay. I think if you compare to lineages that are accelerated under M2 you'd also have to only look at loci where your target lineages are accelerated in M2. In other words, looking at M2 only, pick out loci where your target lineages are accelerated, then pick another random set of lineages and count the same. The other option would be to pick your random set of lineages and re-run PhyloAcc with those lineages as targets. Then compare that to the results when your lineages of interest were targets using M1. I think both methods are fine. The first one (using M2) is probably faster, and you could easily do a lot of random sampling of lineages to build an empirical null distribution of expected number of times a lineage or set of lineages is accelerated. Though then you may have a different definition of accelerated lineages for your targets for different analyses: identifying loci of interest as those that M1 fits best, and identifying whether these lineages are accelerated more often than others using M2.
For ultraconserved elements I'm not too sure. Maybe you could make a distribution of conserved.rate.m0
for all loci where M0 fits best and pick loci that fall in the lower end of that distribution. But I'm not 100% sure this does what you wanted.
No worries @gwct,
I think that for this paper might be enough, the referee should be happy, but it's something I'm going to explore more in future projects. It seems a very interesting approach. Cheers F
Good luck!
Hi,
I'm getting an error while the data is read:
Not sure why, I made a script to concatenate the data, not sure if I introduced some errors. Files are attached
CNEE.bed.txt CNEE.fasta.txt
Can you please help me? Thank you a lot F