phyloacc / PhyloAcc

PhyloAcc a software to detect the changes of conservation of a genomic region
GNU General Public License v3.0
27 stars 12 forks source link

Error reading alignment #49

Closed francicco closed 1 year ago

francicco commented 1 year ago

Hi,

I'm getting an error while the data is read:

#
# =============================================================================================================================
 _____ _           _        ___           
| ___ \ |         | |      / _ \          
| |_/ / |__  _   _| | ___ / /_\ \ ___ ___ 
|  __/| '_ \| | | | |/ _ \|  _  |/ __/ __|
| |   | | | | |_| | | (_) | | | | (_| (__ 
\_|   |_| |_|\__, |_|\___/\_| |_/\___\___|
              __/ |                       
             |___/                        
    Bayesian rate analysis of conserved
       non-coding genomic elements

#
# Welcome to PhyloAcc -- Bayesian rate analysis of conserved non-coding genomic elements.
# Version 2.1.1 released on December 6, 2022
# PhyloAcc was developed by Zhirui Hu, Han Yan, Gregg Thomas, Tim Sackton, Scott Edwards, and Jun Liu
# Citation:      https://doi.org/10.1093/molbev/msz049
# Website:       https://phyloacc.github.io
# Report issues: https://github.com/phyloacc/PhyloAcc
#
# The date and time at the start is: 03.21.2023 | 22:14:59
# Using Python version:              3.10.6
#
# The program was called as:         /user/home/tk19812/.conda/envs/phyloacc-env/bin/phyloacc.py -a CNEE.fasta -b CNEE.bed -i id-subset.txt -m NeutralModelPhyloACC.mod -o phyloacc -t Htel;Hcly;Hhor;Hhec;Hher;Hpet;Herd;Heet;Hlat;Hhim;Hcha;Hper;Hert;Hdem;Hric;Hleu;Hsar;Hant;Hele;Hcon;Hhew;Hsap;Haoe;Hheb;Hhie;Hxan;Hdor;Hege;Hbur;Hwal;Hnat;Hbes;Hnum;Hism;Heth;Hpar;Helv;Hatt;Hhel;Hmel;Hcyd;Hpac;Hheu;Htim -c Pdid;Dpha;Diul;Ptel;Djun;Avfl;Avpe;Avcr;Eisa;Elam;Evib;Eali;Etal;Elyb -g Dple;Bany;Mcin;Jcoe;Smor -n 4 -batch 5 -j 10 -part . --overwrite
#
# -----------------------------------------------------------------------------------------------------------------------------
# INPUT/OUTPUT INFO:
# Alignment file:                            /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/CNEE.fasta
# Bed file:                                  /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/CNEE.bed
# Tree/rate file (mod file from phyloFit):   /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/NeutralModelPhyloACC.mod
# Output directory:                          phyloacc
# PhyloAcc run directory:                    phyloacc/phyloacc-job-files
# Log file:                                  phyloacc.log
# -----------------------------------------------------------------------------------------------------------------------------
# DEPENDENCY PATHS:
# Program                                    Specified Path
# PhyloAcc                                   PhyloAcc-ST
# -----------------------------------------------------------------------------------------------------------------------------
# SPECIES GROUPS:
# Group                                      Species                                           
# Targets (-t)                               Htel;Hcly;Hhor;Hhec;Hher;Hpet;Herd;Heet;Hlat;Hhim;Hcha;Hper;Hert;Hdem;Hric;Hleu;Hsar;Hant;Hele;Hcon;Hhew;Hsap;Haoe;Hheb;Hhie;Hxan;Hdor;Hege;Hbur;Hwal;Hnat;Hbes;Hnum;Hism;Heth;Hpar;Helv;Hatt;Hhel;Hmel;Hcyd;Hpac;Hheu;Htim
# Conserved (-c)                             Pdid;Dpha;Diul;Ptel;Djun;Avfl;Avpe;Avcr;Eisa;Elam;Evib;Eali;Etal;Elyb
# Outgroups (-g)                             Dple;Bany;Mcin;Jcoe;Smor
# -----------------------------------------------------------------------------------------------------------------------------
# CLUSTER OPTIONS:
# Option                                     Setting                                           
# Partition(s)                               .
# Number of nodes                            1
# Max mem per job (gb)                       4
# Time per job                               1:00:00
# -----------------------------------------------------------------------------------------------------------------------------
# OPTIONS INFO:
# Option                                     Current setting                                   Current action
# -i:                                        /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/id-subset.txtOnly loci names specified in this file will be tested.
# -r:                                        st                                                All loci will be run with the species tree model of PhyloAcc
# -low-scf:                                  0.5                                               The sCF value to consider as low
# -s:                                        NA                                                Branch sCF values will be averaged for each locus and if the average is below the min sCF it will be run with the gene tree model
# -gt:                                       1                                                 The number of MCMC steps between gene tree sampling for the gene tree model. -mcmc will be scaled up by this number.
# -burnin:                                   500                                               This number of steps in the chain will discarded as burnin (-burnin*-gt)
# -mcmc:                                     1000                                              The number of steps in each chain  (-burnin*-mcmc)
# -chain:                                    1                                                 The number of chains to run
# Loci per batch (-batch)                    5                                                 PhyloAcc will run this many loci in a single command.
# Current processes (-n)                     4                                                 This interface will use this many processes.
# Jobs (-j)                                  10                                                PhyloAcc will submit this many jobs concurrently.
# Processes per job (-p)                     1                                                 Each job will use this many processes.
# --summarize                                False                                             PhyloAcc batch files will be generated and written to the job directory specified above.
# --theta                                    False                                             A species tree with branch lengths in coalescent units will NOT be estimated.
# --overwrite                                True                                              PhyloAcc will OVERWRITE the existing files in the specified output directory.
# --quiet                                    False                                             Time, memory, and status info will be printed to the screen while PhyloAcc is running.
class
loop
# -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Date        Time      Current step                            Status                                  Elapsed time (s)    Step time (s)   Current mem usage (MB)   Virtual mem usage (MB)
# -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# 03.21.2023  22:14:59  Reading input species tree              Success: rooted tree read               0.26929             0.00206         54.02734                 8725.70703          
# INFO: Tree has 63 tips and 64 internal nodes
# Tree read from mod file:                   (((((((((((((((((Hcyd:1.0,Hpac:1.0):0.6804018036202207,(Hheu:1.0,Htim:1.0):2.045115939943643):1.0775727658119176,Hmel:1.0):2.5845526956832514,(((Helv:1.0,Hpar:1.0):2.520043184039743,(Hhel:1.0,Hatt:1.0):1.6674287241514767):0.5943716240945114,Heth:1.0):1.2579577656795362):1.5826137755600203,(((Hism:1.0,Hnum:1.0):0.853116661526068,Hbes:1.0):1.6616901002440279,Hnat:1.0):0.8026419610317491):5.907942637561168,((Hbur:1.0,Hege:1.0):0.5681373584773383,Hwal:1.0):6.010212433245333):1.9778295292364656,(((Hxan:1.0,Hdor:1.0):3.166826849161951,Hhie:1.0):1.3420656396541348,Hheb:1.0):5.4781616843381205):1.7937053301035975,Haoe:1.0):1.6545372896321975,((((((((Hcon:1.0,Hele:1.0):1.9056103408781728,(Hsap:1.0,Hhew:1.0):0.0723574664456409):3.375750647347987,Hant:1.0):3.5979134179031544,(Hsar:1.0,Hleu:1.0):3.3139879260346636):6.661203729637612,((Hdem:1.0,Hert:1.0):6.733401891837363,Hric:1.0):0.051381042893976936):0.7606409147268492,(Hper:1.0,Hcha:1.0):6.734195227856759):5.292321714137381,((Hhor:1.0,Hcly:1.0):4.025034532470522,Htel:1.0):4.687512705547947):0.10896937344285544,(((((Heet:1.0,Hhim:1.0):0.024118267198463855,Hlat:1.0):0.1336354728775984,(Herd:1.0,Hpet:1.0):0.6285960561543454):3.400604306532743,Hher:1.0):2.414880217442409,Hhec:1.0):3.087619891558481):5.982172686827108):5.014876114295957,(((Elyb:1.0,Etal:1.0):2.442904994703678,Eali:1.0):1.4989967616218582,((Evib:1.0,Elam:1.0):4.241425963841232,Eisa:1.0):5.257831552909168):5.784125852827229):4.570218310838701,(((Avcr:1.0,Avfl:1.0):0.3414240414621923,Avpe:1.0):4.574779159236899,Djun:1.0):5.923550662892739):3.107797333591328,(((Ptel:1.0,Diul:1.0):2.146008485910313,Dpha:1.0):1.7448773782809164,Pdid:1.0):1.0411040264749596):2.3561332669746173,Smor:1.0):3.4206447730944882,(Mcin:1.0,Jcoe:1.0):1.4339057396001638):0.8566891380633856,Bany:1.0):1.0),Dple:1.0);
# 03.21.2023  22:14:59  Reading species/branch groups           Success                                 0.27134             0.00169         54.02734                 8725.70703          
# 03.21.2023  22:14:59  Detecting compression of seq file       Success: No compression detected        0.27162             0.00011         54.02734                 8725.70703          
# 03.21.2023  22:14:59  Reading input FASTA                     Success: 0 seqs read                    0.27209             0.00034         54.02734                 8725.70703          
# 03.21.2023  22:14:59  Reading locus IDs                       Success: 1 IDs read                     0.27233             0.0001          54.02734                 8725.70703          
# 03.21.2023  22:14:59  Detecting compression of bed file       Success: No compression detected        0.27253             7e-05           54.02734                 8725.70703          
# 03.21.2023  22:14:59  Reading input bed file                  Success: 0 loci read                    0.27273             8e-05           54.02734                 8725.70703          
# 03.21.2023  22:14:59  Partitioning alignments by locus        Success: 0 alignments partitioned       0.27288             3e-05           54.02734                 8725.70703          
# 03.21.2023  22:14:59  Calculating alignment stats             In progress...                          Traceback (most recent call last):
  File "/user/home/tk19812/.conda/envs/phyloacc-env/bin/phyloacc.py", line 134, in <module>
    globs = SEQ.alnStats(globs);
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/seq.py", line 359, in alnStats
    globs['avg-aln-len'] = PC.mean(sorted_aln_lens);
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/core.py", line 170, in mean
    return sum(data) / len(data);
ZeroDivisionError: division by zero

Not sure why, I made a script to concatenate the data, not sure if I introduced some errors. Files are attached

CNEE.bed.txt CNEE.fasta.txt

Can you please help me? Thank you a lot F

francicco commented 1 year ago

Soved that, but now I've got this:

# 03.21.2023  22:48:15  Reading species/branch groups           Success                                 0.20233             0.00259         54.05859                 8725.70703          
# 03.21.2023  22:48:15  Detecting compression of seq file       Success: No compression detected        0.20281             0.00025         54.05859                 8725.70703          
# 03.21.2023  22:48:15  Reading input FASTA                     Success: 63 seqs read                   0.20372             0.00072         54.19141                 8725.83984          
# 03.21.2023  22:48:15  Detecting compression of bed file       Success: No compression detected        0.20407             0.00011         54.19141                 8725.83984          
# 03.21.2023  22:48:15  Reading input bed file                  Success: 6 loci read                    0.20439             0.00014         54.19141                 8725.83984          
# 03.21.2023  22:48:15  Partitioning alignments by locus        Success: 6 alignments partitioned       0.20482             0.00025         54.39844                 8725.83984          
# 03.21.2023  22:48:15  Calculating alignment stats             Success: 6 alignments processed         0.23773             0.0327          54.6875                  8733.91406          
# 03.21.2023  22:48:15  Writing: phyloacc-aln-stats.csv         Success: align stats written            0.23841             0.00041         54.6875                  8733.91406          
# 03.21.2023  22:48:15  Writing PhyloAcc job files              Success: 6 jobs written                 0.25254             0.01392         54.74609                 8733.91406          
# 03.21.2023  22:48:15  Writing Snakemake file                  Success: Snakemake file written         0.25468             0.00192         54.74609                 8733.91406          
# 03.21.2023  22:48:15  Writing Snakemake config file           Success: Snakemake config written       0.25513             0.00024         54.74609                 8733.91406          
# 03.21.2023  22:48:15  Writing Snakemake cluster profile       Success: Snakemake profile written      0.25551             0.00021         54.74609                 8733.91406          
# 03.21.2023  22:48:15  Generating summary plots                In progress...                          Traceback (most recent call last):
  File "/user/home/tk19812/.conda/envs/phyloacc-env/bin/phyloacc.py", line 175, in <module>
    PLOT.genPlots(globs);
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/plot.py", line 69, in genPlots
    targets, conserved, outgroups = TREE.categorizeBranches(globs, globs['st']);
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/tree.py", line 832, in categorizeBranches
    d1, d2, = tree.desc[node];
ValueError: not enough values to unpack (expected 2, got 1)
gwct commented 1 year ago

Oh I know about this error and I keep meaning to fix it. But it happens after the job files have all been created, so you should be able to just ignore it. You just won't get the nice page with summary plots. I'll see if I can get to this soon, but if you look in the phyloacc-job-files/cfgs/ folder, there should be one config file for every batch you wanted, and the snakemake file should be there as well (phyloacc-job-files/snakemake/run_phyloacc.smk). You should be able to execute that file to run all the phyloacc batches. See here if you need help with that: https://phyloacc.github.io/readme.html#snakemake

francicco commented 1 year ago

Hi @gwct,

Thanks for your reply. Yes the files are there, I realised that the other day, but for some reason, I get a segmentation fault when I execute them.

Loading input data and running parameters......
Loading program configurations from /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/phyloacc/phyloacc-job-files/cfgs/1-st.cfg......
Unknown parameter: THIN
Loading phylogenetic tree from /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/NeutralModelPhyloACC.mod......
./PhyloAcc.sh: line 28: 276842 Segmentation fault      PhyloAcc-ST /user/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/CNEEselectionAnalysis/phyloacc/phyloacc-job-files/cfgs/1-st.cfg

I changed the tree in the model as the output from astral from single copy genes, maybe there's something wrong with that?

ALPHABET: A C G T
ORDER: 0
SUBST_MOD: SSREV
BACKGROUND: 0.011709 0.488291 0.488291 0.011709
RATE_MAT:
  -2.355880    0.581503    1.754731    0.019646
   0.013945   -0.967485    0.911461    0.042079
   0.042079    0.911461   -0.967485    0.013945
   0.019646    1.754731    0.581503   -2.355880
TREE: (((((((((((((((((Hcyd:1.0,Hpac:1.0):0.6804018036202207,(Hheu:1.0,Htim:1.0):2.045115939943643):1.0775727658119176,Hmel:1.0):2.5845526956832514,(((Helv:1.0,Hpar:1.0):2.520043184039743,(Hhel:1.0,Hatt:1.0):1.6674287241514767):0.5943716240945114,Heth:1.0):1.2579577656795362):1.5826137755600203,(((Hism:1.0,Hnum:1.0):0.853116661526068,Hbes:1.0):1.6616901002440279,Hnat:1.0):0.8026419610317491):5.907942637561168,((Hbur:1.0,Hege:1.0):0.5681373584773383,Hwal:1.0):6.010212433245333):1.9778295292364656,(((Hxan:1.0,Hdor:1.0):3.166826849161951,Hhie:1.0):1.3420656396541348,Hheb:1.0):5.4781616843381205):1.7937053301035975,Haoe:1.0):1.6545372896321975,((((((((Hcon:1.0,Hele:1.0):1.9056103408781728,(Hsap:1.0,Hhew:1.0):0.0723574664456409):3.375750647347987,Hant:1.0):3.5979134179031544,(Hsar:1.0,Hleu:1.0):3.3139879260346636):6.661203729637612,((Hdem:1.0,Hert:1.0):6.733401891837363,Hric:1.0):0.051381042893976936):0.7606409147268492,(Hper:1.0,Hcha:1.0):6.734195227856759):5.292321714137381,((Hhor:1.0,Hcly:1.0):4.025034532470522,Htel:1.0):4.687512705547947):0.10896937344285544,(((((Heet:1.0,Hhim:1.0):0.024118267198463855,Hlat:1.0):0.1336354728775984,(Herd:1.0,Hpet:1.0):0.6285960561543454):3.400604306532743,Hher:1.0):2.414880217442409,Hhec:1.0):3.087619891558481):5.982172686827108):5.014876114295957,(((Elyb:1.0,Etal:1.0):2.442904994703678,Eali:1.0):1.4989967616218582,((Evib:1.0,Elam:1.0):4.241425963841232,Eisa:1.0):5.257831552909168):5.784125852827229):4.570218310838701,(((Avcr:1.0,Avfl:1.0):0.3414240414621923,Avpe:1.0):4.574779159236899,Djun:1.0):5.923550662892739):3.107797333591328,(((Ptel:1.0,Diul:1.0):2.146008485910313,Dpha:1.0):1.7448773782809164,Pdid:1.0):1.0411040264749596):2.3561332669746173,Smor:1.0):3.4206447730944882,(Mcin:1.0,Jcoe:1.0):1.4339057396001638):0.8566891380633856,Bany:1.0):1.0),Dple:1.0);

Any idea why? Do I need to include the ancestral node in both the tree and the alignment? Thanks a lot F

francicco commented 1 year ago

I tested another tree, a ML one and it returns the same error. F

gwct commented 1 year ago

The tree in the .mod file should be the tree with branch lengths in terms of neutral substitutions that you got from phyloFit. The coalescent/ASTRAL tree should be provided as another option in the PhyloAcc config files. If you had the interface estimate that tree, then it should be added automatically. If you had this tree beforehand, you should specify the path to it when you run the interface with the -l option. But based on your call to PhyloAcc it looks like you're just running the species tree model (no -r option specified, default mode is the species tree model), which doesn't need a coalescent tree.

The error you're seeing though may be related to #45 and #48: are your internal nodes in the tree in the mod file labeled? This is required but not well documented currently, unfortunately. There's a method for labeling the tree here: https://github.com/xyz111131/PhyloAcc_v1#trouble-shooting The ancestral nodes only need to be in the tree, not the alignment.

francicco commented 1 year ago

OOOh, I see. I'll fix these problems and I'll let you know! Thanks a lot! F

francicco commented 1 year ago

It's gonna be a problem. I generate the phyfit model with a subset of species to speed it up. with the whole data It was running for 3-weeks. I'm trying to see if I can update the model with --init-model. Any other ideas or suggestions about that? Cheers F

gwct commented 1 year ago

Hmm, I'm not sure what you mean by its gonna be a problem. If you've already run phyloFit you should be able to just label that tree that you already generated. Does that make sense?

francicco commented 1 year ago

In theory, but I generated the phylofit model using a subset of the species. I have a lot of genomes that are 5mya old. So I didn't consider them in computing the phylofit model. Now the tree in my final model has only the species I used, and not the 63 of them. Makes sense?

gwct commented 1 year ago

Ah yes, I see. Yea, your tree has to be from the full set (sorry I didn't answer that question in #47 earlier). I don't think it should run for 3 weeks. @tsackton or @xyz111131 any ideas why phyloFit would take that long?

tsackton commented 1 year ago

phyloFit should be quite fast as all it is doing is estimating substitution rates on a fixed tree. What is the phyloFit command you are running that is taking a long time?

Also, what is your input alignment - is this for example an SS file of 4d sites? Or something else?

francicco commented 1 year ago

it's a maf 4d sites. And this is the command:

phyloFit --tree $TREE --subst-mod SSREV --out-root neut_ --msa-format MAF --sym-freqs --log phyloFit_${SP}_${ITER}.log neut4d_input_$SP.maf

tsackton commented 1 year ago

Is there anything in the log file? I've never seen phyloFit take 3 weeks to run but without an error message it is tricky to figure out why it might be stalling.

tsackton commented 1 year ago

The one thing I might try is converting the maf to SS using msa_view and then trying again and seeing if that helps.

francicco commented 1 year ago

I don't have the logs anymore, I'll try it again also converting to SS. At the moment I'm reconverting the hal into maf. I hope it will finish soon, and I'll let you know Thanks a lot F

francicco commented 1 year ago

If I try to convert it with msa_view it says that the maf needs to be sorted. If I use mafSorter it asks me a reference sequence, but in my maf I have multiple ref sequence, and if I five it one it returns segmentation fault. I'm kind of stuck...

francicco commented 1 year ago

Hi @tsackton

would it be ok if I run independently for each chromosome and at the end I average across the models with phyloBoot. You did it for the conserved elements, might work for the neutral model as well, right? F

tsackton commented 1 year ago

Yes, that should be fine

francicco commented 1 year ago

I set a temporary model for phyloacc, the plotting now works. Still returns me a segmentation fault. I think I didn't quite get how to give it the coalescent tree. I tried with -l Astral.tree + --theta, only -l Astral.tree or only --theta, but still get seg fault What am I doing wrong?

informative-sites-hist informative-sites-frac-hist avg-seq-len-nogap-hist aln-len-hist

variable-informative-sites input-species-tree

francicco commented 1 year ago

Yes, that should be fine

Great @tsackton, thanks a lot! F

gwct commented 1 year ago

You only need a coalescent tree if you run the gene tree model (-r gt or -r adaptive). In the log file you sent at the beginning of the thread, you didn't specify -r so it will run in the default species tree mode (st). If that's what you want, then you don't need to worry about the coalescent tree.

If you DO want to run the gene tree model, you will have to set -r to gt or adaptive. Note that gt will take a very long time depending on how many elements and computational resources you have. With adaptive, it will pick only the elements with the lowest average site concordance factors (sCF) to run with the gene tree model. sCF cutoffs can be controlled by -scf and -s. And in this case, you could either give the tree with branch lengths in coalescent units directly as input with -l, or if none is given you will have to select --theta and the snakemake workflow will select up to 5000 of your longest elements to generate it for you (via IQ-tree gene trees and ASTRAL). This would also add to the runtime of the pipeline. Either -l OR --theta, not both. Does that make sense?

Regarding the segmentation fault: did you add labels to the internal nodes of your tree? Sorry if you said you did already and I missed it.

francicco commented 1 year ago

Thanks a lot! I a bit slow :P

Yes, I thought the tree was fully labeled, but it's actually not in some of them. I'll correct it!

Thanks really a lot! F

francicco commented 1 year ago

Does also the coalescent tree need to be annotated? I guess so, right? F

gwct commented 1 year ago

I don't recall if that is a requirement, but it probably won't hurt.

francicco commented 1 year ago

Hi @gwct,

The first batch of a test run has finished. I've been reading the manual for some insight on how to interpret the results, which seem quite complicated. Lots of information, and most of them are quite interesting but I'm still confused about which one to look at. Can you clarify a bit how to extract the info I most need? Thanks a lot F

francicco commented 1 year ago

Hi @tsackton,

So, After I averaged on chrs I rerun phyloFit with --subst-mod SSREV --sym-freqs using --init-model ave_model on the full dataset. That run for 2 days. then I corrected the resulting model with the base composition. I guess it should be correct now, right? F

gwct commented 1 year ago

So the output files listed on the website are those for each individual PhyloAcc run (https://phyloacc.github.io/readme.html#output). However, if you run the phyloacc_post.py script, it should compile almost everything for you in a single table. Basically, you want to look for loci in which M1 fits better than M0 or M2, which is done when the Bayes factors are large enough. Those loci are accelerated specifically in your target branches (though not necessarily exclusively, but a model that restricts acceleration to these lineages fits better). Then you can look at which lineages specifically are accelerated in a given locus, and the rates for those loci as well.

The output file from phyloacc_post.py contains a lot, but also has summaries and explanations of each column in the file header. So take a look at that and then let me know if you have any more specific questions!

francicco commented 1 year ago

Hi @gwct,

So, I analysed 13130. Of them, 2994 have M1 in the best.fit.model. If I understood correctly within the accelerated loci there could be taxa that do not belong to my target species, correct? Therefore I have to check in the column "accel.lineages.m1" and exclude loci that have also non-target taxa in it if I want taxa that exclusively are accelerated in my target. Is my interpretation correct?

Also, my input was made of 227,778 loci, and only 13,130 were analysed. I guess phyloacc tests only the informative sites, correct? Thanks a lot!

F

gwct commented 1 year ago

Yes, it doesn't really make sense to run PhyloAcc (or any rate estimation program) on loci without any informative sites since there will inevitably be no accelerations and M0 will be favored. And since we're working on conserved elements, it is definitely possible for many loci to not have any variation at all. In another analysis I've talked with someone about, there are no informative sites in any of their loci, though I don't know how many loci they had.

Though this also depends on the number of species and their divergence times from one another -- as each increases I would expect the number of loci with informative sites to increase as well. I remember you saying you had 60 some species? I would probably expect more loci with informative sites in this case. As a comparison, with the bird data (284001 loci with 43 species), there are only 252 loci without informative sites. What's the approximate divergence time of oldest split in your tree? If it is more than a few million years (just guessing here), then this may be something I would want to look at in the code to confirm informative sites are calculated correctly. The other possibility is that you over-trimmed your alignments. If you use something like G-Blocks, for instance, the default trimming settings oftentimes remove phylogenetic informative sites.

francicco commented 1 year ago

I'm sorry, my bad! the script that should concatenate the data had a but and it returned the wrong CEs, I fixed it. Now:

# 04.03.2023  14:55:31  Reading input bed file                  Success: 227778 loci read               3.82391             0.3684          1774.16406               8392.47656          
# 04.03.2023  14:55:36  Partitioning alignments by locus        Success: 227778 alignments partitioned  9.04601             5.22179         4549.83203               11171.12891         
# 04.03.2023  14:57:31  Calculating alignment stats             Success: 227778 alignments processed    124.17412           115.12702       4921.36719               11538.97266         
# INFO: 213 loci have 0 informative sites and will be removed from the analysis.
# 04.03.2023  14:57:32  Writing: phyloacc-aln-stats.csv         Success: align stats written            125.48634           1.30807         4921.40625               11546.97266    

It will run for a week or so depending on how the cluster is free.

Interestingly enough there is a similar amount of CNEEs 227778 vs 284001 with similarly discarded loci 213 vs 252.

I didn't trim the alignments. They are as they come directly from the alignment. To give you some context, the branch I'm interested in is ~12 mya and most of the taxa are in ~28 mya with the root at ~90.

I'll let you know as soon as the run is finished.

Thanks a lot a sorry again for the confusion. F

francicco commented 1 year ago

Hi @gwct,

The analysis has finished! And it looked faster than I thought. You probably need to guide me through the results so I can understand better what's going on, if you don't mind. I'd really appreciate it.

So phyloacc analysed 227,565 intergenic CEs (I put on the side in intronic ones atm). If I look at the elem_lik.txt 113,008 elements evolved under M0 (~50%), 65,216 under M1 (~29%), and 49,341 under M2 (~22%). Now I'd be interested in the CEs evolving under M1, but I'm confused by the fields conserved.lineages.m1, accel.lineages.m1, conserved.lineages.m2, accel.lineages.m2. Specifically, I don't if M1 CEs should not contain any other species not in my target group.

Also, if i'd want to reproduce this figure image how would I generate the branch length? Did you guys used iqtree on those elements using the species tree phylogeny?

Is there anything else that could be interesting to look at? Cheers F

francicco commented 1 year ago

Oh, just remembered.

I'm also getting this error

# 04.05.2023  14:12:08  Generating summary plots                In progress...                          Traceback (most recent call last):
  File "/user/home/tk19812/.conda/envs/phyloacc-env/bin/phyloacc_post.py", line 494, in <module>
    globs = PLOT.genPlotsPost(globs);
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/phyloacc_lib/plot.py", line 394, in genPlotsPost
    plt.hist(bf2s, color=PC.coreCol(pal="wilke", numcol=1, offset=2)[0], bins="sturges", edgecolor="#999999");
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/matplotlib/pyplot.py", line 2573, in hist
    return gca().hist(
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/matplotlib/__init__.py", line 1423, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/matplotlib/axes/_axes.py", line 6737, in hist
    m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
  File "<__array_function__ internals>", line 180, in histogram
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/numpy/lib/histograms.py", line 793, in histogram
    bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/numpy/lib/histograms.py", line 396, in _get_bin_edges
    first_edge, last_edge = _get_outer_edges(a, range)
  File "/user/home/tk19812/.conda/envs/phyloacc-env/lib/python3.10/site-packages/numpy/lib/histograms.py", line 315, in _get_outer_edges
    raise ValueError(
ValueError: supplied range of [-682.54444, inf] is not finite
gwct commented 1 year ago

Correct, loci that fit M1 best should not have non-target lineages listed as accelerated in the accel.lineages.m1 column. M1 specifically restricts other lineages from being in the accelerated state.

Which model results you want to use depend on the question you are asking: do you want to identify loci in which the best fitting model is one that restricts accelerations to your target lineages? If so, then use the loci that fit M1. Or do you want to know the loci for which only your target branches had acceleration given a model that allows acceleration everywhere? If that's the case, then you could take the loci that fit M2 and look for ones that only have your target lineages in the accel.lineages.m2 column. We think that if you have a specific hypothesis to test with target lineages, then it is better to use the M1 results (see the section on page 32-33 of the pre-print for PhyloAcc-GT).

For the figures, I'm not sure exactly how those were created. I think the code is similar to what is here, but that is likely based on old output formats. I usually make figures like that in R with something like ggtree or phytools .

I hope that is helpful.

gwct commented 1 year ago

For the error, it looks like matplotlib or numpy can't handle the inf value. I'm not sure why an inf would be there so I'll have to look into it more.

francicco commented 1 year ago

Cool, understood. I took M1, so accelerated only in my target, & where the stem of my target species IS present in accel.lineages.m1. That should indicate the loci that are accelerated only at the stem of my target species. Correct?

I'll look at those scripts. I was just curious how those topologies were generated, because, as conserved elements, they should have limited phylogenetic signal.

Cheers F

gwct commented 1 year ago

Oh, right you mentioned the branch lengths. That is definitely confusing. So as I understand it, you would multiply the branch length in your input tree, presumably in units of relative number of neutral substitutions, by the posterior median of the conserved or accelerated substitution rates for the model you're plotting (e.g. conserved.rate.m1 and accel.rate.m1 for M1).

So if I was plotting a tree for a locus under M1, I would multiply all the branch lengths in my input tree that are in the accel.lineages.m1 column by the number in the accel.rate.m1 column and all the branches in conserved.lineages.m1 by conserved.rate.m1. And then repeat for the next locus using the values in those columns for that locus. Does that make sense?

francicco commented 1 year ago

Oh, ok! I'd never have thought about that! Thanks a lot! F

francicco commented 1 year ago

Hi @gwct,

I'm summing up the results, which I think look pretty good. There are over 2500 elements that seem to be accelerated, with several GO term enriched. I was wondering if with the current analysis would be possible to do some comparisons between the target branch and other internal branches, for example, would it be correct if I take the elements that have M2 as the best model, and take the lineages in accel.lineages.m2 field? I'd like to establish if the number of accelerated elements, at my target branch, is more to be expected than by chance. In the ratite paper, you guys do a permutation test, but I think my case is a bit different since I'm not looking for convergence.

Also, is there a quick way to spot ultraconserved elements from elem_lik.txt, I guess they at least needs to be under M0 model.

Thanks a lot F

gwct commented 1 year ago

Sorry for the delay. I think if you compare to lineages that are accelerated under M2 you'd also have to only look at loci where your target lineages are accelerated in M2. In other words, looking at M2 only, pick out loci where your target lineages are accelerated, then pick another random set of lineages and count the same. The other option would be to pick your random set of lineages and re-run PhyloAcc with those lineages as targets. Then compare that to the results when your lineages of interest were targets using M1. I think both methods are fine. The first one (using M2) is probably faster, and you could easily do a lot of random sampling of lineages to build an empirical null distribution of expected number of times a lineage or set of lineages is accelerated. Though then you may have a different definition of accelerated lineages for your targets for different analyses: identifying loci of interest as those that M1 fits best, and identifying whether these lineages are accelerated more often than others using M2.

For ultraconserved elements I'm not too sure. Maybe you could make a distribution of conserved.rate.m0 for all loci where M0 fits best and pick loci that fall in the lower end of that distribution. But I'm not 100% sure this does what you wanted.

francicco commented 1 year ago

No worries @gwct,

I think that for this paper might be enough, the referee should be happy, but it's something I'm going to explore more in future projects. It seems a very interesting approach. Cheers F

gwct commented 1 year ago

Good luck!