Closed Silverfoxcome closed 3 years ago
Hi! I'm back to say that probably the problem was with my headers xD
I changed them from:
>rps12_join{[69317:69430](-),[92912:93143](-),[92343:92371](-)}
to something like:
>lcl|mexicana.1_prot_rps12.zmex [mexicana]
And get_phylomarkers worked!
Now I kind of understand why it told me that:
Input f?aed files do not contain the same number of strains and a single instance for each strain
But if you could explain me a bit more, I'll be very grateful :)
I should have put more attention to the headers like you recommend in the manual!
Thanks a lot!
Hi, I haven't run it with your sequences, but for clusters produced by GET_HOMOLOGUES-EST you should add flag -f EST. The default (STD) works for GET_HOMOLOGUES, which produces different headers:
-f
Hope this helps, Bruno
Also make sure that the header IDs for each sequence are unique. GET_PHYLOMARKERS will complain otherwise. Let us know if you need any further assistance. Best
Hi, I haven't run it with your sequences, but for clusters produced by GET_HOMOLOGUES-EST you should add flag -f EST. The default (STD) works for GET_HOMOLOGUES, which produces different headers:
-f GET_HOMOLOGUES cluster format <STD|EST> [default: STD]
Hope this helps, Bruno
I didn't do my analysis with the flag -f! I will add it and repeat my analysis. Thanks for the help! :D
Also make sure that the header IDs for each sequence are unique. GET_PHYLOMARKERS will complain otherwise. Let us know if you need any further assistance. Best
Yes! I changed the headers to make them unique and then GET_PHYLOMARKERS worked as in my previous analysis! Thanks for the help!!
I'm working with chloroplast genomes and I want to infer phylogenetic relationships between ~ 20 sequence. Some are from the same species and other aren't but all of them are from the same genus, Zea.
Doing the analysis with the best annotated chloroplast genomes from NCBI, I got aprox 65 clusters. And applying GET_PHYLOMARKERS, I only get 10 (sometimes 8) top phylo markers (of which, the programs tells me, 4 are the best). In your opinion, is this number (10 or 8) enough to make a good phylogenetic analysis for chloroplast genomes of plants of the same genus (some from different species and 4 from the same species)?
Thanks a lot for all your help!
Some ideas: 1) Check on what grounds were the other genes removed, perhaps some thresholds can be relaxed 2) On our experience with Brachypodium plastids we noticed cp genes are very conserved and sometimes have no signal. That might explain your low number of genes. In that case it might be best to concatenate all single-copy loci and start from there. 3) If the assemblies allow it you can also do whole genome alignments so that no signal is lost.
Any more ideas @vinuesa ? Bruno
Some general comments on the issue:
Some ideas:
1. Check on what grounds were the other genes removed, perhaps some thresholds can be relaxed 2. On our experience with _Brachypodium_ plastids we noticed cp genes are very conserved and sometimes have no signal. That might explain your low number of genes. In that case it might be best to concatenate all single-copy loci and start from there. 3. If the assemblies allow it you can also do whole genome alignments so that no signal is lost.
Any more ideas @vinuesa ? Bruno
Some general comments on the issue:
1. GET_PHYLOMARKERS was designed to perform phylogenetic analyses on multilocus core-genome data for different species. It was not designed for population trees. You can get some descriptive statistics, running in the pop-gen mode -R 2 2. The pipeline will by default filter out/discard CDS alignments for which the resulting gene tree has a mean bipartition support value < 70%. You can relax this threshold by setting -m to a lower value (i.e. 0.6) 3. You could also use a higher -k value kde stringency (0.7-1.6 are reasonable values; less is more stringent), for example 1.6 4. I would estimate the phylogeny with the top markers selected with the slightly relaxed cutoff values 5. Just concatenating everything is certainly another option given the low level of polymorphism found among cpDNA sequences, particularly since I guess you won't expect too much recombination. Hope this helps, cheers
Hi!! Thank you both so much for your ideas for my thesis project!! I really value a lot the ideas you are giving me to make a better phylogenetic analysis!
Most of the chloroplast genomes I'm using are from different species, only 4 are from the same species (but all are from the same genus, Zea). That's why I thought I could use the mode -R 1. I'll run the mode -R 2 like you recommend me.
In that case it might be best to concatenate all single-copy loci and start from there.
- Just concatenating everything is certainly another option given the low level of polymorphism found among cpDNA sequences, particularly since I guess you won't expect too much recombination.
Then I'll try to concatenate all the CDS from the 63 cluster homologous that GET_HOMOLOGUES-EST gaves me :D I've read about making a phylogenetic analysis with all the concatenated core genes of the cp genomes but I was not sure if I should do it.
Would you recommend me to also try and concatenate all the gene sequences instead of only the CDS? That was something I read in a paper but I was not sure of using the whole sequence of the gene instead of the CDS (only 23 genes have introns).
1. Check on what grounds were the other genes removed, perhaps some thresholds can be relaxed 2. The pipeline will by default filter out/discard CDS alignments for which the resulting gene tree has a mean bipartition support value < 70%. You can relax this threshold by setting -m to a lower value (i.e. 0.6) 3. You could also use a higher -k value kde stringency (0.7-1.6 are reasonable values; less is more stringent), for example 1.6 4. I would estimate the phylogeny with the top markers selected with the slightly relaxed cutoff values
I'm my logs I see that after the 63 clusters are determined:
the phi recombination test says that "WARNING: the 63 alignments with too few informative sites for the Phi test to work on ..." so I suppose that all past for lack of informative sites ToT
A lot of genes are discarded (38) in the part of the parallel IQ-TREE runs to estimate gene trees.
>>> wrote file IQT_DNA_gene_tree_Tmedium_stats.tsv ...
>>> wrote file iqtree_output_files.tgz ...
[20:53:10] # running compute_MJRC_tree treefile I ...
>>> Wrote file IQT_MJRC_tree.nwk ...
[20:53:10] # counting branches on 65 gene trees ...
>>> wrote file no_tree_branches.list ...
[20:53:11] >>> will remove 423_lcl-NC_001666.2_cds_NP_043025.2_23_cdnAln.fasta* because it has < 5 branches
[20:53:11] >>> will remove 425_lcl-NC_001666.2_cds_NP_043024.1_22_cdnAln.fasta* because it has < 5 branches
[20:53:11] >>> will remove 430_lcl-NC_001666.2_cds_NP_043049.1_47_cdnAln.fasta* because it has < 5 branches
[20:53:11] >>> will remove 434_lcl-NC_001666.2_cds_NP_043033.1_31_cdnAln.fasta* because it has < 5 branches
.
.
.
[20:53:12] >>> will remove 528_lcl-NC_001666.2_cds_NP_043042.1_40_cdnAln.fasta* because it has < 5 branches
>>> WARNING: there are 38 trees with < 1 internal branches (no real trees) that will be discarded ...
Perhaps I should relax this threshold to T low?
[20:53:16] # making dir kde_ok/ and linking 27 selected files into it ...
[20:53:16] # labeling 27 gene trees in dir kde_ok/ ...
[20:53:17] # computing tree support values ...
[20:53:19] # writing summary tables ...
>>> wrote file sorted_aggregated_support_values4loci_ge70perc.tab ...
[20:53:19] # making dir top_10_markers_ge70perc and moving 10 top markers into it ...
That's how I'm left with only 10 top genes :')
I'll lower all these threshold and see what I get :D
- If the assemblies allow it you can also do whole genome alignments so that no signal is lost.
I did that but I got low bootstrap values, but I think that the absence of intermediate taxa could be a problem (also here, with the CDS) because I have too little genome samples and there is a gap in my phylogeny ... (an advisor told me something along those lines, that I needed to add more cp genomes to my analysis ... but I have already added all that are in the NCBI database plus the 3 cp genomes I assembled (they are like 20 in total). Or perhaps, like you said, my CDS's are just very conserved like is characteristic of chloroplast genomes.
But I will still add the whole cp genome alignment analysis and discuss that part :)
I was also advised to look for the best cp markers and that's how I found get_homologues-est
and get_phylomarkers
:D
Even if, at the end, I don't get good bootstrap values for my tree, I want to make a good phylogenetic analysis and perhaps identify some good cp markers for future works. I think this could add some value to my thesis project :')
Thank you so much to both of you for all your advice. I really appreciate the time both of you took to read my questions and give me advice :')
Your output indicates that most genes (38) are discarded for being too conserved, producing trees with <5 branches. This confirms my previous experience that cps are highly conserved and that you might want to concat all genes or even produce multiple genome alignments. Anyway, from those top 10 genes GET_PHYLOMARKERS allow you to do independent tree searches under the best-fit model, computing ultrafast bootstrapp and aproximate Bayes branch support values, as explained at https://vinuesa.github.io/get_phylomarkers/#brief-presentation-and-graphical-overview-of-the-pipeline
Please let us know how that goes, Bruno
You should probably add some outgroup cp sequences to increase your taxon sampling and overall alignment/locus signal. Remember that trees with < 5 branches are trivial from a phylogenetic perspective, as there is only 1 topology. Therefore GET_HOMOLOGUES elimnates loci producing gene-trees with < 5 branches. Would be useful that you read de documentation in more detail. Best
Good day GET_PHYLOMARKERS team
First, thanks a lot for this tool. It's really helpful for what I want to do in my thesis project! I'm working with the docker image you provided, which is really handy and easy to use :)
I'm interested in doing a phylogenomic analysis with cloroplast genome sequences. Due to these genomes having ~ 20 genes with introns, I have been using get_homologues-est. Some genomes are from the same species (aprox. 4), but all these genomes are from the same genus.
Until yesterday, I have been testing get_homologues-est and get_phylomarkers with the data that I downloaded from NCBI. I downloaded the CDS and proteins sequences for each genome and changed their name to:
, to have the twin files that get_homologues-est requires (although in the manual, it says that the protein file is optional?).
I did not change the headers of the sequences. I wanted to see if the program could work with them as they are (I read in the manual that the headers have to have a particular format so I was worried that the programs will not accept them). They look like these:
But both pipelines (get_homologues-est and get_phylomarkers) worked great! I was getting my core genes and my top phylomarkers. The problem started yesterday, when I added the twin files (.fna and .faa) of 2 genomes I assembled and annotated using cpGavas2. I was now working with 15 genomes.
The files look like these inside:
Both have the same number of sequences: 82 in fna and faa in one genome, and 81 in fna and faa in the other.
Everything worked fine in get_homologues-est, which I ran with the parameters:
And compare_clusters.pl:
I got 63 clusters (core_BCM directory).
Then I started the get_phylomarkers pipeline with these parameters:
And here it showed me this message (I think the program has a problem with the data of the 2 genomes I just added ...):
The sequences I recently added have both have the same number of sequences in the twin files: 82 in fna and faa in one genome, and 81 in fna and faa in the other.
I'm sure the error message is very helpful but I'm not sure how to check what is wrong ...
Please, any help well be very much apreciated.
Kind regards