Closed: Bio1nform closed this issue 9 months ago
I installed v2.0.23 with Conda and ran:
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
There are still some errors:
19:28:57 INFO This is wgd v2.0.23 cli.py:32
19:29:08 INFO tmpdir = wgdtmp_d51006c1-8f52-4211-9f39-423941314e0c cli.py:483
19:29:13 INFO Analysing family GF00000001 core.py:2873
19:29:13 INFO Analysing family GF00000002 core.py:2873
19:29:13 INFO Analysing family GF00000003 core.py:2873
19:29:13 INFO Analysing family GF00000004 core.py:2873
19:29:27 WARNING Stripped alignment length == 0 for GF00000004 codeml.py:225
INFO Analysing family GF00000005 core.py:2873
19:29:36 WARNING Stripped alignment length == 0 for GF00000002 codeml.py:225
INFO Analysing family GF00000006 core.py:2873
19:29:39 WARNING No codeml result for GF00000003 due to no resolved nucleotides codeml.py:234
19:55:59 WARNING No codeml result for GF00006547 due to no resolved nucleotides codeml.py:234
19:56:16 INFO Saving to wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv cli.py:493
19:56:18 INFO Making plots cli.py:495
INFO No valid Ks values for plotting cli.py:497
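With several thousand families in a run, it can help to tally how many families hit each warning before deciding whether the problem is global or limited to a few families. The following is a minimal sketch, assuming the log was saved to a text file and uses the message wording shown above; the function name `summarise_wgd_log` is hypothetical and not part of wgd.

```python
import re

# Patterns copied from the messages in the log above (assumed stable wording).
PATTERNS = {
    "analysed": re.compile(r"Analysing family (\S+)"),
    "empty_stripped_alignment": re.compile(r"Stripped alignment length == 0 for (\S+)"),
    "no_codeml_result": re.compile(r"No codeml result for (\S+)"),
}


def summarise_wgd_log(text):
    """Count the distinct gene families that appear in each message type."""
    families = {key: set() for key in PATTERNS}
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            families[key].add(match.group(1))
    return {key: len(ids) for key, ids in families.items()}
```

If nearly every analysed family ends up in one of the two warning groups, as the final "No valid Ks values for plotting" message suggests here, the cause is more likely the environment (aligner or codeml) than the individual families.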
I installed v2.0.23 from PyPI and ran:
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
15:20:53 INFO This is wgd v2.0.23 cli.py:32
15:21:10 INFO tmpdir = wgdtmp_adab2f7f-fb6b-407e-8083-5643b9b4a9fc cli.py:483
15:21:14 INFO Analysing family GF00000001 core.py:2873
15:21:14 INFO Analysing family GF00000002 core.py:2873
15:21:14 INFO Analysing family GF00000003 core.py:2873
15:21:14 INFO Analysing family GF00000004 core.py:2873
15:21:15 INFO Analysing family GF00000005 core.py:2873
Now I get the following error: error.txt
This error shows that something is wrong with the alignment of GF00000001. Could you find the tmp dir for this family and share the GF00000001.cdsaln, GF00000001.codeml, GF00000001.ctrl and pro.aln files with me?
Conda doesn't install paml v4.9j automatically; it installs some other version instead. Could you double-check the paml version in your conda environment for wgd?
These are the only files that are present in GF00000001.
Running MAFFT manually on your pro.fasta.txt works without problems, so it seems something went wrong during the MAFFT step of the analysis. Is MAFFT working properly in your environment on a huge family like GF00000001? One possibility is that the job was not given enough CPU.
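A quick way to see whether a given pro.aln is usable is to count the alignment columns that contain no gaps. This is a minimal sketch, assuming the file is aligned FASTA and that "Stripped alignment length == 0" refers to removing every column that contains a gap; that reading of the warning is an assumption, and the helper names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}; later duplicates overwrite earlier ones."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def gap_free_columns(alignment):
    """Return the number of columns in which no sequence has a gap character '-'."""
    seqs = list(alignment.values())
    if not seqs:
        return 0
    lengths = {len(seq) for seq in seqs}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return sum(1 for column in zip(*seqs) if "-" not in column)
```

For example, `gap_free_columns(read_fasta(open("pro.aln").read()))` run inside the family's tmp dir would return 0 for an alignment in which every column has at least one gap, and raise a `ValueError` if the aligner output was truncated and the rows have unequal lengths.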
I exported the path of paml v4.9j:
export PATH=$PATH:/home/software/GENOMETOOLS/PAML/paml4.9j/bin
It worked with the earlier version. Files for GF00000001 from the conda run: GF00000001_ks.txt.csv pro.aln.txt pro.fasta.txt
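Note that the export appends the PAML directory to the end of PATH, so any codeml already provided earlier on PATH (for example by the conda environment) would still be found first. The sketch below lists every matching executable in lookup order; it assumes that wgd calls codeml by name through PATH, which is not confirmed here, and `find_executables` is a hypothetical helper.

```python
import os


def find_executables(name, path=None):
    """List every executable called `name` on the search path, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Keep only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits
```

Calling `find_executables("codeml")` inside the activated environment shows which binary comes first; if the wrong one wins, prepending the directory (`export PATH=/path/to/paml/bin:$PATH`) changes the order.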
I created a new virtual environment, reinstalled v2.0.23, and wgd ksd runs fine. I can't reproduce your error. I'm not sure whether other users have had the same problem with v2.0.23 only.
Hi, both the PyPI and the conda versions work fine now.
When I run:
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_sync
the output figure seems a bit different.
Two types of dotplot are produced: one in units of genes (Number of genes) and one in units of bases (Number of bases). The file name should indicate which one it is.
I am not getting this figure.
wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv --extraparanomeks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -sp speciestree.nw --reweight -ap wgd_sync/iadhore-out/anchorpoints.txt -o wgd_viz_mixed_Ks_elmm --spair "Aquilegia_coerulea;Protea_cynaroides" --spair "Aquilegia_coerulea;Vitis_vinifera" --spair "Aquilegia_coerulea;Acorus_americanus" --spair "Aquilegia_coerulea;Aquilegia_coerulea" --gsmap wgd_globalmrbh_ks/gene_species.map --plotkde --plotelmm
The figure above is from the website. In my output the peaks are smaller.
It's a simple dotplot in an Oxford grid. The gray dots are homologous gene pairs, while the red dots are anchor pairs. The transparency of the dots can be set manually.
Both node-averaged and node-weighted plots will be produced. Could you show both?
Aquilegia_coerulea_Corrected.ksd.weighted.svg
I think it might be linked to the data we used. What is the source of the Aquilegia coerulea CDS file you are using?
I used the CDS from Phytozome (https://phytozome-next.jgi.doe.gov/info/Acoerulea_v3_1).
The NCBI version of Aquilegia coerulea CDS has several genes with duplicate names. wgd cannot handle duplicated names.
>Aqcoe0131s0003.1
ATGTATATTAAATATGTCACAACCAAAAAAAACTATTGTACTGTTACATATATGCAGGGGGGTACATACAGTATACAAGG
ACGAATCCAGGGGGTGCACGGTGCAACCGCACCCCCAAAATTTGAAATTTTCATTATTTTCCCTATGTTTTTTTGTACAT
ATATCATTATTTCCCTATGCTTTTTGCACGTATATAAAAAATTTAGCTTAAATATGTAG
>Aqcoe0131s0002.1
ATGGTAGATATTACAATTTCTAGGGCACGTTGGACGGAATCAAGATCAAAACTCAAAAAAGATACTATACGACCTTTAAT
TACTCTTTCAGAGCCAAATCCGTACTACATGGTGTCTTTACGCATTGGTACA
>Aqcoe1729s0001.1
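Duplicate sequence names can be detected before running wgd at all. A minimal sketch, assuming the ID is the first whitespace-delimited token of each FASTA header; `duplicated_fasta_ids` is a hypothetical helper, not a wgd function.

```python
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return {sequence_id: count} for every FASTA ID that occurs more than once."""
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

The function accepts any iterable of lines, so a compressed file can be passed directly, for example `duplicated_fasta_ids(gzip.open("cds.fa.gz", "rt"))` after `import gzip` (the file name here is only a placeholder).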
Are you using this file: Acoerulea_322_v3.1.cds_primaryTranscriptOnly.fa.gz?
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_sync
I used Acoerulea_322_v3.1.cds.fa.gz (43550 sequences), so that it matches the mRNA entries in the .gff3 file (43550).
In principle, only one sequence per gene should be used in the construction of the whole paranome, so that each split in the tree represents one gene duplication event. If you use all the alternative CDSs instead of only the primary ones, how do you interpret the biological meaning of each bipartition in the tree?
The issue is that the gene names in Acoerulea_322_v3.1.cds_primaryTranscriptOnly.fa.gz do not match those in the gff3 file.
Here is the gff3 file:
Chr_01 phytozomev11 gene 2657 4987 . + . ID=Aqcoe1G000100.v3.1;Name=Aqcoe1G000100;ancestorIdentifier=Aquca_009_00001.v1.1
Chr_01 phytozomev11 mRNA 2657 4987 . + . ID=Aqcoe1G000100.1.v3.1;Name=Aqcoe1G000100.1;pacid=33083967;longest=1;ancestorIdentifier=Aquca_009_00001.1.v1.1;Parent=Aqcoe1G000100.v3.1
Chr_01 phytozomev11 five_prime_UTR 2657 2841 . + . ID=Aqcoe1G000100.1.v3.1.five_prime_UTR.1;Parent=Aqcoe1G000100.1.v3.1;pacid=33083967
Chr_01 phytozomev11 five_prime_UTR 4435 4439 . + . ID=Aqcoe1G000100.1.v3.1.five_prime_UTR.2;Parent=Aqcoe1G000100.1.v3.1;pacid=33083967
Chr_01 phytozomev11 CDS 4440 4691 . + 0 ID=Aqcoe1G000100.1.v3.1.CDS.1;Parent=Aqcoe1G000100.1.v3.1;pacid=33083967
Chr_01 phytozomev11 three_prime_UTR 4692 4987 . + . ID=Aqcoe1G000100.1.v3.1.three_prime_UTR.1;Parent=Aqcoe1G000100.1.v3.1;pacid=33083967
Chr_01 phytozomev11 gene 3331 3855 . + . ID=Aqcoe1G000200.v3.1;Name=Aqcoe1G000200
Chr_01 phytozomev11 mRNA 3331 3855 . + . ID=Aqcoe1G000200.1.v3.1;Name=Aqcoe1G000200.1;pacid=33082500;longest=1;Parent=Aqcoe1G000200.v3.1
Chr_01 phytozomev11 five_prime_UTR 3331 3563 . + . ID=Aqcoe1G000200.1.v3.1.five_prime_UTR.1;Parent=Aqcoe1G000200.1.v3.1;pacid=33082500
Chr_01 phytozomev11 CDS 3564 3812 . + 0 ID=Aqcoe1G000200.1.v3.1.CDS.1;Parent=Aqcoe1G000200.1.v3.1;pacid=33082500
Chr_01 phytozomev11 three_prime_UTR 3813 3855 . + . ID=Aqcoe1G000200.1.v3.1.three_prime_UTR.1;Parent=Aqcoe1G000200.1.v3.1;pacid=33082500
Here are the sequences:
>Aqcoe1G000100.1
ATGAACATGGGGGACCCATCTAAACTACATGTTAAGGTCAGATTCTGCCTTGCATCAGAACTCTATTGTTGTGTCGATAC
GAGCAAAGGTGCTTTATCTGAACGGCTGGTTTCAATTAAAGAGGAAAGTATGTGCATACTCAAAGATTTTATCACCAAAC
ACAATGTTCCCACTGACATCCCTGAAGAACTTTCTGAAGCTTCTGAAGACGATGACGAAGTCTCTGAGAATCCTCCTAAG
AAACGAAAATGA
>Aqcoe1G000200.1
ATGTGTGGCATTGTGTGCGCATTAGGATTCATTCCTTCTGGGGGCACATTACCAGAACATAAATGGTTTTTCGAATTTGA
CTCCAGCTCCCACTCTTCTAGCTCAGAAACTAAATTGCTGAGTTTTCTTAAATCTTTGGAGCTCCCTGCATCCTCAATTA
GCATTCCACCCAATGGTGGTTGTTGTGTCATAAAAGGAACTTCAGGAGTTGAATGGGAAGCAAATATATTTAATTGTTCA
CTTGGTTGA
I need to remove the .1 at the end of each FASTA header. Also, if there is any name duplication, wgd won't work; I would have to remove the sequences with duplicated names before running it.
Duplicated gene names are normally not allowed, since each gene should have a unique name. Could you use -f mRNA and -a Name for extracting the gene names?
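Whether the FASTA IDs line up with a given feature/attribute combination can be checked directly. The sketch below assumes a tab-separated GFF3 and that -f and -a select the feature type (column 3) and the attribute key (column 9), as the suggestion above implies; the helper names are hypothetical.

```python
def gff3_attribute_values(gff_lines, feature="mRNA", attribute="Name"):
    """Collect one attribute from every GFF3 row of the chosen feature type."""
    values = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature:
            continue
        for field in columns[8].split(";"):
            key, _, value = field.partition("=")
            if key == attribute and value:
                values.add(value)
    return values


def fasta_ids(fasta_lines):
    """Return the first token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    }


def compare_ids(fasta_lines, gff_lines, feature="mRNA", attribute="Name"):
    """Report IDs present on only one side of the FASTA/GFF3 comparison."""
    seq_ids = fasta_ids(fasta_lines)
    gff_ids = gff3_attribute_values(gff_lines, feature, attribute)
    return {"only_in_fasta": seq_ids - gff_ids, "only_in_gff": gff_ids - seq_ids}
```

With the excerpt in this thread, a header such as Aqcoe1G000100.1 matches the Name of the mRNA row, while the same header with the .1 removed matches the Name of the gene row instead, so the header format and the feature/attribute choice have to agree.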
I used -f mRNA -a Name and it works.
The mRNA number and the gene number do not match: genes (Acoerulea_322_v3.1.cds_primaryTranscriptOnly.fa.gz): 30023; mRNA (Acoerulea_322_v3.1.cds.fa.gz): 43550.
The gff3 file likewise contains 30023 genes and 43550 mRNAs.
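One way to reconcile the two counts is to keep a single transcript per gene. The mRNA rows in the gff3 excerpt earlier in this thread carry a longest=1 attribute; the sketch below assumes that this flag marks the primary transcript and that exactly one mRNA per gene has it, neither of which is confirmed here. `primary_transcripts` is a hypothetical helper.

```python
def primary_transcripts(gff_lines):
    """Map each gene ID (the mRNA's Parent) to the Name of the mRNA flagged longest=1."""
    primary = {}
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "mRNA":
            continue
        attrs = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        if attrs.get("longest") == "1" and "Parent" in attrs and "Name" in attrs:
            primary[attrs["Parent"]] = attrs["Name"]
    return primary
```

The number of keys returned should then be comparable to the gene count, and the values give the transcript names to keep in the CDS file.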
OK, but now we know the difference between our results comes from the CDS data we used.
I am getting the error below. Could you please take a look? I do not know what went wrong. wgd ksd wgd_globalmrbh
/home/.conda/envs/wgd223_38/lib/python3.8/site-packages/Bio/Seq.py:2855: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
warnings.warn(
Traceback (most recent call last):
File "/home/.conda/envs/wgd223_38/bin/wgd", line 10, in dataset
input should have multiple elements.")
ValueError: dataset input should have multiple elements.
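The BiopythonWarning at the top of this output is separate from the ValueError: it says that at least one sequence being translated has a length that is not a multiple of three. A minimal sketch for listing such records in a CDS FASTA file follows; `partial_codon_records` is a hypothetical helper and only checks lengths, nothing else about the sequences.

```python
def partial_codon_records(fasta_text):
    """Return {sequence_id: length} for records whose length is not a multiple of 3."""
    lengths, current = {}, None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif line and current is not None:
            # Sequence lines may be wrapped, so lengths are accumulated per record.
            lengths[current] += len(line)
    return {seq_id: n for seq_id, n in lengths.items() if n % 3 != 0}
```

An empty result means every CDS has a whole number of codons; any IDs returned are candidates for trimming or removal before translation.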
What is the complete command that you used?
wgd ksd wgd_globalmrbh/global_MRBH.tsv --extraparanomeks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -sp speciestree.nw --reweight -o wgd_globalmrbh_ks --spair "Aquilegia_coerulea;Protea_cynaroides" --spair "Aquilegia_coerulea;Vitis_vinifera" --spair "Aquilegia_coerulea;Acorus_americanus" --spair "Aquilegia_coerulea;Aquilegia_coerulea" --plotkde -ap wgd_syn/iadhore-out/anchorpoints.txt
It seems you forgot to give the input cds files.
Still the same. I removed .1 from Aqcoe1G000200.1. What do you think is the cause? ValueError: dataset input should have multiple elements.
These are the only outputs: gene_species.map global_MRBH.tsv.ks.tsv
wgd ksd wgd_globalmrbh_G/global_MRBH.tsv --extraparanomeks wgd_ksd_G/Aquilegia_coerulea.tsv.ks.tsv -sp speciestree.nw --reweight -o wgd_globalmrbh_ks_G --spair "Aquilegia_coerulea;Protea_cynaroides" --spair "Aquilegia_coerulea;Vitis_vinifera" --spair "Aquilegia_coerulea;Acorus_americanus" --spair "Aquilegia_coerulea;Aquilegia_coerulea" Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera --plotkde -ap wgd_sync_G/iadhore-out/anchorpoints.txt
Hi, I see that the species tree somehow affects the figure output. See the arrows in the figure.
(((Acorus_americanus,Aquilegia_coerulea),Protea_cynaroides),Vitis_vinifera);
The original one
(((Vitis_vinifera,Protea_cynaroides),Aquilegia_coerulea),Acorus_americanus);
How did you make the species tree file? Did it come from an external source, and if so, what gene input did you use to create it?
Thanks
The relationship between the outgroup and ingroup species determines the result of the substitution rate correction. When you change the species tree in a way that alters this relationship, the result will change. I followed APG IV.
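To see how the two trees in this thread differ from the point of view of the focal species, one can list the sister clades met on the way from that species to the root. This is only a topology comparison under a simple assumption (Newick strings without branch lengths or internal labels); it does not reproduce the rate correction in wgd, and the function names are hypothetical.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested lists of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # consume ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].strip()

    return parse_node()


def leaves(node):
    """Return all leaf names below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def sister_clades(node, focal):
    """Sister clades from the focal leaf up to the root, nearest first; None if absent."""
    if isinstance(node, str):
        return [] if node == focal else None
    for index, child in enumerate(node):
        path = sister_clades(child, focal)
        if path is not None:
            siblings = sorted(
                name
                for other, sibling in enumerate(node)
                if other != index
                for name in leaves(sibling)
            )
            return path + [siblings]
    return None
```

For the first tree above, Aquilegia_coerulea has Acorus_americanus as its closest relative, followed by Protea_cynaroides and then Vitis_vinifera; for the original tree, its sister clade is Vitis_vinifera plus Protea_cynaroides, with Acorus_americanus as the outermost group.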
Sorry for the naive question, but how did you get the tree file? Do I download it from APG IV? The APG IV tree is huge.
Thanks
I used the updated information on the APG IV website.
wgd ksd wgd_globalmrbh/global_MRBH.tsv --extraparanomeks wgd_ksd/Hap1.tsv.ks.tsv -sp speciestree.nw --reweight -o wgd_globalmrbh_ks3 --spair "Hap1;Malus_domestica" --spair "Hap1;Araport_thalania11" --spair "Hap1;Vitis_vinifera" --spair "Hap1;Oryza_sativaJ" --spair "Hap1;Hap1" Hap1 Malus_domestica Araport_thalania11 Vitis_vinifera Oryza_sativaJ --plotkde -ap wgd_syn/iadhore-out/anchorpoints.txt
I cannot see the other peaks. What could be the reason? Thanks
Hi, this is a great tool. I have used version 1 and am now working with version 2. I managed to install it with conda; however, I am getting the following errors:
wgd -h
Usage: wgd [OPTIONS] COMMAND [ARGS]...
  wgd v2 - Copyright (C) 2023-2024 Hengchi Chen
  Contact: heche@psb.vib-ugent.be
Options:
  -v, --verbosity [info|debug] Verbosity level, default = info.
  -h, --help Show this message and exit.
Commands:
  dmd All-vs-all diamond blastp + MCL clustering.
  focus Multiply species RBH or c-score defined orthologous family's gene...
  ksd Paranome and one-to-one ortholog Ks distribution inference...
  mix Mixture modeling of Ks distributions.
  peak Infer peak and CI of Ks distribution.
  syn Co-linearity and anchor inference using I-ADHoRe.
  viz Visualization of Ks distribution or synteny
wgd dmd
09:04:59 INFO This is wgd v1.2 cli.py:32
Traceback (most recent call last):
  File "/home/.conda/envs/WGD/bin/wgd", line 10, in <module>
    sys.exit(cli())
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 113, in dmd
    _dmd(**kwargs)
  File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 116, in _dmd
    from wgd.core import SequenceData, read_MultiRBH_gene_families,mrbh,ortho_infer,genes2fams,endt,segmentsaps,bsog
ModuleNotFoundError: No module named 'wgd.core'
wgd viz
09:05:19 INFO This is wgd v1.2 cli.py:32
Traceback (most recent call last):
  File "/home/.conda/envs/WGD/bin/wgd", line 10, in <module>
    sys.exit(cli())
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 533, in viz
    _viz(**kwargs)
  File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 536, in _viz
    from wgd.viz import elmm_plot, apply_filters, multi_sp_plot, default_plot,all_dotplots,filter_by_minlength,dotplotunitgene,dotplotingene,filter_mingenumber
ImportError: cannot import name 'elmm_plot'
Any help would be great. Thanks
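The tracebacks above show modules being loaded from two places (click from /home/.local/lib/python3.6/site-packages, cli.py from the conda environment), and the banner reports wgd v1.2 although the help text says v2. That pattern is consistent with more than one installation being visible to the same interpreter, although this is a guess from the paths alone. The sketch below reports where a module would be imported from; the function names are hypothetical, and it should be run with the Python interpreter of the environment in question.

```python
import importlib.util
import site
import sys


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


def environment_report(modules=("wgd", "click")):
    """Summarise the interpreter, the user site directory and where modules resolve."""
    report = {
        "python": sys.executable,
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
        "user_site": site.getusersitepackages(),
    }
    for name in modules:
        report[name] = module_origin(name)
    return report
```

If the reported paths for wgd or click point outside the conda environment (for example into the user site directory), an older copy is probably shadowing the new installation.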