Closed hyyuu closed 4 months ago
Hi, thanks for using wgd v2
! It's required to pre-install the program mcl to run the wgd dmd
program. Could you please install the mcl
and try again?
Thank you so much for your prompt reply! It works perfectly now. But I have one more question about the choice of the mixture model. May I ask why is elmm modeling used on paralogous KS values while gmm modeling is used on anchor KS values? I am sorry that I cannot find relevant answers online. Your answer would be greatly appreciated, or any recommendation of relevant papers would be greatly helpful as well.
Thank you very much for your help again!
Best, Alex
Hi, I would suggest reading the paper of ksrates which first implemented the ELMM algorithm on whole-paranome Ks data. In wgd v2
, ELMM analysis in both node-averaged and node-weighted manner is available by the wgd viz
program. The ELMM method on whole-paranome Ks data is based on the assumption that small-scale gene duplication (and loss, SSDL) is the main source of gene duplicates in a genome without WGDs while an additional WGD will lead to the burst of gene duplicates and thus leave a Ks peak on the Ks distribution. Such that the exponential component is to model the background Ks distribution of SSDL while the normal components are to model the further WGDs. The Ks distribution of anchor pairs, on the other hand, is assumed to be only impacted by WGDs, upon which only normal components are thus needed in the mixture model to delineate WGDs. Thanks for your suggestions and I will update doc later to make it clearer!
Hello, thank you for your detailed reply! I will check out the paper later. I am sorry that I have one more question about interpreting the Ks distributions of paralogs. I am not sure whether the inferred components are WGD or just over-fitting.
Below the component b ( red line) has a peak near K=0. Should it be considered a potential WGD event or just a burst of recent gene duplication ( or the recent gene duplication should be represented by the blue line, component a)? As the Ks distribution is L-shape.
Below is another slightly different scenario, the number of duplicates is slightly higher at K=0.1 than K=0.0. Should the red line (component b, Ks= 0.11/0.13) be interpreted as WGD here? lignic parsp
And I observed an interesting result below. Two components, b and c, record peaks at 0.03 and 0.16. Does it mean there are two potential WGDs here or should be considered as over-fitting?
Sorry for the lengthy question. I greatly appreciate your kind help! Thank you.
Regards, Alex
Hi, interesting results. Could you please share or tell me that what is the GMM result of the anchor Ks distribution? If the anchor Ks distribution also has the very recent Ks peak, then you'd better further check the collinear dot plot (with Ks values denoted) to see if there are really massive collinear genes of very recent age. On the other side, young Ks peaks (near zero) are frequently present in transcriptome-derived Ks distributions due to redundancy and alternative splicing. Cautions need to be thus taken for the interpretation of the mixture modeling result of transcriptome-derived Ks distributions.
Hello ! Thank you for your explanation! That's very helpful ! For all these four species, they are tanscriptome-derived Ks distributions. Yet, I ran CD-HIT on the cds file to remove redundancy before running wgd. Are these near-zero peaks still caused by the redundancy? For these graphs, I should consider there is no WGD for now?
I did several more species with transcriptomes. Around 80% of these Ks-distributions exhibit a normal L-shape. Is it because the level of redundancy differs between transcriptomes?
I only have four interested species that have annotations. I tried to the anchor Ks distribution. But no Ks plot is generated after wgd ksd. In the log file, it said :
WARNING No codeml result for GF00008501 due to no codeml.py:291
resolved nucleotides
WARNING No codeml result for GF00008499 due to no codeml.py:291
resolved nucleotides
WARNING No codeml result for GF00008502 due to no codeml.py:291
resolved nucleotides
16:26:58 INFO Saving to cli.py:519
wgd_ksd/Amphibalanus_amphitrite_GCF_019059575.1_NRL
GWU_Aamphi_draft_cds_from_genomic.fna.tsv.ks.tsv
16:27:02 INFO Making plots cli.py:521
INFO No valid Ks values for plotting cli.py:523
I got the same message for all four species with the data downloaded from NCBI RefSeq. Could you please tell me how to fix this? The ftp link of one of my species is below. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/764/305/GCF_000764305.2_Hazt_2.0.2/
Thank you very much again for answering my questions! It really helps me to interpret the result accurately!
Warm regards, Alex
I tried with GCF_000764305.2_Hazt_2.0.2_cds_from_genomic.fna
in this directory which gave me some Ks results. Just to make sure we're using the same commit, I updated the version 2.0.25
just now. Could you please install the latest version and try again? On the other hand, the peaks at Ks= 0.11/0.13/0.16 seem to be indicative of WGDs, despite that it's impossible to claim that the redundancy or alternative splicing is completely eliminated due to the nature of transcriptome. Maybe you can share me with the dotplot? and let's see how much percentages of genomic regions are duplicated (as we would expect many-to-most genomic regions being duplicated due to the young WGD age).
Hello! Thank you for checking the data!
I upgraded wgd, now it can output Ks value. For three species. However, it still said
INFO No valid Ks values for plotting cli.py:523
So, there is no output of Ks plot after wgd ksd.
On the other hand , for this species, I have no Ks value output again. Is it because of the quality of assembly? https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/059/575/GCF_019059575.1_NRLGWU_Aamphi_draft/
Then I continue to run wgd syn for anchor Ks distribution. It failed again. I tried -f with CDS and mRNA. Both have the same error. Should I parse the gff file for this step? Or did I do something wrong ?
ERROR No genes from families file cli.py:701
`wgd_dmd/GCF_011947565.3_Ppol_2.1_cds_from_genomic.
fna.tsv` found in the GFF file for `feature=CDS`
and `attribute=Name`, please double check command
settings.
For the transcriptomic KS distribution, I am sorry that they only have transcriptomes available. There is no genomic data for making dot plots. I only have their BUSCO scores of the transcriptomes ( Duplicate score ranges from 15 to 30, for those Ks= 0.11/0.13/0.16). Is there other way to check these duplicated genomic regions ?
Thank you for your kind help again!
Warm regards, Alex
Hello, I just tried again with wgd syn for the anchor distribution. I parse the sequence to be same as the "Name" as stated in gff. i.e., >XP_xxxx.X, and I ran a CD-Hit on the NCBI Refseq cds file. This time, wgd ksd ran smoothly and Ks plot is generated. However, I got another error at wgd syn, unlike what I replied in the last comment.
nohup: ignoring input
23:44:37 INFO This is wgd v2.0.25 cli.py:32
23:44:40 INFO Note: NumExpr detected 64 cores but utils.py:147
"NUMEXPR_MAX_THREADS" not set, so enforcing safe
limit of 8.
INFO NumExpr defaulting to 8 threads. utils.py:159
INFO Configuring I-ADHoRe co-linearity search cli.py:708
INFO Writing families file syn.py:96
INFO Writing gene lists syn.py:98
ERROR There are duplicated gene IDs for given feature and syn.py:121
attribute
Sorry for throwing two comments at the same time. Thank you very much for your kind help again !
Warms regards, Alex
Hello, I am sorry for bothering you. I still could not solve the bugs above after additional runs. Do you have any idea about this? Your help will be greatly appreciated!
Best, Alex
Hi, there seems to be duplicated gene IDs in your cds file, which is not allowed. It's mandatory to keep each gene ID unique and findable in the gff3 file.
Hello ! Thank you very much for your help ! I successfully ran the anchor pair analyses on a few species now !
However, when I tried on a non-chromosome level assembly, it returned an error at wgd syn
15:59:52 INFO This is wgd v2.0.25 cli.py:32
15:59:54 INFO Note: NumExpr detected 64 cores but utils.py:147
"NUMEXPR_MAX_THREADS" not set, so enforcing safe
limit of 8.
INFO NumExpr defaulting to 8 threads. utils.py:159
INFO Configuring I-ADHoRe co-linearity search cli.py:708
INFO Writing families file syn.py:96
INFO Writing gene lists syn.py:98
15:59:58 INFO Writing config file syn.py:100
16:00:03 INFO Running I-ADHoRe cli.py:712
16:00:10 WARNING WARNING: Maximum allowed number of gaps in the syn.py:186
alignment not specified. Setting to cluster_gap.
WARNING: Tandem gap size not correct in settings
file. Using default (gap_size / 2)
INFO syn.py:187
This is i-ADHoRe v3.0.
Copyright (c) 2002-2010, Flanders Interuniversity
Institute for Biotechnology, VIB.
Algorithm designed by Klaas Vandepoele, Cedric
Simillion, Jan Fostier, Dieter De Witte,
Koen Janssens, Sebastian Proost, Yvan Saeys and
Yves Van de Peer.
Process 1/1 is alive on localhost.
************* i-ADHoRe parameters *************
Number of genelists = 1954
Blast table =
/home/hiuyan/synteny/wgd/wgd_syn/families.tsv
Output path =
/home/hiuyan/synteny/wgd/wgd_syn/iadhore-out/
Gap size = 30
Cluster gap size = 35
Cloud gap size = 0
Cloud cluster gap size = 0
Max gaps in alignment = 35
Tandem gap = 15
Flush output = 1000
Q-value = 0.75
Anchorpoints = 3
Probability cutoff = 0.01
Cloud filtering method = Binomial
Level 2 only = false
Use family = true
Write statistics = false
Alignment method = GreedyGraphbased4
Multiple hypothesis correction = FDR
Number of threads = 1
Compare aligners = false
Collinear searches only
Visualize GHM.png = false
Visualize Alignment = false
Verbose output = true
************ END i-AdDHoRe parameters *********
Creating dataset... done.
(time: 0.033169s)
Mapping gene families... done.
(time: 0.0233521s)
Remapping tandem duplicates... done. (time:
0.005831s)
Writing genelists file... done.
(time: 0.0541091s)
Collinear Search
Level 2 multiplicon detection... done.
(time: 6.67852s)
Profile detection...
1 multiplicons to evaluate - evaluating level 2
multiplicon... 0 new multiplicons found.
Flushing output files...done.
Time for Higher Level Detection: 0.00259781s.
All Done! Bye...
INFO Processing I-ADHoRe output cli.py:716
INFO Dropped 524 scaffolds in hya_1.cds.fa because they viz.py:2701
are on scaffolds shorter than 5000
Traceback (most recent call last):
File "/home/hiuyan/.pyenv/versions/3.6.12/bin/wgd", line 11, in <module>
load_entry_point('wgd==2.0.25', 'console_scripts', 'wgd')()
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 677, in syn
_syn(**kwargs)
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 730, in _syn
segs = filter_mingenumber(segs_gene_unit,mingenenum,outdir,len(gene_order_dict_allsp))
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgd/viz.py", line 2740, in filter_mingenumber
profile = counted.unstack(level=-1).fillna(0)
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/pandas/core/series.py", line 3903, in unstack
return unstack(self, level, fill_value)
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 425, in unstack
obj.index, level=level, constructor=obj._constructor_expanddim,
File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 92, in __init__
self.index = index.remove_unused_levels()
AttributeError: 'Index' object has no attribute 'remove_unused_levels'
Is it because of the relatively low quality of the used genome here ? This is the genome assembly that I used. https://www.ncbi.nlm.nih.gov/assembly/GCF_000764305.2/?shouldredirect=false
I also have another question about the result from the original anchor pair and the segment-guided Ks. For more ancient WGD ( around Ks =2.0), can I say it is also acceptable to propose the incidence of WGDs using the segment-guided Ks result?
Thank you very much for your help again ! Have a nice weekend.
Best, Alex
Hi, could you please share me the gff3 file and the gene family file that you used exactly? Besides, the segment Ks (the median of all residing syntelog Ks values) and anchor pair Ks are supposed to evolve at different rates, while the former of which reflects a more conserved estimate of the age of duplication event. Both are authentic remnants indicative of (ancient) WGD events. To answer your question, my opinion is that it is credible to speculate an ancient WGD event upon both Ks data, but the interpretation has to be made with caution of Ks saturation (when it's over 2).
Hello , thank you very much for the program ! However , I faced an error at the first step running wgd dmd.
I tried with two different cds files, they all returned the same error messages. Could you please help me to fix this error? I installed wgd v2 via pip.
Thank you very much for your help!
(P.S. Sorry that I posted this question on wgdv1 github page. Sorry for the inconvenience. )
Regards, Alex