Closed vidsvur closed 2 months ago
Hi, two quick comments! You set the option --plotapgmm
but it seems that you didn't provide the anchor pair data via the option -ap
, which is required to perform the anchor Ks GMM analysis. You manually set the species pairs you want, which is good, but you may also use the option --focus2all
instead to set the focal species and automately let species pairs to be between the focal and all the remaining species. The column names in the global_MRBH.tsv.ks.tsv
file should be the same as the species name in the species tree
, could you please share that how your species tree and column names of the global_MRBH.tsv.ks.tsv
file look like? Thanks for your interest in wgd v2
again!
Thank you for your response! I incorporated your changes, but ended up with this
14:11:08 INFO This is wgd v2.0.25 cli.py:32
Traceback (most recent call last):
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/bin/wgd", line 33, in <module>
sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd/cli.py", line 588, in viz
_viz(**kwargs)
File "/home/FCAM/vvuruputoor/wgd/cli.py", line 629, in _viz
multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic)
File "/home/FCAM/vvuruputoor/wgd/wgd/viz.py", line 625, in multi_sp_plot
ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm)
File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 784, in ratediffplot
getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm)
File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 746, in getspairplot_cov_cor
fig,spairs_means_stds_samples,ax_spair,maxim_spair,ks_spair,ax_spair_fp,fig_fp,fig_sigs,ax_sigs = plotspair_cov(df,all_spairs,spairs,focusp,reweight,na=na)
File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 590, in plotspair_cov
kde = stats.gaussian_kde(y,weights=w,bw_method=0.1)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/scipy/stats/kde.py", line 193, in __init__
raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.
global_MRBH.tsv.ks.tsv
looks like this:
pair N S alignmentcoverage alignmentidentity alignmentlength dN dN/dS dS family g1 g2 l node node_averaged_dS_outlierexcluded node_averaged_dS_outlierincluded strippedalignmentlength t weightoutlierexcluded weightoutlierincluded gene1 gene2
Pa1_20727.1__jg10089 5191.0 1808.0 0.9991434689507496 0.9161308758394056 7005 0.0043 0.0131 0.3282 GF00000001 Fhyg_Uconn_BRAKER_genBlastG_sorted_CDS.fasta_16716 pafricana.chr_longest_isoform.cds.fasta_17136 -10964.310334 0 0.35235 0.35235 6999 0.2639 0.5 0.5 jg10089 Pa1_20727.1
Pp08Ghq00947.t1__jg10089 5241.0 1758.0 0.9991434689507496 0.9161308758394056 7005 0.004 0.0107 0.3765 GF00000001 Physcomitrium_patens_longest_isoform.cds.fasta_11556 Fhyg_Uconn_BRAKER_genBlastG_sorted_CDS.fasta_16716 -11047.867212 0 0.35235 0.35235 6999 0.2928 0.5 0.5 Pp08Ghq00947.t1 jg10089
species tree:
(Fhyg_Uconn_BRAKER_genBlastG_sorted_CDS.fasta,(pafricana.chr_longest_isoform.cds.fasta,Physcomitrium_patens_longest_isoform.cds.fasta));
Hi, there is indeed a small bug in the species name prasing of the Ks datafile in the function ratediffplot
. I have fixed that issue and pushed the latest code here and onto PYPI with version 2.0.26. Could you please install the latest commit and try again?
Thanks! I've installed the latest commit, and I'm running
wgd viz -d global_MRBH.tsv.ks.tsv --extraparanomeks pafricana.chr_longest_isoform.cds.fasta.tsv.ks.tsv --focus2all pafricana.chr_longest_isoform.cds.fasta -sp speciestree.txt -o wgd_viz_mixed_Ks_1 --plotkde --plotelmm --plotapgmm -ap wgd_syn/iadhore-out/anchorpoints.txt
I've run the program so far (only been about 4 minutes), and the stdout looks like this:
▽
23:35:58 INFO This is wgd v2.0.26 cli.py:32
The only reason I posted this is because I'm used to seeing lines about analysis kick in about a minute after I start the program-unsure if this was an intended change. Nevertheless, I'll ping here later when the job ends! Thanks again for all your help.
It is indeed a good suggestion. I have added more log information in a newer commit here but not on PYPI
. You can pull as you wish :-)
Hi @heche-psb! I'm trying to use wgd v2.0.26, but I'm running into the following error again:
/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/Bio/Seq.py:2855: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
warnings.warn(
Traceback (most recent call last):
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/bin/wgd", line 33, in <module>
sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd/cli.py", line 588, in viz
_viz(**kwargs)
File "/home/FCAM/vvuruputoor/wgd/cli.py", line 629, in _viz
multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic,toparrow=toparrow)
File "/home/FCAM/vvuruputoor/wgd/wgd/viz.py", line 625, in multi_sp_plot
ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm)
File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 784, in ratediffplot
getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm)
File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 746, in getspairplot_cov_cor
fig,spairs_means_stds_samples,ax_spair,maxim_spair,ks_spair,ax_spair_fp,fig_fp,fig_sigs,ax_sigs = plotspair_cov(df,all_spairs,spairs,focusp,reweight,na=na)
File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 590, in plotspair_cov
kde = stats.gaussian_kde(y,weights=w,bw_method=0.1)
File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/scipy/stats/kde.py", line 193, in __init__
raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.
My command is: wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv --focus2all lim.cds --extraparanomeks wgd_ksd/lim.cds.tsv.ks.tsv --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.txt -o wgd_viz_mixed_Ks --plotapgmm --plotelmm
speciestree: (ixodes_chr.cds,trich_chr.cds((((tgigas.cds,tach.cds),crot.cds),lim.cds)));
the first few lines of wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv
is:
pair N S alignmentcoverage alignmentidentity alignmentlength dN dN/dS dS family g1 g2 l node node_averaged_dS_outlierexcluded node_averaged_dS_outlierincluded strippedalignmentlength t weightoutlierexcluded weightoutlierincluded gene1 gene2
Cr_038751-T1__ISCP_004587.R5732 571.9 181.1 0.045942999532783 0.0587570621468926 19263 0.0265 0.0011 24.3616 GF00000001 crot.cds_13242 ixodes_chr.cds_03514 -1341.830896 2 NaN 32.34685 885 17.6399 NaN 0.25 Cr_038751-T1 ISCP_004587.R5732
ISCP_004587.R5732__evm.model.Hic_chr_6.1380 570.6 182.4 0.045942999532783 0.0587570621468926 19263 0.0268 0.001 26.7505 GF00000001 tach.cds_11523 ixodes_chr.cds_03514 -1343.942476 2 NaN 32.34685 885 19.4976 NaN 0.25 evm.model.Hic_chr_6.1380 ISCP_004587.R5732
Cr_038751-T1__Ta_00006746-RA 587.4 165.6 0.045942999532783 0.0587570621468926 19263 0.0391 0.001 39.0895 GF00000001 trich_chr.cds_09803 crot.cds_13242 -1353.857349 2 NaN 32.34685 885 25.8776 NaN 0.25 Ta_00006746-RA Cr_038751-T1
Ta_00006746-RA__evm.model.Hic_chr_6.1380 587.5 165.5 0.045942999532783 0.0587570621468926 19263 0.0392 0.001 39.1858 GF00000001 trich_chr.cds_09803 tach.cds_11523 -1353.561212 2 NaN 32.34685 885 25.9353 NaN 0.25 Ta_00006746-RA evm.model.Hic_chr_6.1380
ISCP_004587.R5732__g40216 601.8 151.2 0.045942999532783 0.0587570621468926 19263 20.7725 64.9805 0.3197 GF00000001 lim.cds_18572 ixodes_chr.cds_03514 -2028.275088 0 1.39634 1.39634 885 50.0 0.2 0.2 g40216 ISCP_004587.R5732
Hi, how did you get your global_MRBH.tsv.ks.tsv
file? The errors indicate that not every considered species pair has Ks data available.
Ah that's interesting! I did the following command: wgd ksd wgd_globalmrbh/global_MRBH.tsv ../*.cds -o wgd_globalmrbh_ks
My directory has all the seq data saved in the format of a *.cds file!
Could you share me with your wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv
and wgd_ksd/lim.cds.tsv.ks.tsv
files? I will try to reproduce your issue.
wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv
:
pair N S alignmentcoverage alignmentidentity alignmentlength dN dN/dS dS family g1 g2 l node node_averaged_dS_outlierexcluded node_averaged_dS_outlierincluded strippedalignmentlength t weightoutlierexcluded weightoutlierincluded gene1 gene2
Cr_038751-T1__ISCP_004587.R5732 571.9 181.1 0.045942999532783 0.0587570621468926 19263 0.0265 0.0011 24.3616 GF00000001 crot.cds_13242 ixodes_chr.cds_03514 -1341.830896 2 NaN 32.34685 885 17.6399 NaN 0.25 Cr_038751-T1 ISCP_004587.R5732
ISCP_004587.R5732__evm.model.Hic_chr_6.1380 570.6 182.4 0.045942999532783 0.0587570621468926 19263 0.0268 0.001 26.7505 GF00000001 tach.cds_11523 ixodes_chr.cds_03514 -1343.942476 2 NaN 32.34685 885 19.4976 NaN 0.25 evm.model.Hic_chr_6.1380 ISCP_004587.R5732
Cr_038751-T1__Ta_00006746-RA 587.4 165.6 0.045942999532783 0.0587570621468926 19263 0.0391 0.001 39.0895 GF00000001 trich_chr.cds_09803 crot.cds_13242 -1353.857349 2 NaN 32.34685 885 25.8776 NaN 0.25 Ta_00006746-RA Cr_038751-T1
Ta_00006746-RA__evm.model.Hic_chr_6.1380 587.5 165.5 0.045942999532783 0.0587570621468926 19263 0.0392 0.001 39.1858 GF00000001 trich_chr.cds_09803 tach.cds_11523 -1353.561212 2 NaN 32.34685 885 25.9353 NaN 0.25 Ta_00006746-RA evm.model.Hic_chr_6.1380
ISCP_004587.R5732__g40216 601.8 151.2 0.045942999532783 0.0587570621468926 19263 20.7725 64.9805 0.3197 GF00000001 lim.cds_18572 ixodes_chr.cds_03514 -2028.275088 0 1.39634 1.39634 885 50.0 0.2 0.2 g40216 ISCP_004587.R5732
Cr_038751-T1__g40216 591.1 161.9 0.045942999532783 0.0587570621468926 19263 3.3093 2.8952 1.1431 GF00000001 lim.cds_18572 crot.cds_13242 -1994.804078 0 1.39634 1.39634 885 8.5307 0.2 0.2 g40216 Cr_038751-T1
evm.model.Hic_chr_6.1380__g40216 591.1 161.9 0.045942999532783 0.0587570621468926 19263 3.371 2.9733 1.1338 GF00000001 tach.cds_11523 lim.cds_18572 -1993.633127 0 1.39634 1.39634 885 8.6698 0.2 0.2 evm.model.Hic_chr_6.1380 g40216
wgd_ksd/lim.cds.tsv.ks.tsv
:
-bash-4.2$ tail lim.cds.tsv.ks.tsv
g3546__g3547 377.4 105.6 0.688034188034188 0.5714285714285714 702.0 0.5434 0.1186 4.5828 GF00002902 lim.cds_09737 lim.cds_09736 -1136.677534 0.0 4.5828 4.5828 483.0 4.2791 1.0 1.0 g3547 g3546
g4268__g8023 407.2 108.8 0.7889908256880734 0.8255813953488372 654.0 0.0756 0.071 1.0654 GF00002903 lim.cds_11584 lim.cds_02668 -940.777828 0.0 1.0654 1.0654 516.0 0.8529 1.0 1.0 g4268 g8023
g2241__g23648 434.9 132.1 0.9593908629441624 0.7760141093474426 591.0 0.127 0.1125 1.1286 GF00002904 lim.cds_08877 lim.cds_13557 -1147.017268 0.0 1.1286 1.1286 567.0 1.081 1.0 1.0 g2241 g23648
g32887__g43804 274.6 103.4 0.4701492537313433 0.9417989417989416 804.0 0.0476 0.4458 0.1068 GF00002905 lim.cds_25090 lim.cds_26639 -604.196061 0.0 0.1068 0.1068 378.0 0.1914 1.0 1.0 g32887 g43804
g35017__g35019 264.8 92.2 0.5327510917030568 0.5163934426229508 687.0 0.8694 0.7384 1.1775 GF00002906 lim.cds_27755 lim.cds_27754 -904.411027 0.0 1.1775 1.1775 366.0 2.8469 1.0 1.0 g35019 g35017
g31108__g5812 351.6 77.4 0.7566137566137566 0.7948717948717948 567.0 0.1814 0.2357 0.7694 GF00002907 lim.cds_15438 lim.cds_12534 -834.311488 0.0 0.7694 0.7694 429.0 0.8623 1.0 1.0 g31108 g5812
g31875__g4624 241.3 97.7 0.6766467065868264 0.6932153392330384 501.0 0.2596 0.1911 1.3586 GF00002908 lim.cds_15910 lim.cds_11748 -735.369266 0.0 1.3586 1.3586 339.0 1.7292 1.0 1.0 g31875 g4624
g21344__g47291 277.3 94.7 1.0 0.7634408602150538 372.0 0.0258 0.0026 9.9966 GF00002909 lim.cds_01975 lim.cds_08216 -669.771474 0.0 NaN 9.9966 372.0 7.6882 NaN1.0 g21344 g47291
Hi, based on the provided files, the Ks data from species pairs lim.cds__tgigas.cds
, ixodes_chr.cds__tach.cds
, crot.cds__ixodes_chr.cds
, ixodes_chr.cds__tgigas.cds
is missing. Could you check whether these species pairs have orthologous gene pairs and associated Ks data?
Hi, I only see
-bash-4.2$ cd wgd_globalmrbh -bash-4.2$ ls crot.cds_ixodes_chr.cds.rbh.tsv ixodes_chr.cds_lim.cds.rbh.tsv lim.cds_trich_chr.cds.rbh.tsv crot.cds_lim.cds.rbh.tsv ixodes_chr.cds_tach.cds.rbh.tsv tach.cds_tgigas.cds.rbh.tsv crot.cds_tach.cds.rbh.tsv ixodes_chr.cds_tgigas.cds.rbh.tsv tach.cds_trich_chr.cds.rbh.tsv crot.cds_tgigas.cds.rbh.tsv ixodes_chr.cds_trich_chr.cds.rbh.tsv tgigas.cds_trich_chr.cds.rbh.tsv crot.cds_trich_chr.cds.rbh.tsv lim.cds_tach.cds.rbh.tsv global_MRBH.tsv lim.cds_tgigas.cds.rbh.tsv
But I don't see lim.cds__tgigas.cds, ixodes_chr.cdstach.cds, crot.cdsixodes_chr.cds, ixodes_chr.cds__tgigas.cds.. would that be in my wgd_globalmrbh_ks folder?
that folder contains:
-bash-4.2$ cd wgd_globalmrbh_ks -bash-4.2$ ls global_MRBH.tsv.ksd.pdf global_MRBH.tsv.ksd.svg global_MRBH.tsv.ks.tsv
and my wgd_ks folder contains:
-bash-4.2$ cd wgd_ksd -bash-4.2$ ls lim.cds.tsv.ksd.pdf lim.cds.tsv.ksd.svg lim.cds.tsv.ks.tsv
The preview I gave of wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv has all the species pairs, but instead of lim.cds__tgigas.cds, ixodes_chr.cdstach.cds, crot.cdsixodes_chr.cds, ixodes_chr.cds__tgigas.cds the pairs are referred by their gene IDs. Is it easier to send my files over to you to debug? wgd_debug.zip https://drive.google.com/file/d/1tFoJhKYYTzxfk25laaLVpixVND-ghrJ-/view?usp=drive_web Vidya S Vuruputoor
On Tue, 26 Mar 2024 at 04:53, heche-psb @.***> wrote:
Hi, based on the provided files, the Ks data from species pairs lim.cds__tgigas.cds, ixodes_chr.cdstach.cds, crot.cdsixodes_chr.cds, ixodes_chr.cds__tgigas.cds is missing. Could you check whether these species pairs have orthologous gene pairs and associated Ks data?
— Reply to this email directly, view it on GitHub https://github.com/heche-psb/wgd/issues/23#issuecomment-2019853216, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALFEH5MR2FSMS47KYAWSR7DY2ESN3AVCNFSM6AAAAABDYC4MVKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJZHA2TGMRRGY . You are receiving this because you authored the thread.Message ID: @.***>
Hi, I think that's the point. You may try the wgd dmd -gm
with all your considered cds files and then calculate the Ks again.
@heche-psb I did run wgd dmd --globalmrbh ../*.cds -o wgd_globalmrbh
initially. It's re-running now, I'll keep you posted!
Thanks, in the meantime I will also rerun the whole procedure based on your shared data to reproduce your issue.
It ran into the same error, albeit it a more verbose error this time!
/home/FCAM/vvuruputoor/wgd2.28/wgd/wgd_venv/lib/python3.8/site-packages/scipy/stats/kde.py in __init__(self=<scipy.stats.kde.gaussian_kde object>, dataset=[], bw_method=0.1, weights=[])
188
189 """
190 def __init__(self, dataset, bw_method=None, weights=None):
191 self.dataset = atleast_2d(asarray(dataset))
192 if not self.dataset.size > 1:
--> 193 raise ValueError("`dataset` input should have multiple elements.")
194
195 self.d, self.n = self.dataset.shape
196
197 if weights is not None:
ValueError: `dataset` input should have multiple elements.
___________________________________________________________________________
My log file shows this:
18:21:33 INFO This is wgd v2.0.28 cli.py:32
18:21:36 INFO Reading species tree and categorizing ratecorrect.py:876
sister&outgroup species
INFO Composing trios (outgroup,(focal,sister)) ratecorrect.py:889
INFO Sampling, calculating and plotting 200 ratecorrect.py:892
bootstrap replicates for 7 orthologous Ks
distributions using 4 threads (which might
take a while..)
INFO Note that the number of bootstrap ratecorrect.py:893
replicates can be adjusted via the option
--bootstrap
Hi, I just reproduced your issue, which was caused by the kde estimation upon data with only one Ks datapoint (which is strictly supposed to have multiple datapoints). I have fixed it and pushed it as v2.0.29
, please install the latest and have a try. The ELMM analysis of my run is still ongoing and I will post the final result here once I got it.
You should have the log information like below:
$wgd viz -d test_ksd/global_MRBH.tsv.ks.tsv -epk wgd_ksd/lim.cds.tsv.ks.tsv -sp speciestree.txt -fa lim.cds --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -o test_viz --plotapgmm --plotelmm
10:10:54 INFO This is wgd v2.0.29 cli.py:32
10:10:55 INFO Reading species tree and categorizing sister&outgroup species ratecorrect.py:890
INFO Composing trios (outgroup,(focal,sister)) ratecorrect.py:903
INFO Sampling, calculating and plotting 200 bootstrap replicates for 7 orthologous Ks distributions using 4 ratecorrect.py:906
threads (which might take a while..)
INFO Note that the number of bootstrap replicates can be adjusted via the option --bootstrap ratecorrect.py:907
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-heche'
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 38.09it/s]
10:10:59 INFO Sampling done, now making plots ratecorrect.py:714
10:11:03 INFO Species pair crot.cds__ixodes_chr.cds has only one unique orthologous Ks datapoint, ratecorrect.py:740
a random datapoint with negligible difference is thus added
INFO Species pair ixodes_chr.cds__tach.cds has only one unique orthologous Ks datapoint, ratecorrect.py:740
a random datapoint with negligible difference is thus added
10:11:06 INFO Synonymous substitution rate correction done (info written in output) ratecorrect.py:910
Now adding correction shade onto plots
10:12:57 INFO Plotting the final mixed Ks distribution ratecorrect.py:934
10:13:01 INFO ELMM analysis on paralogous Ks distribution ratecorrect.py:398
10:13:02 INFO An extra buffer lognormal component with mean 5.00 and std 0.30 is appended ratecorrect.py:287
INFO Found 4 likely peak signals ratecorrect.py:290
INFO The initiative means and stds is 0.12 0.62 ratecorrect.py:291
INFO The initiative means and stds is 0.22 0.75 ratecorrect.py:291
INFO The initiative means and stds is 4.26 0.75 ratecorrect.py:291
INFO The initiative means and stds is 5.00 0.30 ratecorrect.py:291
INFO Performing EM algorithm from initializated data (Model1) ratecorrect.py:297
10:13:06 INFO BIC of Model1: 22755.45 ratecorrect.py:303
INFO Performing EM algorithm from initializated data plus a random lognormal component (Model2) ratecorrect.py:305
Okay! Installed and re-running, will message if there is any error. Thank you so much, @heche-psb :)
I've been running wgd ever since, and I'm still running into errors. It feels like two steps back, but I'm unable to generate wgd_globalmrbh_final/global_MRBH.tsv
I think your previous command is correct.
wgd dmd --globalmrbh *.fasta -o wgd_globalmrbh
source ~/wgd2_2.0.29/wgd2.0.29/bin/activate
module load python/3.8.1
export PATH=$PATH:/home/FCAM/vvuruputoor/i-adhore-3.0.01/bin
wgd dmd --globalmrbh *.cds -o wgd_globalmrbh_final
Hi @heche-psb, I'm re-running wgd with these files:
crot.cds ixodes_chr.cds lim.cds tgigas.cds trich_chr.cds
What I find a bit fishy is that this step has taken over 24 hours, and still has not generated the global_MRBH.tsv
file
I think it's because the size of cds files is big. A quick test is that you can only retain 100 genes per cds file and then conduct the same command to see the running time. If with only 100 genes per cds file it's still very slow, then I will take a closer look at that relevant code block. Thanks for you patience! :-)
@heche-psb - haha thanks for that one! :) I upped the memory and it ran through.. I'm finally at the wgd viz
stage now.
Command:
wgd viz -d wgd_globalmrbh_ks_final/global_MRBH.tsv.ks.tsv --focus2all lim.cds --extraparanomeks wgd_ksd/lim.cds.tsv.ks.tsv --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.txt -o wgd_viz_mixed_Ks_final --plotapgmm --plotelmm
Species tree:
(ixodes_chr.cds,trich_chr.cds,((tgigas.cds,crot.cds),lim.cds));
note the removal of tach.cds~
CDS in my dir:
crot.cds ixodes_chr.cds lim.cds tgigas.cds trich_chr.cds
Error:
Traceback (most recent call last):
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/bin/wgd", line 33, in <module>
sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 597, in viz
_viz(**kwargs)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 638, in _viz
multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic,toparrow=toparrow,BT=bootstrap,nthreads=nthreads)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/viz.py", line 625, in multi_sp_plot
ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm,BT=BT)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/ratecorrect.py", line 948, in ratediffplot
getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm,BT=BT,nthreads=nthreads)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/ratecorrect.py", line 905, in getspairplot_cov_cor
else: all_spairs,spairs,Trios,Trios_dict = gettrios_overall(focusp,Ingroup_spnames,Outgroup_spnames,Ingroup_clade)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/ratecorrect.py", line 810, in gettrios_overall
mrca = Ingroup_clade.common_ancestor({"name": sister}, {"name": focusp})
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/Bio/Phylo/BaseTree.py", line 469, in common_ancestor
raise ValueError("target %s is not in this tree" % repr(t))
ValueError: target 'name' is not in this tree
okay- so maybe this is a result of my speciestree not being represented correctly.
When i change up my tree a bit: ((trich_chr.cds, ((tgigas.cds, crot.cds), lim.cds)), ixodes_chr.cds);
i run into a longer error message, but the end of the message is:
/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/scipy/linalg/basic.py in inv(a=array([[0.]]), overwrite_a=False, check_finite=True)
972 # all tests pass. Further investigation is required if
973 # more such SEGFAULTs occur.
974 lwork = int(1.01 * lwork)
975 inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
976 if info > 0:
--> 977 raise LinAlgError("singular matrix")
978 if info < 0:
979 raise ValueError('illegal value in %d-th argument of internal '
980 'getrf|getri' % -info)
981 return inv_a
LinAlgError: singular matrix
global_MRBH.tsv.ks.csv speciestree.txt
The above is my input files for my last successful run. Could you try it on your system and see if it works? (Please change the global_MRBH.tsv.ks.csv
into global_MRBH.tsv.ks.tsv
) The command can be like below:
$wgd viz -d global_MRBH.tsv.ks.tsv -epk wgd_ksd/lim.cds.tsv.ks.tsv -sp speciestree.txt -fa lim.cds --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -o test_viz --plotapgmm --plotelmm
Besides, the direct reason of the error LinAlgError: singular matrix
is that some species pairs have only 1 Ks datapoint.
Do you recommend that I remove those species pairs?
Thanks for the above files! Will run a wgd viz on those files.
Do you recommend that I remove those species pairs?
It is indeed an alternative solution that you could use local MRBHs joint at the focal species lim.cds
, instead of using the global MRBHs
which are too stringent. For that you need to run wgd dmd crot.cds ixodes_chr.cds lim.cds tgigas.cds trich_chr.cds -f lim.cds
and then wgd ksd
.
Okay! I was able to get through the first two steps, but I'm hitting an error with wgd viz
My script so far:
#wgd dmd crot.cds ixodes_chr.cds lim.cds tgigas.cds trich_chr.cds -f lim.cds -o wgd_dmd_lim
#wgd ksd wgd_dmd_lim/merge_focus.tsv *.cds -o wgd_lim_ks
wgd viz -d wgd_dmd_lim/merge_focus.tsv --focus2all lim.cds --extraparanomeks wgd_ksd/lim.cds.tsv.ks.tsv -sp ../speciestree.txt -o wgd_viz_mixed_Ks_try1 --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt --plotapgmm --plotelmm
I'm hitting an error with wgd viz
:
Traceback (most recent call last):
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/bin/wgd", line 33, in <module>
sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 597, in viz
_viz(**kwargs)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 633, in _viz
ksdb_df = formatv2(ksdb_df)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/utils.py", line 27, in formatv2
if "weightoutlierexcluded" not in ksdf.columns: weight_inc = get_outlierincluded(ksdf)
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/utils.py", line 48, in get_outlierincluded
weight_inc = 1/df.groupby(['family', 'node'])['dS'].transform('count')
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/pandas/core/frame.py", line 7721, in groupby
return DataFrameGroupBy(
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 882, in __init__
grouper, exclusions, obj = get_grouper(
File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 882, in get_grouper
raise KeyError(gpr)
KeyError: 'family'
Hi, it should be the Ks file to be provided for wgd viz
. You may change wgd viz -d wgd_dmd_lim/merge_focus.tsv
into wgd viz -d wgd_lim_ks/merge_focus.tsv.ks.tsv
.
Ah, should've caught this sooner. Apologies!
Thank you so much! I could create all plots!!🙏
Hello, thank you again for this really useful tool. I was successfully able to run several commands:
wgd dmd --globalmrbh *.fasta -o wgd_globalmrbh
All my sequence files are the longest isoform of the genes. This successfully generated the family file.
I ran
wgd ksd wgd_globalmrbh/global_MRBH.tsv *.fasta -n 16 -o wgd_globalmrbh_ks
and this ran through- and I got auto-generated Ks plots for the paranome as well.When I run:
I run into an error:
species trees have the same name as the fasta files, and i double checked the global_MRBH.tsv.ks.tsv and pafricana.chr_longest_isoform.cds.fasta.tsv.ks.tsv files- and there was no error in these steps (after seeing the recommendations across all other issues)
I also created my own
gene_species.map
- because that was not generated at the wgd ksd step, but even then i run into the same error. Please let me know what I should change. Thank you!