heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0
21 stars 0 forks source link

error in Ks correction analysis #23

Closed vidsvur closed 2 months ago

vidsvur commented 4 months ago

Hello, thank you again for this really useful tool. I was successfully able to run several commands:

wgd dmd --globalmrbh *.fasta -o wgd_globalmrbh

All my sequence files are the longest isoform of the genes. This successfully generated the family file.

I ran wgd ksd wgd_globalmrbh/global_MRBH.tsv *.fasta -n 16 -o wgd_globalmrbh_ks and this ran through- and I got auto-generated Ks plots for the paranome as well.

When I run:

wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv --extraparanomeks wgd_ksd/pafricana.chr_longest_isoform.cds.fasta.tsv.ks.tsv -sp speciestree.txt -o wgd_viz_mixed_Ks_1 --plotkde  --plotelmm --plotapgmm -o wgd_viz_try2 --spair "pafricana.chr_longest_isoform.cds.fasta;Fhyg__genBlastG_sorted_CDS.fasta" --spair "Ppatens_longest_isoform.cds.fasta;pafricana.chr_longest_isoform.cds.fasta" --spair "pafricana.chr_longest_isoform.cds.fasta;pafricana.chr_longest_isoform.cds.fasta"

I run into an error:

Traceback (most recent call last):
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/bin/wgd", line 33, in <module>
    sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 588, in viz
    _viz(**kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 629, in _viz
    multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic)
  File "/home/FCAM/vvuruputoor/wgd/wgd/viz.py", line 625, in multi_sp_plot
    ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 784, in ratediffplot
    getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 745, in getspairplot_cov_cor
    else: all_spairs,spairs,Trios,Trios_dict = gettrios_overall(focusp,Ingroup_spnames,Outgroup_spnames,Ingroup_clade)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 648, in gettrios_overall
    sppair = "{}".format("__".join(sorted([sister,focusp])))
TypeError: '<' not supported between instances of 'NoneType' and 'str'

species trees have the same name as the fasta files, and i double checked the global_MRBH.tsv.ks.tsv and pafricana.chr_longest_isoform.cds.fasta.tsv.ks.tsv files- and there was no error in these steps (after seeing the recommendations across all other issues)

I also created my own gene_species.map - because that was not generated at the wgd ksd step, but even then i run into the same error. Please let me know what I should change. Thank you!

heche-psb commented 4 months ago

Hi, two quick comments! You set the option --plotapgmm but it seems that you didn't provide the anchor pair data via the option -ap, which is required to perform the anchor Ks GMM analysis. You manually set the species pairs you want, which is good, but you may also use the option --focus2all instead to set the focal species and automately let species pairs to be between the focal and all the remaining species. The column names in the global_MRBH.tsv.ks.tsv file should be the same as the species name in the species tree, could you please share that how your species tree and column names of the global_MRBH.tsv.ks.tsv file look like? Thanks for your interest in wgd v2 again!

vidsvur commented 4 months ago

Thank you for your response! I incorporated your changes, but ended up with this

14:11:08 INFO     This is wgd v2.0.25                                                                                                                                                              cli.py:32
Traceback (most recent call last):
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/bin/wgd", line 33, in <module>
    sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 588, in viz
    _viz(**kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 629, in _viz
    multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic)
  File "/home/FCAM/vvuruputoor/wgd/wgd/viz.py", line 625, in multi_sp_plot
    ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 784, in ratediffplot
    getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 746, in getspairplot_cov_cor
    fig,spairs_means_stds_samples,ax_spair,maxim_spair,ks_spair,ax_spair_fp,fig_fp,fig_sigs,ax_sigs = plotspair_cov(df,all_spairs,spairs,focusp,reweight,na=na)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 590, in plotspair_cov
    kde = stats.gaussian_kde(y,weights=w,bw_method=0.1)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/scipy/stats/kde.py", line 193, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.

global_MRBH.tsv.ks.tsv looks like this:

pair    N   S   alignmentcoverage   alignmentidentity   alignmentlength dN  dN/dS   dS  family  g1  g2  l   node    node_averaged_dS_outlierexcluded    node_averaged_dS_outlierincluded    strippedalignmentlength t   weightoutlierexcluded   weightoutlierincluded   gene1   gene2
Pa1_20727.1__jg10089    5191.0  1808.0  0.9991434689507496  0.9161308758394056  7005    0.0043  0.0131  0.3282  GF00000001  Fhyg_Uconn_BRAKER_genBlastG_sorted_CDS.fasta_16716  pafricana.chr_longest_isoform.cds.fasta_17136   -10964.310334   0   0.35235 0.35235 6999    0.2639  0.5 0.5 jg10089 Pa1_20727.1
Pp08Ghq00947.t1__jg10089    5241.0  1758.0  0.9991434689507496  0.9161308758394056  7005    0.004   0.0107  0.3765  GF00000001  Physcomitrium_patens_longest_isoform.cds.fasta_11556    Fhyg_Uconn_BRAKER_genBlastG_sorted_CDS.fasta_16716  -11047.867212   0   0.35235 0.35235 6999    0.2928  0.5 0.5 Pp08Ghq00947.t1 jg10089

species tree: (Fhyg_Uconn_BRAKER_genBlastG_sorted_CDS.fasta,(pafricana.chr_longest_isoform.cds.fasta,Physcomitrium_patens_longest_isoform.cds.fasta));

heche-psb commented 4 months ago

Hi, there is indeed a small bug in the species name prasing of the Ks datafile in the function ratediffplot. I have fixed that issue and pushed the latest code here and onto PYPI with version 2.0.26. Could you please install the latest commit and try again?

vidsvur commented 4 months ago

Thanks! I've installed the latest commit, and I'm running

wgd viz -d global_MRBH.tsv.ks.tsv --extraparanomeks pafricana.chr_longest_isoform.cds.fasta.tsv.ks.tsv --focus2all pafricana.chr_longest_isoform.cds.fasta -sp speciestree.txt -o wgd_viz_mixed_Ks_1 --plotkde  --plotelmm --plotapgmm -ap wgd_syn/iadhore-out/anchorpoints.txt

I've run the program so far (only been about 4 minutes), and the stdout looks like this:

▽
23:35:58 INFO     This is wgd v2.0.26                                  cli.py:32

The only reason I posted this is because I'm used to seeing lines about analysis kick in about a minute after I start the program-unsure if this was an intended change. Nevertheless, I'll ping here later when the job ends! Thanks again for all your help.

heche-psb commented 4 months ago

It is indeed a good suggestion. I have added more log information in a newer commit here but not on PYPI. You can pull as you wish :-)

vidsvur commented 3 months ago

Hi @heche-psb! I'm trying to use wgd v2.0.26, but I'm running into the following error again:

/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/Bio/Seq.py:2855: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  warnings.warn(
Traceback (most recent call last):
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/bin/wgd", line 33, in <module>
    sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 588, in viz
    _viz(**kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 629, in _viz
    multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic,toparrow=toparrow)
  File "/home/FCAM/vvuruputoor/wgd/wgd/viz.py", line 625, in multi_sp_plot
    ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 784, in ratediffplot
    getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 746, in getspairplot_cov_cor
    fig,spairs_means_stds_samples,ax_spair,maxim_spair,ks_spair,ax_spair_fp,fig_fp,fig_sigs,ax_sigs = plotspair_cov(df,all_spairs,spairs,focusp,reweight,na=na)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 590, in plotspair_cov
    kde = stats.gaussian_kde(y,weights=w,bw_method=0.1)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/scipy/stats/kde.py", line 193, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.

My command is: wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv --focus2all lim.cds --extraparanomeks wgd_ksd/lim.cds.tsv.ks.tsv --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.txt -o wgd_viz_mixed_Ks --plotapgmm --plotelmm

speciestree: (ixodes_chr.cds,trich_chr.cds((((tgigas.cds,tach.cds),crot.cds),lim.cds)));

the first few lines of wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv is:

pair    N   S   alignmentcoverage   alignmentidentity   alignmentlength dN  dN/dS   dS  family  g1  g2  l   node    node_averaged_dS_outlierexcluded    node_averaged_dS_outlierincluded    strippedalignmentlength t   weightoutlierexcluded   weightoutlierincluded   gene1   gene2
Cr_038751-T1__ISCP_004587.R5732 571.9   181.1   0.045942999532783   0.0587570621468926  19263   0.0265  0.0011  24.3616 GF00000001  crot.cds_13242  ixodes_chr.cds_03514    -1341.830896    2   NaN 32.34685    885 17.6399 NaN 0.25    Cr_038751-T1    ISCP_004587.R5732
ISCP_004587.R5732__evm.model.Hic_chr_6.1380 570.6   182.4   0.045942999532783   0.0587570621468926  19263   0.0268  0.001   26.7505 GF00000001  tach.cds_11523  ixodes_chr.cds_03514    -1343.942476    2   NaN 32.34685    885 19.4976 NaN 0.25    evm.model.Hic_chr_6.1380    ISCP_004587.R5732
Cr_038751-T1__Ta_00006746-RA    587.4   165.6   0.045942999532783   0.0587570621468926  19263   0.0391  0.001   39.0895 GF00000001  trich_chr.cds_09803 crot.cds_13242  -1353.857349    2   NaN 32.34685    885 25.8776 NaN 0.25    Ta_00006746-RA  Cr_038751-T1
Ta_00006746-RA__evm.model.Hic_chr_6.1380    587.5   165.5   0.045942999532783   0.0587570621468926  19263   0.0392  0.001   39.1858 GF00000001  trich_chr.cds_09803 tach.cds_11523  -1353.561212    2   NaN 32.34685    885 25.9353 NaN 0.25    Ta_00006746-RA  evm.model.Hic_chr_6.1380
ISCP_004587.R5732__g40216   601.8   151.2   0.045942999532783   0.0587570621468926  19263   20.7725 64.9805 0.3197  GF00000001  lim.cds_18572   ixodes_chr.cds_03514    -2028.275088    0   1.39634 1.39634 885 50.0    0.2 0.2 g40216  ISCP_004587.R5732
heche-psb commented 3 months ago

Hi, how did you get your global_MRBH.tsv.ks.tsv file? The errors indicate that not every considered species pair has Ks data available.

vidsvur commented 3 months ago

Ah that's interesting! I did the following command: wgd ksd wgd_globalmrbh/global_MRBH.tsv ../*.cds -o wgd_globalmrbh_ks

My directory has all the seq data saved in the format of a *.cds file!

heche-psb commented 3 months ago

Could you share me with your wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv and wgd_ksd/lim.cds.tsv.ks.tsv files? I will try to reproduce your issue.

vidsvur commented 3 months ago

wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv:

pair    N   S   alignmentcoverage   alignmentidentity   alignmentlength dN  dN/dS   dS  family  g1  g2  l   node    node_averaged_dS_outlierexcluded    node_averaged_dS_outlierincluded    strippedalignmentlength t   weightoutlierexcluded   weightoutlierincluded   gene1   gene2
Cr_038751-T1__ISCP_004587.R5732 571.9   181.1   0.045942999532783   0.0587570621468926  19263   0.0265  0.0011  24.3616 GF00000001  crot.cds_13242  ixodes_chr.cds_03514    -1341.830896    2   NaN 32.34685    885 17.6399 NaN 0.25    Cr_038751-T1    ISCP_004587.R5732
ISCP_004587.R5732__evm.model.Hic_chr_6.1380 570.6   182.4   0.045942999532783   0.0587570621468926  19263   0.0268  0.001   26.7505 GF00000001  tach.cds_11523  ixodes_chr.cds_03514    -1343.942476    2   NaN 32.34685    885 19.4976 NaN 0.25    evm.model.Hic_chr_6.1380    ISCP_004587.R5732
Cr_038751-T1__Ta_00006746-RA    587.4   165.6   0.045942999532783   0.0587570621468926  19263   0.0391  0.001   39.0895 GF00000001  trich_chr.cds_09803 crot.cds_13242  -1353.857349    2   NaN 32.34685    885 25.8776 NaN 0.25    Ta_00006746-RA  Cr_038751-T1
Ta_00006746-RA__evm.model.Hic_chr_6.1380    587.5   165.5   0.045942999532783   0.0587570621468926  19263   0.0392  0.001   39.1858 GF00000001  trich_chr.cds_09803 tach.cds_11523  -1353.561212    2   NaN 32.34685    885 25.9353 NaN 0.25    Ta_00006746-RA  evm.model.Hic_chr_6.1380
ISCP_004587.R5732__g40216   601.8   151.2   0.045942999532783   0.0587570621468926  19263   20.7725 64.9805 0.3197  GF00000001  lim.cds_18572   ixodes_chr.cds_03514    -2028.275088    0   1.39634 1.39634 885 50.0    0.2 0.2 g40216  ISCP_004587.R5732
Cr_038751-T1__g40216    591.1   161.9   0.045942999532783   0.0587570621468926  19263   3.3093  2.8952  1.1431  GF00000001  lim.cds_18572   crot.cds_13242  -1994.804078    0   1.39634 1.39634 885 8.5307  0.2 0.2 g40216  Cr_038751-T1
evm.model.Hic_chr_6.1380__g40216    591.1   161.9   0.045942999532783   0.0587570621468926  19263   3.371   2.9733  1.1338  GF00000001  tach.cds_11523  lim.cds_18572   -1993.633127    0   1.39634 1.39634 885 8.6698  0.2 0.2 evm.model.Hic_chr_6.1380    g40216

wgd_ksd/lim.cds.tsv.ks.tsv:

-bash-4.2$ tail lim.cds.tsv.ks.tsv
g3546__g3547    377.4   105.6   0.688034188034188   0.5714285714285714  702.0   0.5434  0.1186  4.5828  GF00002902  lim.cds_09737   lim.cds_09736   -1136.677534    0.0 4.5828  4.5828  483.0   4.2791  1.0 1.0 g3547   g3546
g4268__g8023    407.2   108.8   0.7889908256880734  0.8255813953488372  654.0   0.0756  0.071   1.0654  GF00002903  lim.cds_11584   lim.cds_02668   -940.777828 0.0 1.0654  1.0654  516.0   0.8529  1.0 1.0 g4268   g8023
g2241__g23648   434.9   132.1   0.9593908629441624  0.7760141093474426  591.0   0.127   0.1125  1.1286  GF00002904  lim.cds_08877   lim.cds_13557   -1147.017268    0.0 1.1286  1.1286  567.0   1.081   1.0 1.0 g2241   g23648
g32887__g43804  274.6   103.4   0.4701492537313433  0.9417989417989416  804.0   0.0476  0.4458  0.1068  GF00002905  lim.cds_25090   lim.cds_26639   -604.196061 0.0 0.1068  0.1068  378.0   0.1914  1.0 1.0 g32887  g43804
g35017__g35019  264.8   92.2    0.5327510917030568  0.5163934426229508  687.0   0.8694  0.7384  1.1775  GF00002906  lim.cds_27755   lim.cds_27754   -904.411027 0.0 1.1775  1.1775  366.0   2.8469  1.0 1.0 g35019  g35017
g31108__g5812   351.6   77.4    0.7566137566137566  0.7948717948717948  567.0   0.1814  0.2357  0.7694  GF00002907  lim.cds_15438   lim.cds_12534   -834.311488 0.0 0.7694  0.7694  429.0   0.8623  1.0 1.0 g31108  g5812
g31875__g4624   241.3   97.7    0.6766467065868264  0.6932153392330384  501.0   0.2596  0.1911  1.3586  GF00002908  lim.cds_15910   lim.cds_11748   -735.369266 0.0 1.3586  1.3586  339.0   1.7292  1.0 1.0 g31875  g4624
g21344__g47291  277.3   94.7    1.0 0.7634408602150538  372.0   0.0258  0.0026  9.9966  GF00002909  lim.cds_01975   lim.cds_08216   -669.771474 0.0 NaN 9.9966  372.0   7.6882  NaN1.0  g21344  g47291
heche-psb commented 3 months ago

Hi, based on the provided files, the Ks data from species pairs lim.cds__tgigas.cds, ixodes_chr.cds__tach.cds, crot.cds__ixodes_chr.cds, ixodes_chr.cds__tgigas.cds is missing. Could you check whether these species pairs have orthologous gene pairs and associated Ks data?

vidsvur commented 3 months ago

Hi, I only see

-bash-4.2$ cd wgd_globalmrbh -bash-4.2$ ls crot.cds_ixodes_chr.cds.rbh.tsv ixodes_chr.cds_lim.cds.rbh.tsv lim.cds_trich_chr.cds.rbh.tsv crot.cds_lim.cds.rbh.tsv ixodes_chr.cds_tach.cds.rbh.tsv tach.cds_tgigas.cds.rbh.tsv crot.cds_tach.cds.rbh.tsv ixodes_chr.cds_tgigas.cds.rbh.tsv tach.cds_trich_chr.cds.rbh.tsv crot.cds_tgigas.cds.rbh.tsv ixodes_chr.cds_trich_chr.cds.rbh.tsv tgigas.cds_trich_chr.cds.rbh.tsv crot.cds_trich_chr.cds.rbh.tsv lim.cds_tach.cds.rbh.tsv global_MRBH.tsv lim.cds_tgigas.cds.rbh.tsv

But I don't see lim.cds__tgigas.cds, ixodes_chr.cdstach.cds, crot.cdsixodes_chr.cds, ixodes_chr.cds__tgigas.cds.. would that be in my wgd_globalmrbh_ks folder?

that folder contains:

-bash-4.2$ cd wgd_globalmrbh_ks -bash-4.2$ ls global_MRBH.tsv.ksd.pdf global_MRBH.tsv.ksd.svg global_MRBH.tsv.ks.tsv

and my wgd_ks folder contains:

-bash-4.2$ cd wgd_ksd -bash-4.2$ ls lim.cds.tsv.ksd.pdf lim.cds.tsv.ksd.svg lim.cds.tsv.ks.tsv

The preview I gave of wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv has all the species pairs, but instead of lim.cds__tgigas.cds, ixodes_chr.cdstach.cds, crot.cdsixodes_chr.cds, ixodes_chr.cds__tgigas.cds the pairs are referred by their gene IDs. Is it easier to send my files over to you to debug? wgd_debug.zip https://drive.google.com/file/d/1tFoJhKYYTzxfk25laaLVpixVND-ghrJ-/view?usp=drive_web Vidya S Vuruputoor

On Tue, 26 Mar 2024 at 04:53, heche-psb @.***> wrote:

Hi, based on the provided files, the Ks data from species pairs lim.cds__tgigas.cds, ixodes_chr.cdstach.cds, crot.cdsixodes_chr.cds, ixodes_chr.cds__tgigas.cds is missing. Could you check whether these species pairs have orthologous gene pairs and associated Ks data?

— Reply to this email directly, view it on GitHub https://github.com/heche-psb/wgd/issues/23#issuecomment-2019853216, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALFEH5MR2FSMS47KYAWSR7DY2ESN3AVCNFSM6AAAAABDYC4MVKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJZHA2TGMRRGY . You are receiving this because you authored the thread.Message ID: @.***>

heche-psb commented 3 months ago

Hi, I think that's the point. You may try the wgd dmd -gm with all your considered cds files and then calculate the Ks again.

vidsvur commented 3 months ago

@heche-psb I did run wgd dmd --globalmrbh ../*.cds -o wgd_globalmrbh initially. It's re-running now, I'll keep you posted!

heche-psb commented 3 months ago

Thanks, in the meantime I will also rerun the whole procedure based on your shared data to reproduce your issue.

vidsvur commented 3 months ago

It ran into the same error, albeit it a more verbose error this time!

/home/FCAM/vvuruputoor/wgd2.28/wgd/wgd_venv/lib/python3.8/site-packages/scipy/stats/kde.py in __init__(self=<scipy.stats.kde.gaussian_kde object>, dataset=[], bw_method=0.1, weights=[])
    188
    189     """
    190     def __init__(self, dataset, bw_method=None, weights=None):
    191         self.dataset = atleast_2d(asarray(dataset))
    192         if not self.dataset.size > 1:
--> 193             raise ValueError("`dataset` input should have multiple elements.")
    194
    195         self.d, self.n = self.dataset.shape
    196
    197         if weights is not None:

ValueError: `dataset` input should have multiple elements.
___________________________________________________________________________

My log file shows this:

18:21:33 INFO     This is wgd v2.0.28                                  cli.py:32
18:21:36 INFO     Reading species tree and categorizing       ratecorrect.py:876
                  sister&outgroup species
         INFO     Composing trios (outgroup,(focal,sister))   ratecorrect.py:889
         INFO     Sampling, calculating and plotting 200      ratecorrect.py:892
                  bootstrap replicates for 7 orthologous Ks
                  distributions using 4 threads (which might
                  take a while..)
         INFO     Note that the number of bootstrap           ratecorrect.py:893
                  replicates can be adjusted via the option
                  --bootstrap
heche-psb commented 3 months ago

Hi, I just reproduced your issue, which was caused by the kde estimation upon data with only one Ks datapoint (which is strictly supposed to have multiple datapoints). I have fixed it and pushed it as v2.0.29, please install the latest and have a try. The ELMM analysis of my run is still ongoing and I will post the final result here once I got it.

You should have the log information like below:

$wgd viz -d test_ksd/global_MRBH.tsv.ks.tsv -epk wgd_ksd/lim.cds.tsv.ks.tsv -sp speciestree.txt -fa lim.cds --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -o test_viz --plotapgmm --plotelmm
10:10:54 INFO     This is wgd v2.0.29                                                                                                  cli.py:32
10:10:55 INFO     Reading species tree and categorizing sister&outgroup species                                               ratecorrect.py:890
         INFO     Composing trios (outgroup,(focal,sister))                                                                   ratecorrect.py:903
         INFO     Sampling, calculating and plotting 200 bootstrap replicates for 7 orthologous Ks distributions using 4      ratecorrect.py:906
                  threads (which might take a while..)
         INFO     Note that the number of bootstrap replicates can be adjusted via the option --bootstrap                     ratecorrect.py:907
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-heche'
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 38.09it/s]
10:10:59 INFO     Sampling done, now making plots                                                                             ratecorrect.py:714
10:11:03 INFO     Species pair crot.cds__ixodes_chr.cds has only one unique orthologous Ks datapoint,                         ratecorrect.py:740
                  a random datapoint with negligible difference is thus added
         INFO     Species pair ixodes_chr.cds__tach.cds has only one unique orthologous Ks datapoint,                         ratecorrect.py:740
                  a random datapoint with negligible difference is thus added
10:11:06 INFO     Synonymous substitution rate correction done (info written in output)                                       ratecorrect.py:910
                  Now adding correction shade onto plots
10:12:57 INFO     Plotting the final mixed Ks distribution                                                                    ratecorrect.py:934
10:13:01 INFO     ELMM analysis on paralogous Ks distribution                                                                 ratecorrect.py:398
10:13:02 INFO     An extra buffer lognormal component with mean 5.00 and std 0.30 is appended                                 ratecorrect.py:287
         INFO     Found 4 likely peak signals                                                                                 ratecorrect.py:290
         INFO     The initiative means and stds is 0.12 0.62                                                                  ratecorrect.py:291
         INFO     The initiative means and stds is 0.22 0.75                                                                  ratecorrect.py:291
         INFO     The initiative means and stds is 4.26 0.75                                                                  ratecorrect.py:291
         INFO     The initiative means and stds is 5.00 0.30                                                                  ratecorrect.py:291
         INFO     Performing EM algorithm from initializated data (Model1)                                                    ratecorrect.py:297
10:13:06 INFO     BIC of Model1: 22755.45                                                                                     ratecorrect.py:303
         INFO     Performing EM algorithm from initializated data plus a random lognormal component (Model2)                  ratecorrect.py:305
vidsvur commented 3 months ago

Okay! Installed and re-running, will message if there is any error. Thank you so much, @heche-psb :)

heche-psb commented 3 months ago

Mixed.ks.lim.cds.node.weighted.pdf

vidsvur commented 3 months ago

I've been running wgd ever since, and I'm still running into errors. It feels like two steps back, but I'm unable to generate wgd_globalmrbh_final/global_MRBH.tsv

heche-psb commented 3 months ago

I think your previous command is correct.

wgd dmd --globalmrbh *.fasta -o wgd_globalmrbh

vidsvur commented 3 months ago
source ~/wgd2_2.0.29/wgd2.0.29/bin/activate
module load python/3.8.1

export PATH=$PATH:/home/FCAM/vvuruputoor/i-adhore-3.0.01/bin
wgd dmd --globalmrbh *.cds -o wgd_globalmrbh_final

Hi @heche-psb, I'm re-running wgd with these files:

crot.cds  ixodes_chr.cds  lim.cds  tgigas.cds  trich_chr.cds

What I find a bit fishy is that this step has taken over 24 hours, and still has not generated the global_MRBH.tsv file

heche-psb commented 3 months ago

I think it's because the size of cds files is big. A quick test is that you can only retain 100 genes per cds file and then conduct the same command to see the running time. If with only 100 genes per cds file it's still very slow, then I will take a closer look at that relevant code block. Thanks for you patience! :-)

vidsvur commented 3 months ago

@heche-psb - haha thanks for that one! :) I upped the memory and it ran through.. I'm finally at the wgd viz stage now.

Command: wgd viz -d wgd_globalmrbh_ks_final/global_MRBH.tsv.ks.tsv --focus2all lim.cds --extraparanomeks wgd_ksd/lim.cds.tsv.ks.tsv --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.txt -o wgd_viz_mixed_Ks_final --plotapgmm --plotelmm

Species tree: (ixodes_chr.cds,trich_chr.cds,((tgigas.cds,crot.cds),lim.cds)); note the removal of tach.cds~

CDS in my dir: crot.cds ixodes_chr.cds lim.cds tgigas.cds trich_chr.cds

Error:

Traceback (most recent call last):
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/bin/wgd", line 33, in <module>
    sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 597, in viz
    _viz(**kwargs)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 638, in _viz
    multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic,toparrow=toparrow,BT=bootstrap,nthreads=nthreads)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/viz.py", line 625, in multi_sp_plot
    ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm,BT=BT)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/ratecorrect.py", line 948, in ratediffplot
    getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm,BT=BT,nthreads=nthreads)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/ratecorrect.py", line 905, in getspairplot_cov_cor
    else: all_spairs,spairs,Trios,Trios_dict = gettrios_overall(focusp,Ingroup_spnames,Outgroup_spnames,Ingroup_clade)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/ratecorrect.py", line 810, in gettrios_overall
    mrca = Ingroup_clade.common_ancestor({"name": sister}, {"name": focusp})
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/Bio/Phylo/BaseTree.py", line 469, in common_ancestor
    raise ValueError("target %s is not in this tree" % repr(t))
ValueError: target 'name' is not in this tree
vidsvur commented 3 months ago

okay- so maybe this is a result of my speciestree not being represented correctly. When i change up my tree a bit: ((trich_chr.cds, ((tgigas.cds, crot.cds), lim.cds)), ixodes_chr.cds);

i run into a longer error message, but the end of the message is:

/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/scipy/linalg/basic.py in inv(a=array([[0.]]), overwrite_a=False, check_finite=True)
    972         # all tests pass. Further investigation is required if
    973         # more such SEGFAULTs occur.
    974         lwork = int(1.01 * lwork)
    975         inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
    976     if info > 0:
--> 977         raise LinAlgError("singular matrix")
    978     if info < 0:
    979         raise ValueError('illegal value in %d-th argument of internal '
    980                          'getrf|getri' % -info)
    981     return inv_a

LinAlgError: singular matrix
heche-psb commented 3 months ago

global_MRBH.tsv.ks.csv speciestree.txt

The above is my input files for my last successful run. Could you try it on your system and see if it works? (Please change the global_MRBH.tsv.ks.csv into global_MRBH.tsv.ks.tsv) The command can be like below:

$wgd viz -d global_MRBH.tsv.ks.tsv -epk wgd_ksd/lim.cds.tsv.ks.tsv -sp speciestree.txt -fa lim.cds --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt -o test_viz --plotapgmm --plotelmm

heche-psb commented 3 months ago

Besides, the direct reason of the error LinAlgError: singular matrix is that some species pairs have only 1 Ks datapoint.

vidsvur commented 3 months ago

Do you recommend that I remove those species pairs?

vidsvur commented 3 months ago

Thanks for the above files! Will run a wgd viz on those files.

heche-psb commented 3 months ago

Do you recommend that I remove those species pairs?

It is indeed an alternative solution that you could use local MRBHs joint at the focal species lim.cds, instead of using the global MRBHs which are too stringent. For that you need to run wgd dmd crot.cds ixodes_chr.cds lim.cds tgigas.cds trich_chr.cds -f lim.cds and then wgd ksd.

vidsvur commented 3 months ago

Okay! I was able to get through the first two steps, but I'm hitting an error with wgd viz

My script so far:

#wgd dmd crot.cds ixodes_chr.cds lim.cds tgigas.cds trich_chr.cds -f lim.cds -o wgd_dmd_lim
#wgd ksd wgd_dmd_lim/merge_focus.tsv *.cds -o wgd_lim_ks
wgd viz -d wgd_dmd_lim/merge_focus.tsv --focus2all lim.cds --extraparanomeks wgd_ksd/lim.cds.tsv.ks.tsv -sp ../speciestree.txt -o wgd_viz_mixed_Ks_try1 --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt --plotapgmm --plotelmm

I'm hitting an error with wgd viz:

Traceback (most recent call last):
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/bin/wgd", line 33, in <module>
    sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 597, in viz
    _viz(**kwargs)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/cli.py", line 633, in _viz
    ksdb_df = formatv2(ksdb_df)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/utils.py", line 27, in formatv2
    if "weightoutlierexcluded" not in ksdf.columns: weight_inc = get_outlierincluded(ksdf)
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd/utils.py", line 48, in get_outlierincluded
    weight_inc = 1/df.groupby(['family', 'node'])['dS'].transform('count')
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/pandas/core/frame.py", line 7721, in groupby
    return DataFrameGroupBy(
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 882, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/home/FCAM/vvuruputoor/wgd2_2.0.29/wgd2.0.29/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 882, in get_grouper
    raise KeyError(gpr)
KeyError: 'family'
heche-psb commented 3 months ago

Hi, it should be the Ks file to be provided for wgd viz. You may change wgd viz -d wgd_dmd_lim/merge_focus.tsv into wgd viz -d wgd_lim_ks/merge_focus.tsv.ks.tsv.

vidsvur commented 3 months ago

Ah, should've caught this sooner. Apologies!

vidsvur commented 2 months ago

Thank you so much! I could create all plots!!🙏