heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0
25 stars 0 forks source link

wgd dmd #31

Closed Malabady closed 5 months ago

Malabady commented 6 months ago

Hi, I trying to analysis a single species using WGD V2. In the species delineation step, I am sure how to generate the families file that I need in the Ks step. I tried all relevant wgd dmd options, but none of them writes a "families" single file. I am sure I am missing something here. I'd really appreciate your answers.

best,

heche-psb commented 6 months ago

Hi, are you up to a whole paranome? that a simple command wgd dmd $file should do.

Malabady commented 6 months ago

Hi. Many thanks for the quick reply. Yes, I am tying to do a whole paranome. I run the command "wgd dmd cds.fa" but it didnt produce "familes" file. do I need to pass the dmd output dir to ks?

heche-psb commented 6 months ago

Hi, the "familes" file is what you get as the result from the wgd dmd $file command actually.

Malabady commented 6 months ago

Thanks a lot for your prompt answers. When I run wgd dmd cds.fa, it produces cds.fa.tsv, but not families.mcl. In the wgd ksd -h, it provides the following whole-paranome example wgd ksd families.mcl cds.fasta. that's why I was looking for a file explicitly names families.mcl.

So, if I understand you correctly, the file cds.fa.tsv is the same as families.mcl, correct?

My goal from this analysis is to identify and date the WGD event(s) in my genome. towards this goal, do you think using the protein sequences will perform better than the cds sequences.

Malabady commented 6 months ago

I wend ahead and tested wgd ksd pep.cds.tsv cds.fa, but got many of the following warnings:

2024-04-24 19:14:17 WARNING Stripped alignment length == 0 for GF00000004 codeml.py:274 INFO Analysing family GF00000005 core.py:3064 followed by the following error:

`` multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in call return self.func(*args, kwargs) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in call return [func(*args, *kwargs) for func, args, kwargs in self.items] File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in return [func(args, kwargs) for func, args, kwargs in self.items] File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3221, in _get_ks family.get_ks() File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3066, in get_ks self.run_codeml() File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3149, in run_codeml preserve=True, times=self.codeml_iter) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/codeml.py", line 288, in run_codeml results = _run_codeml(self.exe, self.control_file, self.out_file, *kwargs) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/codeml.py", line 126, in _run_codeml sp.run([exe, control_file], stdout=sp.PIPE) File "/usr/lib64/python3.6/subprocess.py", line 423, in run with Popen(popenargs, **kwargs) as process: File "/usr/lib64/python3.6/subprocess.py", line 729, in init restore_signals, start_new_session) File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) FileNotFoundError: [Errno 2] No such file or directory: 'codeml': 'codeml'

`` I am assuming this has to do with the families. Correct?

heche-psb commented 6 months ago

Thanks a lot for your prompt answers. When I run wgd dmd cds.fa, it produces cds.fa.tsv, but not families.mcl. In the wgd ksd -h, it provides the following whole-paranome example wgd ksd families.mcl cds.fasta. that's why I was looking for a file explicitly names families.mcl.

So, if I understand you correctly, the file cds.fa.tsv is the same as families.mcl, correct?

My goal from this analysis is to identify and date the WGD event(s) in my genome. towards this goal, do you think using the protein sequences will perform better than the cds sequences.

To construct a Ks distribution, a cds file is essential. Protein file can be used in the final phylogenetic dating step, but optionally. The cds.fa.tsv is the whole paranome family file that you can give to wgd ksd for the construction of Ks distribution.

I wend ahead and tested wgd ksd pep.cds.tsv cds.fa, but got many of the following warnings:

2024-04-24 19:14:17 WARNING Stripped alignment length == 0 for GF00000004 codeml.py:274 INFO Analysing family GF00000005 core.py:3064 followed by the following error:

`` multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in call return self.func(*args, kwargs) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in call return [func(*args, *kwargs) for func, args, kwargs in self.items] File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in return [func(args, kwargs) for func, args, kwargs in self.items] File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3221, in _get_ks family.get_ks() File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3066, in get_ks self.run_codeml() File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3149, in run_codeml preserve=True, times=self.codeml_iter) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/codeml.py", line 288, in run_codeml results = _run_codeml(self.exe, self.control_file, self.out_file, *kwargs) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/codeml.py", line 126, in _run_codeml sp.run([exe, control_file], stdout=sp.PIPE) File "/usr/lib64/python3.6/subprocess.py", line 423, in run with Popen(popenargs, kwargs) as process: File "/usr/lib64/python3.6/subprocess.py", line 729, in init** restore_signals, start_new_session) File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) FileNotFoundError: [Errno 2] No such file or directory: 'codeml': 'codeml'

`` I am assuming this has to do with the families. Correct?

The log seems to indicate that you didn't preinstall paml in your environment variables. You may install wgd v2 via bioconda to easily fulfill the dependency requirement.

Malabady commented 6 months ago

Thanks. I was thinking of using the protein file with dmd since it has a protein input option and proteins are more suitable for inferring orthogroups. then, with the ksd, i will use the cds file.

When I use the cds as input for dmd, i get the following warning: /apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/Bio/Seq.py:2859: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.

Do i need to trim the sequences or wgd is doing that already? this warning is what made me think of using the protein file to generate the whole panarome family.

for some reasons, installing using conda on our HPC is not working and they had to install the tool from the source using the described method:

git clone https://github.com/heche-psb/wgd cd wgd virtualenv -p=python3 ENV (or python3 -m venv ENV) source ENV/bin/activate python setup.py install (or python -m pip install -r requirements.txt) pip install .

We installed PAML withing the wgd environment, which seems to be working. But when I run the ksd step, i get tons of the following warnings:

`` INFO Analysing family GF00009014 core.py:3064 2024-04-25 12:49:53 WARNING No codeml result for GF00009011 due to no resolved nucleotides codeml.py:291 INFO Analysing family GF00009015 core.py:3064 2024-04-25 12:49:53 WARNING No codeml result for GF00009012 due to no resolved nucleotides codeml.py:291 INFO Analysing family GF00009016 core.py:3064 2024-04-25 12:49:53 WARNING No codeml result for GF00009013 due to no resolved nucleotides codeml.py:291

``

the error log from yesterday is not showing up, so it seems that PAML is working, but the ksd analysis ends with the following messag:

2024-04-25 12:50:16 INFO Saving to wgd_ksd/spu.cds_extracted.fa.tsv.ks.tsv cli.py:515 2024-04-25 12:50:17 INFO Making plots cli.py:517 INFO No valid Ks values for plotting

heche-psb commented 6 months ago

Hi, as I marked in the requirement on bioconda, wgd v2 works with paml 4.9j. May you install the paml 4.9j instead and try again? Normally the length of a coding sequence (cds) should be dividable by three but incomplete cds is being seen from time to time which can be handled well in biopython library and thus in wgd v2. You can do trimming yourself or just leave it with biopython.

Malabady commented 6 months ago

Hi, we installed paml 4.9j, but it didn't solve the problem. In fact, it produced the same error that it produced before installing paml, which is as follows:

multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__ return self.func(*args, **kwargs) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__ return [func(*args, **kwargs) for func, args, kwargs in self.items] File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in <listcomp> return [func(*args, **kwargs) for func, args, kwargs in self.items] File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3221, in _get_ks family.get_ks() File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3068, in get_ks self.get_tree() File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3161, in get_tree tree = self.run_fasttree(options=self.tree_options) File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3175, in run_fasttree out = sp.run(cmd, stdout=sp.PIPE, stderr=sp.PIPE) File "/usr/lib64/python3.6/subprocess.py", line 423, in run with Popen(*popenargs, **kwargs) as process: File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__ restore_signals, start_new_session) File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) FileNotFoundError: [Errno 2] No such file or directory: 'FastTree': 'FastTree'

I even tried on local linux machine. I installed it via pip and got successful installation messages as follows:

Successfully installed Jinja2-2.11.2 KDEpy-1.1.0 MarkupSafe-1.1.1 Pillow-8.4.0 PyQt5_sip-12.13.0 PyYAML-5.3.1 Pygments-2.7.2 attrs-20.3.0 biopython-1.76 click-7.1.2 colorama-0.4.4 commonmark-0.9.1 cycler-0.10.0 ete3-3.1.3 fastcluster-1.1.28 humanfriendly-8.2 importlib-metadata-3.1.0 iniconfig-1.1.1 joblib-0.11 kiwisolver-1.2.0 matplotlib-3.2.2 numexpr-2.8.6 packaging-20.4 pandas-1.4.4 pluggy-0.13.1 plumbum-1.6.9 progressbar2-3.51.4 py-1.9.0 pyparsing-2.4.7 pyqt5-5.12.3 pyqtwebengine-5.12.1 pytest-6.1.2 python-dateutil-2.9.0.post0 python-utils-2.4.0 pytz-2020.1 rich-12.5.1 scikit-learn-0.23.1 scikit-learn-extra-0.2.0 scipy-1.5.4 seaborn-0.10.1 six-1.15.0 threadpoolctl-2.1.0 toml-0.10.2 tornado-6.0.4 tqdm-4.66.2 typing_extensions-4.11.0 zipp-3.4.0
->wgd==2.0.33) (12.13.0)
Building wheels for collected packages: wgd
  Building wheel for wgd (setup.py) ... done
  Created wheel for wgd: filename=wgd-2.0.33-py3-none-any.whl size=197849 sha256=ff51712e93564fa31da40273677ff49a7b80eed5ac4bb31017c44dca95674da3
  Stored in directory: /tmp/pip-ephem-wheel-cache-t_n4t7yp/wheels/32/db/48/ecef9106010a0615069e0b61866d6d5ed83cc5d83710ced609
Successfully built wgd
Installing collected packages: wgd
Successfully installed wgd-2.0.33

but when I started wgd dmd, i get the following errors:

Traceback (most recent call last):
  File "/home/malabady/mambaforge/envs/wgd/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 117, in dmd
    _dmd(**kwargs)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 148, in _dmd
    s[0].get_paranome(inflation=inflation, eval=eval)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/core.py", line 458, in get_paranome
    mcl_out = gf.run_mcl(inflation=inflation)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/core.py", line 546, in run_mcl
    out = sp.run(command, stdout=sp.PIPE, stderr=sp.PIPE)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/subprocess.py", line 1720, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mcxload'

I even tried to install from the bioconda channel conda install wgd -c bioconda, but it didn't install all dependencies. And when tried to just wgd -h, it kept on complaining about every single module in the requirement file.

heche-psb commented 6 months ago

Hi, the log information "FileNotFoundError: [Errno 2] No such file or directory: 'FastTree': 'FastTree'" indicates that you haven't installed the software fasttree. Besides fasttree, mafft is also required for wgd ksd. The log information "FileNotFoundError: [Errno 2] No such file or directory: 'mcxload'" indicates that you haven't installed the software mcl. I have just added note for the dependency of each program in the README. The bioconda should work if you set a separate virtual environment for wgd v2 because your pre-installed python packages might complain about the incompatibility for the requirement of wgd v2 as you mentioned.

Malabady commented 6 months ago

Hello .. all my attempts were in conda environment specific for this program. For now, I want to understand whether the following "warning" indicates an installation issues:

2024-04-26 11:49:28 WARNING Stripped alignment length == 0 for GF00000034 codeml.py:274 INFO Analysing family GF00000037

heche-psb commented 6 months ago

Hi, the log is not an indication of any installation issue but a normal log informing that the stripped alignment (i.e., after dropping all gap-containing sites) length is zero such that no codeml analysis will be conducted upon this gene family, which is often seen in big gene families where the extent to which family members diverge with each other is large.

Malabady commented 6 months ago

Thank you! I think it might be working at last. the wgd ksd has been running for a couple of hours now (forgot to increase the # processors) with only some of those warnings. In the previous attempts, it used to throw errors within a few minutes.

Assuming it will finish successfully this time, do you think the default command wgd ksd Families is sufficient for whole panarome analysis?

heche-psb commented 6 months ago

Hi, yes I think it will be sufficient and note that increasing the number of threads will significantly speed up the process because it is equal to the number of gene families that can be analyzed by codeml at the same time.

Malabady commented 6 months ago

Hello, If I am running one species (whole panarome), can the wgd peak be used without the following options:

-ap, --anchorpoints, the anchor points datafile, default None -sm, --segments, the segments datafile, default None -le, --listelements, the listsegments datafile, default None -mp, --multipliconpairs, the multipliconpairs datafile, default None

By now I have got all parts of the pipeline to work. Many thanks for your help,

Malabady commented 6 months ago

Another question: in the wgd syn analysis, which ks results is preferred? I see that ksd, mix, and peak produce a version of the ks calculations. I am assuming the one from the peak analysis is the most optimized and should be used. Please correct me if I am wrong?

heche-psb commented 6 months ago

Hi, wgd peak is mainly for the purpose of finding crediable range of anchor Ks for WGD dating (either in the high mass region of each component or the 95% CI of the lognormal component, and either in the manner of original Ks or segment Ks) via our peak-finding and clustering pipeline thus it's better to provide all those collinearity results to wgd peak, while wgd mix is mainly inherited from wgd v1 for GMM analysis which only requires a Ks datafile. But neither of them produces Ks data, which can only be done by wgd ksd. You may provide the Ks result from wgd ksd to wgd syn.

Malabady commented 6 months ago

Hello: when I provide the collinearity results to wgd peak, i get an error saying list of index out of range, see below. I don't get this error when I run wgd peak without collinearity results. so, I assume the error has to do with some of the synteny files.

$ wgd peak wgd_ksd/Spu.cds.ks \
--anchorpoints wgd_syn/iadhore-out/anchorpoints.txt \
--segments wgd_syn/iadhore-out/segments.txt \
--listelements wgd_syn/iadhore-out/list_elements.txt \
--multipliconpairs wgd_syn/iadhore-out/multiplicon_pairs.txt \
--weighted
2024-04-28 10:49:27 INFO     This is wgd v2.0.33                                                                                                             cli.py:34
                    INFO     Checking cores and threads...                                                                                                  core.py:35
                    INFO     The number of logical CPUs/Hyper Threading in the system: 12                                                                   core.py:36
                    INFO     The number of physical cores in the system: 12                                                                                 core.py:37
                    INFO     The number of actually usable CPUs in the system: 12                                                                           core.py:38
                    INFO     Checking memory...                                                                                                             core.py:40
                    INFO     Total physical memory: 47.0204 GB                                                                                              core.py:41
                    INFO     Available memory: 39.5765 GB                                                                                                   core.py:42
                    INFO     Free memory: 30.4002 GB                                                                                                        core.py:43
2024-04-28 10:49:28 INFO     Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data                                                                 peak.py:1742
2024-04-28 10:49:29 INFO     Fitting GMM with 1 components                                                                                                 peak.py:754
2024-04-28 10:49:30 INFO     Convergence reached                                                                                                           peak.py:757
                    INFO     Component 1 has mean 0.167, std 2.625, weight 1.000, precision 0.145                                                           peak.py:68
                    INFO     Fitting GMM with 2 components                                                                                                 peak.py:754
2024-04-28 10:49:33 INFO     Convergence reached                                                                                                           peak.py:757
                    INFO     Component 1 has mean 1.305, std 0.667, weight 0.461, precision 2.247                                                           peak.py:68
                    INFO     Component 2 has mean 0.029, std 2.391, weight 0.539, precision 0.175                                                           peak.py:68
                    INFO     Fitting GMM with 3 components                                                                                                 peak.py:754
2024-04-28 10:49:36 INFO     Convergence reached                                                                                                           peak.py:757
                    INFO     Component 1 has mean 0.034, std 1.498, weight 0.441, precision 0.446                                                           peak.py:68
                    INFO     Component 2 has mean 1.275, std 0.690, weight 0.512, precision 2.099                                                           peak.py:68
                    INFO     Component 3 has mean 0.000, std 0.363, weight 0.048, precision 7.582                                                           peak.py:68
                    INFO     Fitting GMM with 4 components                                                                                                 peak.py:754
2024-04-28 10:49:40 INFO     Convergence reached                                                                                                           peak.py:757
                    INFO     Component 1 has mean 1.250, std 0.696, weight 0.529, precision 2.065                                                           peak.py:68
                    INFO     Component 2 has mean 0.014, std 1.158, weight 0.253, precision 0.746                                                           peak.py:68
                    INFO     Component 3 has mean 0.000, std 0.367, weight 0.048, precision 7.406                                                           peak.py:68
                    INFO     Component 4 has mean 0.093, std 0.804, weight 0.169, precision 1.549                                                           peak.py:68
                    INFO     The best fitted model via AIC is with 3 components                                                                            peak.py:763
                    INFO     Rules-of-thumb (Burnham & Anderson, 2002) compares the AIC-best model and remaining:                                           peak.py:79
                    INFO     model with 1 components gets essentially no support comparing to the AIC-best model                                            peak.py:97
                    INFO     model with 2 components gets essentially no support comparing to the AIC-best model                                            peak.py:97
                    INFO     model with 4 components gets essentially no support comparing to the AIC-best model                                            peak.py:97
                    INFO     The best fitted model via BIC is with 3 components                                                                            peak.py:765
                    INFO     Rules-of-thumb (Kass & Raftery, 1995) evaluates the outperformance of the BIC-best model over remaining:                      peak.py:101
                    INFO     Such outperformance is very strongly evidenced for model with 1 components                                                    peak.py:113
                    INFO     Such outperformance is very strongly evidenced for model with 2 components                                                    peak.py:113
                    INFO     Such outperformance is very strongly evidenced for model with 4 components                                                    peak.py:113
Warning: Ignoring XDG_SESSION_TYPE=wayland on Gnome. Use QT_QPA_PLATFORM=wayland to run on Wayland anyway.
2024-04-28 10:49:46 INFO     Detecting likely peaks from node-weighted data of 1-Components Model Peak 0                                                  peak.py:1316
2024-04-28 10:49:53 INFO     Detecting likely peaks from node-weighted data of 2-Components Model Peak 0                                                  peak.py:1316
                    INFO     Detecting likely peaks from node-weighted data of 2-Components Model Peak 1                                                  peak.py:1316
2024-04-28 10:50:03 INFO     Detecting likely peaks from node-weighted data of 3-Components Model Peak 0                                                  peak.py:1316
                    INFO     Detecting likely peaks from node-weighted data of 3-Components Model Peak 1                                                  peak.py:1316
                    INFO     Detecting likely peaks from node-weighted data of 3-Components Model Peak 2                                                  peak.py:1316
Traceback (most recent call last):
  File "~/mambaforge/envs/wgd/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 351, in peak
    _peak(**kwargs)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 372, in _peak
    df_ap_mp = fit_apgmm_guide(hdr,guide,anchorpoints,ksdf,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,segment=segments,multipliconpairs=multipliconpairs,listelement=listelements,cutoff = kscutoff,user_xlim=xlim,user_ylim=ylim,peak_threshold=prominence_cutoff,rel_height=rel_height,keeptmp=keeptmpfig)
  File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1798, in fit_apgmm_guide
    get_peak_SegGuidedGMM(guide,kdes,scalings,fig2,ax2,df_c,n,outdir,peak_threshold,weighted,rel_height,keeptmp=keeptmp)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1865, in get_peak_SegGuidedGMM
    lower_CI, upper_CI = find_apeak(df,None,None,outdir,peak_threshold=peak_threshold,na=na,rel_height=rel_height,withfig=False,loglabel="{}-Components Model Peak {}".format(n,cp),keeptmp=keeptmp)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1324, in find_apeak
    lower95CI,upper95CI = noplot_95CI_lognorm_hist(init_means, init_stdevs, ci=ci)
  File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1481, in noplot_95CI_lognorm_hist
    return CI_95s[0][0],CI_95s[0][1]
IndexError: list index out of range
heche-psb commented 6 months ago

Hi, I didn't encounter this type of error on my test dataset before. Could you please share me with your input files? I will try to reproduce your error.

Malabady commented 6 months ago

Yes. Please find the attached files. toShare.tar.gz

heche-psb commented 6 months ago

Hi, the segments.txt file is missing.

Malabady commented 6 months ago

Sorry.. added it

toShare.tar.gz

heche-psb commented 6 months ago

Hi, I did reproduce your error, which is due to an unanticipated failure of finding peak in the segment-guided Ks GMM analysis when it is mapped back to anchor pairs. I have now fixed the error and updated it in v2.0.34. Please install the latest version and try again.

Malabady commented 6 months ago

Many thanks. will the installation still work with conda? do I need to rebuild the environment again?

Malabady commented 6 months ago

also, do I need redo the ks, dmd, syn, and mix analyses?

heche-psb commented 6 months ago

Since I only modified the concerned part of code, you only need to update the version of wgd while the environment that you built previously should remain unchanged. No additional analysis is needed because it only affects that specific analysis in wgd peak. The updated version in bioconda should be available within one day, while on PYPI or from source code here it's there already.

Malabady commented 6 months ago

The updated version v2.0.34 worked at my end. so, no issues.

In your example of Aquilegia coerulea, you used the --globalmrbh method to delineate the orthologos. I assume you could have done the same with Orthofinder. Is there a particular reason for using global MRBH?

heche-psb commented 6 months ago

Hi, OrthoFinder2 has two major steps in orthogroup delineation, first performing all-against-all sequence similarity search with gene length-normalization and defining seed orthologues as RBHs (just as InParanoid does), second conducting pre-clustering filtering over low-quality edges (homologous relationship that is assumed to exist only between gene family clusters rather than within clusters) to get the preliminary graph and then carrying out the graph-based clustering using MCL. Throughout the whole process, the most credible orthologues are RBHs, underpinned by the highest sequence similarity reciprocally. Thus, when it comes to the construction of orthologous Ks distribution, RBHs (for instance global MRBHs) should be the stable and reliable proxy instead of orthologues which are delineated upon them conditioning on specific parameters for instance inflation factor and certain genomes/transcriptomes upon the effect of seed orthologoues.

Malabady commented 6 months ago

Hi, thanks for the clarification. I run wgd dmd --globalmrbh on four cds dataset from my species with focus -f on my species. it produces 4848 gene families. then I ran ''wgd ksd`` with 12 threads but the run stock at analyzing family #3211. I cleaned the space and rerun the analysis again couple times, but it always gets stuck at this particular family number. the run doesn't abort or anything, it just stays at this step, see below, for days. I check the processors and all threads associated with the have an "S+" status.

2024-05-01 12:31:18 INFO Analysing family GF00003209 core.py:3064 2024-05-01 12:31:22 INFO Analysing family GF00003210 core.py:3064 2024-05-01 12:31:26 INFO Analysing family GF00003211 core.py:3064 Any suggestions?

heche-psb commented 6 months ago

Hi, I haven't encoutered the condition that the wgd ksd program hangs simply because of one gene family. Normally the program just ends or casts errors. I guess it might be due to memory overflow but also likely due to that particular gene family GF00003211 as you mentioned. So as to debug, I would suggest you of using only one thread but increase the memory. If with sufficient memory it still fails, then we shall look detailedly into the data, since no warning or error coming from the program itself while more than 3000 gene families actually worked fine.

Malabady commented 6 months ago

I think the problem has to do the memory allocation. When I run the job with -n 12, it stopped at family 3211. but when I run it with -n 1 or give no specific n numbers, it stops at family 1350. See the following psutil log:

`` 08:48:01 $ wgd ksd wgd_globalmrbh/global_MRBH.tsv Spu.cds Dlo.cds Sly.cds Cfo.cds -o wgd_globalmrbh_ks 2024-05-03 08:48:26 INFO This is wgd v2.0.34 cli.py:34 INFO Checking cores and threads... core.py:35 INFO The number of logical CPUs/Hyper Threading in the system: 12 core.py:36 INFO The number of physical cores in the system: 12 core.py:37 INFO The number of actually usable CPUs in the system: 12 core.py:38 INFO Checking memory... core.py:40 INFO Total physical memory: 47.0204 GB core.py:41 INFO Available memory: 36.5698 GB core.py:42 INFO Free memory: 19.2054 GB core.py:43 /home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/Bio/Seq.py:2855: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future. warnings.warn( 2024-05-03 08:49:15 INFO tmpdir = wgdtmp_f845e8c0-8fd0-4f17-95c1-9d150fd037e2 cli.py:492 2024-05-03 08:49:17 INFO 4 threads are used for 4848 gene families core.py:3241


the machine has 12 cores and 48GB RAM. 

Thoughts?
heche-psb commented 6 months ago

Hi, I think setting the number of thread as 1 and allocating the job with sufficient memory is the solution.

Malabady commented 6 months ago

I don't see any memory flags in the command line options. is there a way to calculate there required memory based on the OG families?

heche-psb commented 6 months ago

This step involves external softwares for instance mafft, fasttree, paml, which makes it hard to give an accurate estimation for the required memory. So far wgd v2 doesn't provide option to set the memory yet, which can only be set in the script of job submitting.