Closed: Malabady closed this issue 5 months ago.
Hi, are you after a whole paranome? A simple command `wgd dmd $file` should do that.
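For example, a minimal sketch (assuming your CDS FASTA file is called `cds.fa`; substitute your own file name):

```bash
# delineate the whole paranome from a single CDS FASTA file
wgd dmd cds.fa
```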
Hi. Many thanks for the quick reply. Yes, I am trying to do a whole paranome. I ran the command `wgd dmd cds.fa`, but it didn't produce a "families" file. Do I need to pass the dmd output directory to ksd?
Hi, the "families" file is actually what you get as the result of the `wgd dmd $file` command.
Thanks a lot for your prompt answers. When I run `wgd dmd cds.fa`, it produces `cds.fa.tsv`, but not `families.mcl`. In `wgd ksd -h`, the whole-paranome example given is `wgd ksd families.mcl cds.fasta`; that's why I was looking for a file explicitly named `families.mcl`. So, if I understand you correctly, the file `cds.fa.tsv` is the same as `families.mcl`, correct?

My goal for this analysis is to identify and date the WGD event(s) in my genome. Toward this goal, do you think using the protein sequences will perform better than the CDS sequences?
I went ahead and tested `wgd ksd pep.cds.tsv cds.fa`, but got many of the following warnings:

```
2024-04-24 19:14:17 WARNING Stripped alignment length == 0 for GF00000004 codeml.py:274
INFO Analysing family GF00000005 core.py:3064
```

followed by the following error:

```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3221, in _get_ks
    family.get_ks()
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3066, in get_ks
    self.run_codeml()
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3149, in run_codeml
    preserve=True, times=self.codeml_iter)
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/codeml.py", line 288, in run_codeml
    results = _run_codeml(self.exe, self.control_file, self.out_file, **kwargs)
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/codeml.py", line 126, in _run_codeml
    sp.run([exe, control_file], stdout=sp.PIPE)
  File "/usr/lib64/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'codeml': 'codeml'
"""
```

I am assuming this has to do with the families. Correct?
To construct a Ks distribution, a CDS file is essential. A protein file can optionally be used in the final phylogenetic dating step. The `cds.fa.tsv` is the whole-paranome family file that you can give to `wgd ksd` for the construction of the Ks distribution.
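For example, a sketch with the file names from your run above (adjust the paths to your own data):

```bash
# whole-paranome Ks distribution from the dmd family file and the CDS file
wgd ksd cds.fa.tsv cds.fa
```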
The log seems to indicate that you didn't preinstall `paml`, i.e. the `codeml` executable is not found in your environment. You may install wgd v2 via bioconda to easily fulfill the dependency requirements.
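For example, a sketch of a clean bioconda install (the environment name `wgd` is arbitrary, and your channel configuration may differ):

```bash
# create a dedicated environment and install wgd v2 plus its dependencies from bioconda
conda create -n wgd -c bioconda -c conda-forge wgd
conda activate wgd
wgd -h
```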
Thanks. I was thinking of using the protein file with `dmd`, since it has a protein input option and proteins are more suitable for inferring orthogroups; then, with `ksd`, I will use the CDS file.

When I use the CDS as input for `dmd`, I get the following warning:

```
/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/Bio/Seq.py:2859: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
```

Do I need to trim the sequences, or is `wgd` doing that already? This warning is what made me think of using the protein file to generate the whole-paranome families.
For some reason, installing with conda on our HPC is not working, and they had to install the tool from source using the described method:

```bash
git clone https://github.com/heche-psb/wgd
cd wgd
virtualenv -p=python3 ENV   # or: python3 -m venv ENV
source ENV/bin/activate
python setup.py install     # or: python -m pip install -r requirements.txt
pip install .
```
We installed PAML within the wgd environment, which seems to be working. But when I run the `ksd` step, I get tons of the following warnings:

```
INFO Analysing family GF00009014 core.py:3064
2024-04-25 12:49:53 WARNING No codeml result for GF00009011 due to no resolved nucleotides codeml.py:291
INFO Analysing family GF00009015 core.py:3064
2024-04-25 12:49:53 WARNING No codeml result for GF00009012 due to no resolved nucleotides codeml.py:291
INFO Analysing family GF00009016 core.py:3064
2024-04-25 12:49:53 WARNING No codeml result for GF00009013 due to no resolved nucleotides codeml.py:291
```

The error log from yesterday is not showing up, so it seems that PAML is working, but the `ksd` analysis ends with the following message:

```
2024-04-25 12:50:16 INFO Saving to wgd_ksd/spu.cds_extracted.fa.tsv.ks.tsv cli.py:515
2024-04-25 12:50:17 INFO Making plots cli.py:517
INFO No valid Ks values for plotting
```
Hi, as I noted in the requirements on bioconda, wgd v2 works with paml 4.9j. Could you install paml 4.9j instead and try again? Normally the length of a coding sequence (CDS) should be divisible by three, but incomplete CDS are seen from time to time; these are handled well by the biopython library and thus by wgd v2. You can do the trimming yourself or just leave it to biopython.
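For example (assuming a paml 4.9j build is available from bioconda for your platform; otherwise compile PAML yourself and make sure `codeml` is on your PATH):

```bash
# pin paml to 4.9j inside the wgd environment
conda install -n wgd -c bioconda paml=4.9j
# sanity check that the codeml executable is found
which codeml
```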
Hi, we installed paml 4.9j, but it didn't solve the problem. In fact, it produced the same error that it produced before installing paml, which is as follows:

```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3221, in _get_ks
    family.get_ks()
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3068, in get_ks
    self.get_tree()
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3161, in get_tree
    tree = self.run_fasttree(options=self.tree_options)
  File "/apps/gb/WGD/2.0.32/env/wgd/lib/python3.6/site-packages/wgd/core.py", line 3175, in run_fasttree
    out = sp.run(cmd, stdout=sp.PIPE, stderr=sp.PIPE)
  File "/usr/lib64/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'FastTree': 'FastTree'
"""
```
I even tried on a local Linux machine. I installed it via pip and got successful installation messages, as follows:

```
Successfully installed Jinja2-2.11.2 KDEpy-1.1.0 MarkupSafe-1.1.1 Pillow-8.4.0 PyQt5_sip-12.13.0 PyYAML-5.3.1 Pygments-2.7.2 attrs-20.3.0 biopython-1.76 click-7.1.2 colorama-0.4.4 commonmark-0.9.1 cycler-0.10.0 ete3-3.1.3 fastcluster-1.1.28 humanfriendly-8.2 importlib-metadata-3.1.0 iniconfig-1.1.1 joblib-0.11 kiwisolver-1.2.0 matplotlib-3.2.2 numexpr-2.8.6 packaging-20.4 pandas-1.4.4 pluggy-0.13.1 plumbum-1.6.9 progressbar2-3.51.4 py-1.9.0 pyparsing-2.4.7 pyqt5-5.12.3 pyqtwebengine-5.12.1 pytest-6.1.2 python-dateutil-2.9.0.post0 python-utils-2.4.0 pytz-2020.1 rich-12.5.1 scikit-learn-0.23.1 scikit-learn-extra-0.2.0 scipy-1.5.4 seaborn-0.10.1 six-1.15.0 threadpoolctl-2.1.0 toml-0.10.2 tornado-6.0.4 tqdm-4.66.2 typing_extensions-4.11.0 zipp-3.4.0
Building wheels for collected packages: wgd
Building wheel for wgd (setup.py) ... done
Created wheel for wgd: filename=wgd-2.0.33-py3-none-any.whl size=197849 sha256=ff51712e93564fa31da40273677ff49a7b80eed5ac4bb31017c44dca95674da3
Stored in directory: /tmp/pip-ephem-wheel-cache-t_n4t7yp/wheels/32/db/48/ecef9106010a0615069e0b61866d6d5ed83cc5d83710ced609
Successfully built wgd
Installing collected packages: wgd
Successfully installed wgd-2.0.33
```
But when I start `wgd dmd`, I get the following errors:
```
Traceback (most recent call last):
File "/home/malabady/mambaforge/envs/wgd/bin/wgd", line 8, in <module>
sys.exit(cli())
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 117, in dmd
_dmd(**kwargs)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 148, in _dmd
s[0].get_paranome(inflation=inflation, eval=eval)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/core.py", line 458, in get_paranome
mcl_out = gf.run_mcl(inflation=inflation)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/core.py", line 546, in run_mcl
out = sp.run(command, stdout=sp.PIPE, stderr=sp.PIPE)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/subprocess.py", line 493, in run
with Popen(*popenargs, **kwargs) as process:
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/subprocess.py", line 1720, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mcxload'
```
I even tried to install from the bioconda channel with `conda install wgd -c bioconda`, but it didn't install all the dependencies. And when I tried just `wgd -h`, it kept complaining about every single module in the requirements file.
Hi, the log line "FileNotFoundError: [Errno 2] No such file or directory: 'FastTree': 'FastTree'" indicates that you haven't installed the software `fasttree`. Besides `fasttree`, `mafft` is also required for `wgd ksd`. The log line "FileNotFoundError: [Errno 2] No such file or directory: 'mcxload'" indicates that you haven't installed the software `mcl`. I have just added a note on the dependencies of each program in the README. The bioconda installation should work if you set up a separate virtual environment for wgd v2, because your pre-installed python packages might otherwise conflict with the requirements of wgd v2, as you mentioned.
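For example, a sketch of pulling in the missing executables (package names as they appear on bioconda; adjust if your channel set-up differs):

```bash
# fasttree, mafft and mcl provide the FastTree, mafft and mcxload/mcl executables that wgd calls
conda install -n wgd -c bioconda fasttree mafft mcl
# quick sanity check that the executables are on the PATH
which FastTree mafft mcxload
```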
Hello, all my attempts were in a conda environment specific to this program. For now, I want to understand whether the following "warning" indicates an installation issue:

```
2024-04-26 11:49:28 WARNING Stripped alignment length == 0 for GF00000034 codeml.py:274
INFO Analysing family GF00000037
```
Hi, that log is not an indication of any installation issue. It is a normal message informing you that the stripped alignment (i.e., after dropping all gap-containing sites) has length zero, so that no `codeml` analysis will be conducted on this gene family. This is often seen in big gene families whose members have diverged considerably from one another.
Thank you! I think it might be working at last. The `wgd ksd` run has been going for a couple of hours now (I forgot to increase the number of processors) with only some of those warnings. In the previous attempts, it used to throw errors within a few minutes.

Assuming it will finish successfully this time, do you think the default command `wgd ksd Families` is sufficient for the whole-paranome analysis?
Hi, yes, I think it will be sufficient. Note that increasing the number of threads will significantly speed up the process, because it equals the number of gene families that can be analysed by `codeml` at the same time.
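For example, a sketch (here `-n` sets the number of parallel threads; pick a count that matches your machine):

```bash
# run the whole-paranome Ks analysis with 12 codeml jobs in parallel
wgd ksd cds.fa.tsv cds.fa -n 12
```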
Hello,

If I am running one species (whole paranome), can `wgd peak` be used without the following options?

```
-ap, --anchorpoints      the anchor points datafile, default None
-sm, --segments          the segments datafile, default None
-le, --listelements      the listsegments datafile, default None
-mp, --multipliconpairs  the multipliconpairs datafile, default None
```

By now I have got all parts of the pipeline to work. Many thanks for your help.
Another question: in the `wgd syn` analysis, which Ks results are preferred? I see that `ksd`, `mix`, and `peak` each produce a version of the Ks calculations. I am assuming the one from the peak analysis is the most optimized and should be used. Please correct me if I am wrong.
Hi, `wgd peak` is mainly for finding a credible range of anchor Ks values for WGD dating (either the high-mass region of each component or the 95% CI of the lognormal component, based on either the original Ks or the segment Ks) via our peak-finding and clustering pipeline, so it's better to provide all those collinearity results to `wgd peak`. `wgd mix` is mainly inherited from wgd v1 for GMM analysis and only requires a Ks datafile. But neither of them produces Ks data; that can only be done by `wgd ksd`. You may provide the Ks result from `wgd ksd` to `wgd syn`.
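Roughly like this, as a sketch only (the positional arguments and the `-ks` flag here are my assumptions for illustration, and `annotation.gff3` is a placeholder for your structural annotation; please check `wgd syn -h` for the exact usage):

```bash
# synteny analysis, using the Ks values produced by wgd ksd
wgd syn cds.fa.tsv annotation.gff3 -ks wgd_ksd/cds.fa.tsv.ks.tsv
```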
Hello,

When I provide the collinearity results to `wgd peak`, I get an error saying `list index out of range`; see below. I don't get this error when I run `wgd peak` without the collinearity results, so I assume the error has to do with one of the synteny files.
```
$ wgd peak wgd_ksd/Spu.cds.ks \
--anchorpoints wgd_syn/iadhore-out/anchorpoints.txt \
--segments wgd_syn/iadhore-out/segments.txt \
--listelements wgd_syn/iadhore-out/list_elements.txt \
--multipliconpairs wgd_syn/iadhore-out/multiplicon_pairs.txt \
--weighted
2024-04-28 10:49:27 INFO This is wgd v2.0.33 cli.py:34
INFO Checking cores and threads... core.py:35
INFO The number of logical CPUs/Hyper Threading in the system: 12 core.py:36
INFO The number of physical cores in the system: 12 core.py:37
INFO The number of actually usable CPUs in the system: 12 core.py:38
INFO Checking memory... core.py:40
INFO Total physical memory: 47.0204 GB core.py:41
INFO Available memory: 39.5765 GB core.py:42
INFO Free memory: 30.4002 GB core.py:43
2024-04-28 10:49:28 INFO Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data peak.py:1742
2024-04-28 10:49:29 INFO Fitting GMM with 1 components peak.py:754
2024-04-28 10:49:30 INFO Convergence reached peak.py:757
INFO Component 1 has mean 0.167, std 2.625, weight 1.000, precision 0.145 peak.py:68
INFO Fitting GMM with 2 components peak.py:754
2024-04-28 10:49:33 INFO Convergence reached peak.py:757
INFO Component 1 has mean 1.305, std 0.667, weight 0.461, precision 2.247 peak.py:68
INFO Component 2 has mean 0.029, std 2.391, weight 0.539, precision 0.175 peak.py:68
INFO Fitting GMM with 3 components peak.py:754
2024-04-28 10:49:36 INFO Convergence reached peak.py:757
INFO Component 1 has mean 0.034, std 1.498, weight 0.441, precision 0.446 peak.py:68
INFO Component 2 has mean 1.275, std 0.690, weight 0.512, precision 2.099 peak.py:68
INFO Component 3 has mean 0.000, std 0.363, weight 0.048, precision 7.582 peak.py:68
INFO Fitting GMM with 4 components peak.py:754
2024-04-28 10:49:40 INFO Convergence reached peak.py:757
INFO Component 1 has mean 1.250, std 0.696, weight 0.529, precision 2.065 peak.py:68
INFO Component 2 has mean 0.014, std 1.158, weight 0.253, precision 0.746 peak.py:68
INFO Component 3 has mean 0.000, std 0.367, weight 0.048, precision 7.406 peak.py:68
INFO Component 4 has mean 0.093, std 0.804, weight 0.169, precision 1.549 peak.py:68
INFO The best fitted model via AIC is with 3 components peak.py:763
INFO Rules-of-thumb (Burnham & Anderson, 2002) compares the AIC-best model and remaining: peak.py:79
INFO model with 1 components gets essentially no support comparing to the AIC-best model peak.py:97
INFO model with 2 components gets essentially no support comparing to the AIC-best model peak.py:97
INFO model with 4 components gets essentially no support comparing to the AIC-best model peak.py:97
INFO The best fitted model via BIC is with 3 components peak.py:765
INFO Rules-of-thumb (Kass & Raftery, 1995) evaluates the outperformance of the BIC-best model over remaining: peak.py:101
INFO Such outperformance is very strongly evidenced for model with 1 components peak.py:113
INFO Such outperformance is very strongly evidenced for model with 2 components peak.py:113
INFO Such outperformance is very strongly evidenced for model with 4 components peak.py:113
Warning: Ignoring XDG_SESSION_TYPE=wayland on Gnome. Use QT_QPA_PLATFORM=wayland to run on Wayland anyway.
2024-04-28 10:49:46 INFO Detecting likely peaks from node-weighted data of 1-Components Model Peak 0 peak.py:1316
2024-04-28 10:49:53 INFO Detecting likely peaks from node-weighted data of 2-Components Model Peak 0 peak.py:1316
INFO Detecting likely peaks from node-weighted data of 2-Components Model Peak 1 peak.py:1316
2024-04-28 10:50:03 INFO Detecting likely peaks from node-weighted data of 3-Components Model Peak 0 peak.py:1316
INFO Detecting likely peaks from node-weighted data of 3-Components Model Peak 1 peak.py:1316
INFO Detecting likely peaks from node-weighted data of 3-Components Model Peak 2 peak.py:1316
Traceback (most recent call last):
File "~/mambaforge/envs/wgd/bin/wgd", line 8, in <module>
sys.exit(cli())
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 351, in peak
_peak(**kwargs)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/cli.py", line 372, in _peak
df_ap_mp = fit_apgmm_guide(hdr,guide,anchorpoints,ksdf,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,segment=segments,multipliconpairs=multipliconpairs,listelement=listelements,cutoff = kscutoff,user_xlim=xlim,user_ylim=ylim,peak_threshold=prominence_cutoff,rel_height=rel_height,keeptmp=keeptmpfig)
File "/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1798, in fit_apgmm_guide
get_peak_SegGuidedGMM(guide,kdes,scalings,fig2,ax2,df_c,n,outdir,peak_threshold,weighted,rel_height,keeptmp=keeptmp)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1865, in get_peak_SegGuidedGMM
lower_CI, upper_CI = find_apeak(df,None,None,outdir,peak_threshold=peak_threshold,na=na,rel_height=rel_height,withfig=False,loglabel="{}-Components Model Peak {}".format(n,cp),keeptmp=keeptmp)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1324, in find_apeak
lower95CI,upper95CI = noplot_95CI_lognorm_hist(init_means, init_stdevs, ci=ci)
File "~/mambaforge/envs/wgd/lib/python3.8/site-packages/wgd/peak.py", line 1481, in noplot_95CI_lognorm_hist
return CI_95s[0][0],CI_95s[0][1]
IndexError: list index out of range
```
Hi, I haven't encountered this type of error on my test datasets before. Could you please share your input files with me? I will try to reproduce your error.
Yes. Please find the attached files. toShare.tar.gz
Hi, the `segments.txt` file is missing.
Sorry, added it.
Hi, I did reproduce your error, which is due to an unanticipated failure to find a peak in the segment-guided Ks GMM analysis when it is mapped back to the anchor pairs. I have now fixed the error and released the fix in v2.0.34. Please install the latest version and try again.
Many thanks. Will the installation still work with conda? Do I need to rebuild the environment again? Also, do I need to redo the ks, dmd, syn, and mix analyses?
Since I only modified the relevant part of the code, you only need to update the version of `wgd`; the environment that you built previously can remain unchanged. No additional analysis needs to be redone, because the fix only affects that specific analysis in `wgd peak`. The updated version should be available on bioconda within one day, while on PyPI, or from the source code here, it is already available.
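For example, inside the existing environment (pick whichever matches how you installed; the explicit version pin is just an illustration):

```bash
# from PyPI
pip install --upgrade wgd          # or: pip install wgd==2.0.34
# or, from a source checkout of this repository
git pull && pip install .
```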
The updated version v2.0.34 worked on my end, so no issues.
In your example of Aquilegia coerulea, you used the `--globalmrbh` method to delineate the orthologues. I assume you could have done the same with OrthoFinder. Is there a particular reason for using global MRBH?
Hi, OrthoFinder2 has two major steps in orthogroup delineation: first, it performs an all-against-all sequence similarity search with gene-length normalization and defines seed orthologues as RBHs (just as InParanoid does); second, it conducts a pre-clustering filtering of low-quality edges (homologous relationships that are assumed to exist only between gene family clusters rather than within them) to obtain a preliminary graph, and then carries out graph-based clustering using MCL. Throughout the whole process, the most credible orthologues are the RBHs, underpinned by the highest reciprocal sequence similarity. Thus, when it comes to constructing an orthologous Ks distribution, RBHs (for instance global MRBHs) should be the stable and reliable proxy, rather than orthogroups that are delineated on top of them and that depend on specific parameters, for instance the inflation factor, and on which genomes/transcriptomes are included, through the effect of the seed orthologues.
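For example, a global-MRBH run across four species might look roughly like this (the file names are placeholders, and the exact form of the focus option is best checked with `wgd dmd -h`):

```bash
# global MRBH families across four CDS files; -o names the output directory
# add -f to mark the focal species (see `wgd dmd -h` for the argument it expects)
wgd dmd --globalmrbh focal.cds other1.cds other2.cds other3.cds -o wgd_globalmrbh
```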
Hi, thanks for the clarification. I ran `wgd dmd --globalmrbh` on four CDS datasets, with the focus (`-f`) on my species, and it produced 4848 gene families. Then I ran `wgd ksd` with 12 threads, but the run got stuck at analysing family #3211. I cleaned the space and reran the analysis a couple of times, but it always gets stuck at this particular family. The run doesn't abort or anything; it just stays at this step (see below) for days. I checked the processes, and all threads associated with the run have an "S+" status.

```
2024-05-01 12:31:18 INFO Analysing family GF00003209 core.py:3064
2024-05-01 12:31:22 INFO Analysing family GF00003210 core.py:3064
2024-05-01 12:31:26 INFO Analysing family GF00003211 core.py:3064
```

Any suggestions?
Hi, I haven't encountered the situation where the `wgd ksd` program hangs simply because of one gene family; normally the program just ends or throws errors. I guess it might be due to memory overflow, but it could also be due to that particular gene family GF00003211, as you mentioned. To debug, I would suggest using only one thread but increasing the memory. If it still fails with sufficient memory, then we shall look into the data in detail, since no warning or error is coming from the program itself and more than 3000 gene families actually worked fine.
I think the problem has to do with memory allocation. When I run the job with `-n 12`, it stops at family 3211, but when I run it with `-n 1`, or give no specific `-n` value, it stops at family 1350. See the following log:

```
08:48:01 $ wgd ksd wgd_globalmrbh/global_MRBH.tsv Spu.cds Dlo.cds Sly.cds Cfo.cds -o wgd_globalmrbh_ks
2024-05-03 08:48:26 INFO This is wgd v2.0.34 cli.py:34
INFO Checking cores and threads... core.py:35
INFO The number of logical CPUs/Hyper Threading in the system: 12 core.py:36
INFO The number of physical cores in the system: 12 core.py:37
INFO The number of actually usable CPUs in the system: 12 core.py:38
INFO Checking memory... core.py:40
INFO Total physical memory: 47.0204 GB core.py:41
INFO Available memory: 36.5698 GB core.py:42
INFO Free memory: 19.2054 GB core.py:43
/home/malabady/mambaforge/envs/wgd/lib/python3.8/site-packages/Bio/Seq.py:2855: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  warnings.warn(
2024-05-03 08:49:15 INFO tmpdir = wgdtmp_f845e8c0-8fd0-4f17-95c1-9d150fd037e2 cli.py:492
2024-05-03 08:49:17 INFO 4 threads are used for 4848 gene families core.py:3241
```

The machine has 12 cores and 48 GB RAM. Thoughts?
Hi, I think setting the number of threads to 1 and allocating sufficient memory to the job is the solution.
I don't see any memory flags in the command-line options. Is there a way to calculate the required memory based on the OG families?
This step involves external software, for instance `mafft`, `fasttree`, and `paml`, which makes it hard to give an accurate estimate of the required memory. So far wgd v2 doesn't provide an option to set the memory; it can only be set in the job submission script.
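For example, a minimal job-script sketch (SLURM is an assumption here, as are the resource values; adapt them to your scheduler and HPC policies):

```bash
#!/bin/bash
#SBATCH --job-name=wgd_ksd
#SBATCH --cpus-per-task=1      # one thread, as suggested above
#SBATCH --mem=48G              # request as much memory as the node allows
#SBATCH --time=72:00:00

source activate wgd            # or `conda activate wgd`, depending on your setup
wgd ksd wgd_globalmrbh/global_MRBH.tsv Spu.cds Dlo.cds Sly.cds Cfo.cds -n 1 -o wgd_globalmrbh_ks
```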
Hi, I am trying to analyse a single species using wgd v2. In the gene family delineation step, I am not sure how to generate the families file that I need for the Ks step. I tried all relevant `wgd dmd` options, but none of them writes a single "families" file. I am sure I am missing something here. I'd really appreciate your answers.

Best,