Open RohanNathHERE opened 5 days ago
Hi, the fact that you got normal genes.txt
output while empty anchorpoints.txt
means, in my first thought, that there was no collinearity inferred by i-adhore
. Did you try to use other collinearity inference algorithm for instance MCScan
to inspect the colliearity? It could be that the intra-collinearity of your genome is scarce.
Thank you, I understand the issue now. However, in a few species, I encounter this error:
2024-09-25 17:58:54 INFO This is wgd v2.0.38 cli.py:34
INFO Checking cores and threads... core.py:35
INFO The number of logical CPUs/Hyper Threading in the system: 80 core.py:36
INFO The number of physical cores in the system: 20 core.py:37
INFO The number of actually usable CPUs in the system: 80 core.py:38
INFO Checking memory... core.py:40
INFO Total physical memory: 251.5101 GB core.py:41
INFO Available memory: 227.2910 GB core.py:42
INFO Free memory: 47.3290 GB core.py:43
2024-09-25 17:58:55 INFO Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data peak.py:1754
2024-09-25 17:58:56 INFO Fitting GMM with 1 components peak.py:760
INFO Convergence reached peak.py:763
INFO Component 1 has mean 0.053, std 0.644, weight 1.000, precision 2.411 peak.py:68
INFO Fitting GMM with 2 components peak.py:760
2024-09-25 17:58:57 INFO Convergence reached peak.py:763
INFO Component 1 has mean 0.014, std 0.001, weight 0.067, precision 1000000.000 peak.py:68
INFO Component 2 has mean 0.058, std 0.550, weight 0.933, precision 3.307 peak.py:68
INFO Fitting GMM with 3 components peak.py:760
INFO Convergence reached peak.py:763
INFO Component 1 has mean 0.053, std 0.463, weight 0.867, precision 4.671 peak.py:68
INFO Component 2 has mean 0.014, std 0.001, weight 0.067, precision 1000000.000 peak.py:68
INFO Component 3 has mean 0.186, std 0.001, weight 0.067, precision 1000000.000 peak.py:68
INFO Fitting GMM with 4 components peak.py:760
2024-09-25 17:58:58 INFO Convergence reached peak.py:763
INFO Component 1 has mean 0.078, std 0.282, weight 0.416, precision 12.562 peak.py:68
INFO Component 2 has mean 0.014, std 0.001, weight 0.067, precision 1000000.000 peak.py:68
INFO Component 3 has mean 0.038, std 0.290, weight 0.451, precision 11.868 peak.py:68
INFO Component 4 has mean 0.186, std 0.001, weight 0.067, precision 1000000.000 peak.py:68
INFO The best fitted model via AIC is with 3 components peak.py:769
INFO Rules-of-thumb (Burnham & Anderson, 2002) compares the AIC-best model and remaining: peak.py:79
INFO model with 1 components gets few support comparing to the AIC-best model peak.py:95
INFO model with 2 components gets considerably less support comparing to the AIC-best model peak.py:93
INFO model with 4 components gets considerably less support comparing to the AIC-best model peak.py:93
INFO The best fitted model via BIC is with 3 components peak.py:771
INFO Rules-of-thumb (Kass & Raftery, 1995) evaluates the outperformance of the BIC-best model over remaining: peak.py:101
INFO Such outperformance is positively evidenced for model with 1 components peak.py:109
INFO Such outperformance is positively evidenced for model with 2 components peak.py:109
INFO Such outperformance is strongly evidenced for model with 4 components peak.py:111
2024-09-25 17:59:00 INFO Detecting likely peaks from node-averaged data of 1-Components Model Peak 0 peak.py:1323
2024-09-25 17:59:02 INFO Detecting likely peaks from node-averaged data of 2-Components Model Peak 0 peak.py:1323
INFO Detecting likely peaks from node-averaged data of 2-Components Model Peak 1 peak.py:1323
2024-09-25 17:59:05 INFO Detected one component with less than 2 valid elements, will skip it peak.py:561
INFO Detected one component with less than 2 valid elements, will skip it peak.py:561
INFO Detecting likely peaks from node-averaged data of 3-Components Model Peak 0 peak.py:1323
INFO Detecting likely peaks from node-averaged data of 3-Components Model Peak 1 peak.py:1323
mc2024-09-25 17:59:08 INFO Detected one component with less than 2 valid elements, will skip it peak.py:561
2024-09-25 17:59:09 INFO Detected one component with less than 2 valid elements, will skip it peak.py:561
INFO Detecting likely peaks from node-averaged data of 4-Components Model Peak 0 peak.py:1323
INFO Detecting likely peaks from node-averaged data of 4-Components Model Peak 1 peak.py:1323
INFO Detecting likely peaks from node-averaged data of 4-Components Model Peak 2 peak.py:1323
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py:1831: RuntimeWarning: divide by zero encountered in log
X_log = np.log(X).reshape(-1, 1)
INFO Gaussian Mixture Modeling (GMM) on Log-scale original anchor Ks data peak.py:1833
INFO Fitting GMM with 1 components peak.py:760
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
sys.exit(cli())
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
_peak(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 373, in _peak
df_ap = fit_apgmm_ap(hdr,anchorpoints,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,cutoff = kscutoff,peak_threshold=prominence_cutoff,rel_height=rel_height,user_xlim=xlim,user_ylim=ylim)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1834, in fit_apgmm_ap
if method == 'gmm': models, aic, bic, besta, bestb, N = fit_gmm(out_file, X_log, seed, components[0], components[1], em_iter=em_iter, n_init=n_init)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 761, in fit_gmm
models[i-n1] = mixture.GaussianMixture(n_components = i, covariance_type='full', max_iter = em_iter, n_init = n_init, random_state = seed).fit(X)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 193, in fit
self.fit_predict(X, y)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 220, in fit_predict
X = _check_X(X, self.n_components, ensure_min_samples=2)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 52, in _check_X
X = check_array(X, dtype=[np.float64, np.float32],
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 645, in check_array
_assert_all_finite(array,
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 97, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could you please explain why this occurs and how I can avoid it?
Hi, it's due to the scarce of available anchor Ks values. How many anchor pairs are there and how many of them have Ks values?
I found around 76 rows in the anchorpoints.txt
file; I guess is_real_anchorpoint = -1
denotes that it is not a real anchorpoint. I’ve attached my result zip file for your review. Did I make any mistakes or misunderstand anything?
Hi, the number of anchor pairs is 76, meaning that the number of anchor pairs with Ks within the range from 0 to 5 is even smaller. It's because either the WGD event is very ancient (for instance older than 150 mya) or the WGD retention rate is very low (for instance lower than 0.01), or both. In that case, I will make sure every anchor pair has its Ks value, rather than NaN
. You can specifically calculate the Ks for those anchor pairs with NaN
Ks values using the pairwise mode in wgd ksd
.
Thanks! That makes sense. I'll give it a try. However, I'm encountering another error while running wgd peak
for another species:
2024-09-25 20:09:01 INFO This is wgd v2.0.38 cli.py:34
INFO Checking cores and threads... core.py:35
INFO The number of logical CPUs/Hyper Threading in the system: 80 core.py:36
INFO The number of physical cores in the system: 20 core.py:37
INFO The number of actually usable CPUs in the system: 80 core.py:38
INFO Checking memory... core.py:40
INFO Total physical memory: 251.5101 GB core.py:41
INFO Available memory: 226.8054 GB core.py:42
INFO Free memory: 41.8755 GB core.py:43
INFO Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data peak.py:1754
INFO Fitting GMM with 1 components peak.py:760
2024-09-25 20:09:02 INFO Convergence reached peak.py:763
INFO Component 1 has mean 0.778, std 0.086, weight 1.000, precision 136.689 peak.py:68
INFO Fitting GMM with 2 components peak.py:760
INFO Convergence reached peak.py:763
INFO Component 1 has mean 0.714, std 0.001, weight 0.500, precision 1000000.000 peak.py:68
INFO Component 2 has mean 0.847, std 0.001, weight 0.500, precision 1000000.000 peak.py:68
INFO Fitting GMM with 3 components peak.py:760
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
sys.exit(cli())
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
_peak(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 372, in _peak
df_ap_mp = fit_apgmm_guide(hdr,guide,anchorpoints,ksdf,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,segment=segments,multipliconpairs=multipliconpairs,listelement=listelements,cutoff = kscutoff,user_xlim=xlim,user_ylim=ylim,peak_threshold=prominence_cutoff,rel_height=rel_height,keeptmp=keeptmpfig)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1773, in fit_apgmm_guide
if method == 'gmm': models, aic, bic, besta, bestb, N = fit_gmm(out_file, X_log, seed, components[0], components[1], em_iter=em_iter, n_init=n_init)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 761, in fit_gmm
models[i-n1] = mixture.GaussianMixture(n_components = i, covariance_type='full', max_iter = em_iter, n_init = n_init, random_state = seed).fit(X)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 193, in fit
self.fit_predict(X, y)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 220, in fit_predict
X = _check_X(X, self.n_components, ensure_min_samples=2)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 55, in _check_X
raise ValueError('Expected n_samples >= n_components '
ValueError: Expected n_samples >= n_components but got n_components = 3, n_samples = 2
It seems that the number of samples is insufficient for the number of components in the Gaussian Mixture Model. Do you have any suggestions on how to resolve this issue?
Hi, it showed that there were only two segment-wise Ks values, which made it impossible to cluster them into 3 clusters. It's also linked with scarce collinearity.
Thanks! I encountered another error while running wgd peak
, which seems to be related to the Gaussian Mixture Model and silhouette score calculation:
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
sys.exit(cli())
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
_peak(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 372, in _peak
df_ap_mp = fit_apgmm_guide(hdr, guide, anchorpoints, ksdf, ksdf_filtered, seed, components, em_iter, n_init, outdir, method, gamma, weighted, plot, segment=segments, multipliconpairs=multipliconpairs, listelement=listelements, cutoff=kscutoff, user_xlim=xlim, user_ylim=ylim, peak_threshold=prominence_cutoff, rel_height=rel_height, keeptmp=keeptmpfig)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1776, in fit_apgmm_guide
plot_silhouette_score(X_log, components[0]+1, components[1], [m.predict(X_log) for m in models][1:], outdir, guide+'_Ks', 'GMM')
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 148, in plot_silhouette_score
scores = [metrics.silhouette_score(X, label) for label in labels]
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 148, in <listcomp>
scores = [metrics.silhouette_score(X, label) for label in labels]
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 229, in silhouette_samples
check_number_of_labels(len(le.classes_), n_samples)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 34, in check_number_of_labels
raise ValueError("Number of labels is %d. Valid values are 2 to n_samples - 1 (inclusive)" % n_labels)
ValueError: Number of labels is 4. Valid values are 2 to n_samples - 1 (inclusive).
This error indicates that there are four labels in the dataset, but the number of samples is insufficient for this configuration. Specifically, the number of valid labels must be between 2 and n_samples−1
. Since the dataset may not have enough samples, I’m unsure how to proceed. Do you have any advice on resolving this?
Hi, I think the problem is also at the only two available segment-wise Ks. If you set the number of components as --components 1 1
or --components 1 2
, it might be solved.
Thanks for the clarification! I believe this error is due to the same reasons as listed above?
2024-09-25 20:23:55 INFO This is wgd v2.0.38 cli.py:34
INFO Checking cores and threads... core.py:35
INFO The number of logical CPUs/Hyper Threading in the system: 80 core.py:36
INFO The number of physical cores in the system: 20 core.py:37
INFO The number of actually usable CPUs in the system: 80 core.py:38
INFO Checking memory... core.py:40
INFO Total physical memory: 251.5101 GB core.py:41
INFO Available memory: 226.9499 GB core.py:42
INFO Free memory: 41.7877 GB core.py:43
2024-09-25 20:23:56 INFO Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data peak.py:1754
INFO Fitting GMM with 1 components peak.py:760
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
sys.exit(cli())
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
_peak(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 372, in _peak
df_ap_mp = fit_apgmm_guide(hdr,guide,anchorpoints,ksdf,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,segment=segments,multipliconpairs=multipliconpairs,listelement=listelements,cutoff = kscutoff,user_xlim=xlim,user_ylim=ylim,peak_threshold=prominence_cutoff,rel_height=rel_height,keeptmp=keeptmpfig)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1773, in fit_apgmm_guide
if method == 'gmm': models, aic, bic, besta, bestb, N = fit_gmm(out_file, X_log, seed, components[0], components[1], em_iter=em_iter, n_init=n_init)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 761, in fit_gmm
models[i-n1] = mixture.GaussianMixture(n_components = i, covariance_type='full', max_iter = em_iter, n_init = n_init, random_state = seed).fit(X)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 193, in fit
self.fit_predict(X, y)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 220, in fit_predict
X = _check_X(X, self.n_components, ensure_min_samples=2)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 52, in _check_X
X = check_array(X, dtype=[np.float64, np.float32],
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 651, in check_array
raise ValueError("Found array with %d sample(s) (shape=%s) while a"
ValueError: Found array with 1 sample(s) (shape=(1, 1)) while a minimum of 2 is required.
Hi, the source of the bug is the 2 segment-wise Ks in your data, which is too few.
Hi, so I was running wgd ksd
as per your suggestion using the --pairwise
parameter, but I'm encountering the following issue:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 350, in __call__
return self.func(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3221, in _get_ks
family.get_ks()
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3066, in get_ks
self.run_codeml()
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3145, in run_codeml
result, no_result = codeml.run_codeml_pairwise(
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py", line 322, in run_codeml_pairwise
return pd.concat(results), no_results
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 347, in concat
op = _Concatenator(
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 404, in __init__
raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 359, in __call__
raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
ValueError Thu Sep 26 16:14:39 2024
PID: 174911Python 3.8.19: /home/workspace2/anaconda3/envs/new_wgd_env/bin/python3.8
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(<function _get_ks>, (<wgd.core.GeneFamily object>,), {})]
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function _get_ks>
args = (<wgd.core.GeneFamily object>,)
kwargs = {}
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in _get_ks(family=<wgd.core.GeneFamily object>)
3216 node_averaged_dS_exc = df.groupby(["family", "node"])["dS"].mean()
3217 node_averaged_dS_exc = node_averaged_dS_exc.to_frame(name='node_averaged_dS_outlierexcluded')
3218 return node_averaged_dS_exc
3219
3220 def _get_ks(family):
-> 3221 family.get_ks()
family.get_ks = <bound method GeneFamily.get_ks of <wgd.core.GeneFamily object>>
3222 if family.codeml_results.shape[1] !=3:
3223 weight_inc = get_outlierincluded(family.codeml_results)
3224 weight_exc = get_outlierexcluded(family.codeml_results,cutoff = 5)
3225 node_averaged_dS_inc = get_nodeaverged_dS_outlierincluded(family.codeml_results)
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_ks(self=<wgd.core.GeneFamily object>)
3061 self.pairwise = pairwise
3062
3063 def get_ks(self):
3064 logging.info("Analysing family {}".format(self.id))
3065 self.align()
-> 3066 self.run_codeml()
self.run_codeml = <bound method GeneFamily.run_codeml of <wgd.core.GeneFamily object>>
3067 if self.codeml_results is not None:
3068 self.get_tree()
3069 self.compile_dataframe()
3070 self.combine_results()
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in run_codeml(self=<wgd.core.GeneFamily object>)
3140
3141 def run_codeml(self):
3142 codeml = Codeml(self.cds_aln, exe="codeml", tmp=self.tmp_path, prefix=self.id)
3143 # TODO, do something with `no_result`
3144 if self.pairwise:
-> 3145 result, no_result = codeml.run_codeml_pairwise(
result = undefined
no_result = undefined
codeml.run_codeml_pairwise = <bound method Codeml.run_codeml_pairwise of <wgd.codeml.Codeml object>>
self.codeml_iter = 1
3146 preserve=True, times=self.codeml_iter)
3147 else:
3148 result, no_result = codeml.run_codeml(
3149 preserve=True, times=self.codeml_iter)
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py in run_codeml_pairwise(self=<wgd.codeml.Codeml object>, **kwargs={'preserve': True, 'times': 1})
317 results.append(tmp_result)
318 if len(no_results) > 0:
319 logging.warning("Alignment length of 0 for {} pairs in {}".format(
320 len(no_results), self.prefix))
321 os.chdir(parentdir)
--> 322 return pd.concat(results), no_results
results = []
no_results = [['cds.faa_07813', 'cds.faa_00553'], ['cds.faa_07813', 'cds.faa_05156'], ['cds.faa_07813', 'cds.faa_04582'], ['cds.faa_07813', 'cds.faa_07258'], ['cds.faa_07813', 'cds.faa_05069'], ['cds.faa_07813', 'cds.faa_06318'], ['cds.faa_07813', 'cds.faa_01522'], ['cds.faa_07813', 'cds.faa_00015'], ['cds.faa_07813', 'cds.faa_08322'], ['cds.faa_07813', 'cds.faa_00130'], ['cds.faa_07813', 'cds.faa_04206'], ['cds.faa_07813', 'cds.faa_00708'], ['cds.faa_07813', 'cds.faa_02030'], ['cds.faa_07813', 'cds.faa_02258'], ['cds.faa_07813', 'cds.faa_06801'], ['cds.faa_07813', 'cds.faa_02732'], ['cds.faa_07813', 'cds.faa_00872'], ['cds.faa_07813', 'cds.faa_07008'], ['cds.faa_07813', 'cds.faa_05281'], ['cds.faa_07813', 'cds.faa_03598'], ...]
323
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args=([],), **kwargs={})
306 warnings.warn(
307 msg.format(arguments=arguments),
308 FutureWarning,
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
args = ([],)
kwargs = {}
312
313 return wrapper
314
315 return decorate
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs=[], axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
342 >>> pd.concat([df5, df6], verify_integrity=True)
343 Traceback (most recent call last):
344 ...
345 ValueError: Indexes have overlapping values: ['a']
346 """
--> 347 op = _Concatenator(
op = undefined
objs = []
axis = 0
ignore_index = False
join = 'outer'
keys = None
levels = None
names = None
verify_integrity = False
copy = True
sort = False
348 objs,
349 axis=axis,
350 ignore_index=ignore_index,
351 join=join,
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in __init__(self=<pandas.core.reshape.concat._Concatenator object>, objs=[], axis=0, join='outer', keys=None, levels=None, names=None, ignore_index=False, verify_integrity=False, copy=True, sort=False)
399 objs = [objs[k] for k in keys]
400 else:
401 objs = list(objs)
402
403 if len(objs) == 0:
--> 404 raise ValueError("No objects to concatenate")
405
406 if keys is None:
407 objs = list(com.not_none(*objs))
408 else:
ValueError: No objects to concatenate
___________________________________________________________________________
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
ValueError Thu Sep 26 16:14:39 2024
PID: 174911Python 3.8.19: /home/workspace2/anaconda3/envs/new_wgd_env/bin/python3.8
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(<function _get_ks>, (<wgd.core.GeneFamily object>,), {})]
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function _get_ks>
args = (<wgd.core.GeneFamily object>,)
kwargs = {}
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in _get_ks(family=<wgd.core.GeneFamily object>)
3216 node_averaged_dS_exc = df.groupby(["family", "node"])["dS"].mean()
3217 node_averaged_dS_exc = node_averaged_dS_exc.to_frame(name='node_averaged_dS_outlierexcluded')
3218 return node_averaged_dS_exc
3219
3220 def _get_ks(family):
-> 3221 family.get_ks()
family.get_ks = <bound method GeneFamily.get_ks of <wgd.core.GeneFamily object>>
3222 if family.codeml_results.shape[1] !=3:
3223 weight_inc = get_outlierincluded(family.codeml_results)
3224 weight_exc = get_outlierexcluded(family.codeml_results,cutoff = 5)
3225 node_averaged_dS_inc = get_nodeaverged_dS_outlierincluded(family.codeml_results)
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_ks(self=<wgd.core.GeneFamily object>)
3061 self.pairwise = pairwise
3062
3063 def get_ks(self):
3064 logging.info("Analysing family {}".format(self.id))
3065 self.align()
-> 3066 self.run_codeml()
self.run_codeml = <bound method GeneFamily.run_codeml of <wgd.core.GeneFamily object>>
3067 if self.codeml_results is not None:
3068 self.get_tree()
3069 self.compile_dataframe()
3070 self.combine_results()
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in run_codeml(self=<wgd.core.GeneFamily object>)
3140
3141 def run_codeml(self):
3142 codeml = Codeml(self.cds_aln, exe="codeml", tmp=self.tmp_path, prefix=self.id)
3143 # TODO, do something with `no_result`
3144 if self.pairwise:
-> 3145 result, no_result = codeml.run_codeml_pairwise(
result = undefined
no_result = undefined
codeml.run_codeml_pairwise = <bound method Codeml.run_codeml_pairwise of <wgd.codeml.Codeml object>>
self.codeml_iter = 1
3146 preserve=True, times=self.codeml_iter)
3147 else:
3148 result, no_result = codeml.run_codeml(
3149 preserve=True, times=self.codeml_iter)
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py in run_codeml_pairwise(self=<wgd.codeml.Codeml object>, **kwargs={'preserve': True, 'times': 1})
317 results.append(tmp_result)
318 if len(no_results) > 0:
319 logging.warning("Alignment length of 0 for {} pairs in {}".format(
320 len(no_results), self.prefix))
321 os.chdir(parentdir)
--> 322 return pd.concat(results), no_results
results = []
no_results = [['cds.faa_07813', 'cds.faa_00553'], ['cds.faa_07813', 'cds.faa_05156'], ['cds.faa_07813', 'cds.faa_04582'], ['cds.faa_07813', 'cds.faa_07258'], ['cds.faa_07813', 'cds.faa_05069'], ['cds.faa_07813', 'cds.faa_06318'], ['cds.faa_07813', 'cds.faa_01522'], ['cds.faa_07813', 'cds.faa_00015'], ['cds.faa_07813', 'cds.faa_08322'], ['cds.faa_07813', 'cds.faa_00130'], ['cds.faa_07813', 'cds.faa_04206'], ['cds.faa_07813', 'cds.faa_00708'], ['cds.faa_07813', 'cds.faa_02030'], ['cds.faa_07813', 'cds.faa_02258'], ['cds.faa_07813', 'cds.faa_06801'], ['cds.faa_07813', 'cds.faa_02732'], ['cds.faa_07813', 'cds.faa_00872'], ['cds.faa_07813', 'cds.faa_07008'], ['cds.faa_07813', 'cds.faa_05281'], ['cds.faa_07813', 'cds.faa_03598'], ...]
323
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args=([],), **kwargs={})
306 warnings.warn(
307 msg.format(arguments=arguments),
308 FutureWarning,
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
args = ([],)
kwargs = {}
312
313 return wrapper
314
315 return decorate
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs=[], axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
342 >>> pd.concat([df5, df6], verify_integrity=True)
343 Traceback (most recent call last):
344 ...
345 ValueError: Indexes have overlapping values: ['a']
346 """
--> 347 op = _Concatenator(
op = undefined
objs = []
axis = 0
ignore_index = False
join = 'outer'
keys = None
levels = None
names = None
verify_integrity = False
copy = True
sort = False
348 objs,
349 axis=axis,
350 ignore_index=ignore_index,
351 join=join,
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in __init__(self=<pandas.core.reshape.concat._Concatenator object>, objs=[], axis=0, join='outer', keys=None, levels=None, names=None, ignore_index=False, verify_integrity=False, copy=True, sort=False)
399 objs = [objs[k] for k in keys]
400 else:
401 objs = list(objs)
402
403 if len(objs) == 0:
--> 404 raise ValueError("No objects to concatenate")
405
406 if keys is None:
407 objs = list(com.not_none(*objs))
408 else:
ValueError: No objects to concatenate
___________________________________________________________________________
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
sys.exit(cli())
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 469, in ksd
_ksd(**kwargs)
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 507, in _ksd
ksdb.get_distribution()
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3242, in get_distribution
Parallel(n_jobs=self.n_threads,backend='multiprocessing')(
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 789, in __call__
self.retrieve()
File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 740, in retrieve
raise exception
joblib.my_exceptions.JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd in <module>()
3 import re
4 import sys
5 from cli import cli
6 if __name__ == '__main__':
7 sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
----> 8 sys.exit(cli())
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in __call__(self=<Group cli>, *args=(), **kwargs={})
824 echo("Aborted!", file=sys.stderr)
825 sys.exit(1)
826
827 def __call__(self, *args, **kwargs):
828 """Alias for :meth:`main`."""
--> 829 return self.main(*args, **kwargs)
self.main = <bound method BaseCommand.main of <Group cli>>
args = ()
kwargs = {}
830
831
832 class Command(BaseCommand):
833 """Commands are the basic building block of command line interfaces in
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in main(self=<Group cli>, args=['ksd', 'wgd_dmd/cds.faa.tsv', 'cds.faa', '--cds', '--strip_gaps', '--outdir', 'wgd_ksd', '--nthreads', '8', '--pairwise'], prog_name='wgd', complete_var=None, standalone_mode=True, **extra={})
777 _bashcomplete(self, prog_name, complete_var)
778
779 try:
780 try:
781 with self.make_context(prog_name, args, **extra) as ctx:
--> 782 rv = self.invoke(ctx)
rv = undefined
self.invoke = <bound method MultiCommand.invoke of <Group cli>>
ctx = <click.core.Context object>
783 if not standalone_mode:
784 return rv
785 # it's not safe to `ctx.exit(rv)` here!
786 # note that `rv` may actually contain data like "1" which
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in invoke(self=<Group cli>, ctx=<click.core.Context object>)
1254 cmd_name, cmd, args = self.resolve_command(ctx, args)
1255 ctx.invoked_subcommand = cmd_name
1256 Command.invoke(self, ctx)
1257 sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
1258 with sub_ctx:
-> 1259 return _process_result(sub_ctx.command.invoke(sub_ctx))
_process_result = <function MultiCommand.invoke.<locals>._process_result>
sub_ctx.command.invoke = <bound method Command.invoke of <Command ksd>>
sub_ctx = <click.core.Context object>
1260
1261 # In chain mode we create the contexts step by step, but after the
1262 # base command has been invoked. Because at that point we do not
1263 # know the subcommands yet, the invoked subcommand attribute is
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in invoke(self=<Command ksd>, ctx=<click.core.Context object>)
1061 """Given a context, this invokes the attached callback (if it exists)
1062 in the right way.
1063 """
1064 _maybe_show_deprecated_notice(self)
1065 if self.callback is not None:
-> 1066 return ctx.invoke(self.callback, **ctx.params)
ctx.invoke = <bound method Context.invoke of <click.core.Context object>>
self.callback = <function ksd>
ctx.params = {'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...}
1067
1068
1069 class MultiCommand(Command):
1070 """A multi command is the basic implementation of a command that
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in invoke(*args=(), **kwargs={'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...})
605 kwargs[param.name] = param.get_default(ctx)
606
607 args = args[2:]
608 with augment_usage_errors(self):
609 with ctx:
--> 610 return callback(*args, **kwargs)
callback = <function ksd>
args = ()
kwargs = {'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...}
611
612 def forward(*args, **kwargs): # noqa: B902
613 """Similar to :meth:`invoke` but fills in default keyword
614 arguments from the current context if the other command expects
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py in ksd(**kwargs={'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...})
464 wgd ksd orthologs.rbh cds1.fasta cds2.fasta
465
466 If you want to keep intermediate (temporary) files, please provide a directory
467 name for the `--tmpdir` option.
468 """
--> 469 _ksd(**kwargs)
kwargs = {'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...}
470
471 def _ksd(families, sequences, outdir, tmpdir, nthreads, to_stop, cds, pairwise,
472 strip_gaps, aligner, aln_options, tree_method, tree_options, node_average, spair, speciestree, reweight, onlyrootout, extraparanomeks, anchorpoints, plotkde, plotapgmm, plotelmm, components,xlim,ylim,adjustortho,adjustfactor,okalpha,focus2all,kstree,onlyconcatkstree,classic,toparrow, bootstrap):
473 from wgd.core import get_gene_families, SequenceData, KsDistributionBuilder
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py in _ksd(families='wgd_dmd/cds.faa.tsv', sequences=('cds.faa',), outdir='wgd_ksd', tmpdir=None, nthreads=8, to_stop=False, cds=True, pairwise=True, strip_gaps=True, aligner='mafft', aln_options='--auto', tree_method='fasttree', tree_options=None, node_average=False, spair=(), speciestree=None, reweight=False, onlyrootout=False, extraparanomeks=None, anchorpoints=None, plotkde=False, plotapgmm=False, plotelmm=False, components=(1, 4), xlim=(None, None), ylim=(None, None), adjustortho=False, adjustfactor=1.0, okalpha=0.5, focus2all=None, kstree=False, onlyconcatkstree=False, classic=False, toparrow=False, bootstrap=200)
502 #getconcataln(s, families, nthreads, outdir, option="--auto")
503 #cds_alns, pro_alns, tree_famsf, calnfs, palnfs, calnfs_length, cds_fastaf, tree_fams = get_MultipRBH_gene_families(seqs,fams,tree_method,treeset,outdir,nthreads,option="--auto",runtree=False)
504 #calculatekstree(fams,s,spgenemap)
505 #exit()
506 ksdb = KsDistributionBuilder(fams, s, n_threads=nthreads)
--> 507 ksdb.get_distribution()
ksdb.get_distribution = <bound method KsDistributionBuilder.get_distribution of <wgd.core.KsDistributionBuilder object>>
508 prefix = os.path.basename(families)
509 outfile = os.path.join(outdir, "{}.ks.tsv".format(prefix))
510 logging.info("Saving to {}".format(outfile))
511 ksdb.df.fillna("NaN").to_csv(outfile,sep="\t")
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_distribution(self=<wgd.core.KsDistributionBuilder object>)
3237 self.seqs = seqs
3238 self.n_threads = n_threads
3239
3240 def get_distribution(self):
3241 if self.n_threads < len(self.families): logging.info("{} threads are used for {} gene families\nNote that adding threads can significantly accelerate the Ks estimation process".format(int(self.n_threads),int(len(self.families))))
-> 3242 Parallel(n_jobs=self.n_threads,backend='multiprocessing')(
self.n_threads = 8
self.families = [<wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, ...]
3243 delayed(_get_ks)(family) for family in self.families)
3244 df = pd.concat([pd.read_csv(x.out, index_col=None)
3245 for x in self.families], sort=True)
3246 self.df = add_original_ids(df, self.seqs)
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=Parallel(n_jobs=8), iterable=<generator object KsDistributionBuilder.get_distribution.<locals>.<genexpr>>)
784 if pre_dispatch == "all" or n_jobs == 1:
785 # The iterable was consumed all at once by the above for loop.
786 # No need to wait for async callbacks to trigger to
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=8)>
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
792 self._print('Done %3i out of %3i | elapsed: %s finished',
793 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError Thu Sep 26 16:14:39 2024
PID: 174911Python 3.8.19: /home/workspace2/anaconda3/envs/new_wgd_env/bin/python3.8
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(<function _get_ks>, (<wgd.core.GeneFamily object>,), {})]
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function _get_ks>
args = (<wgd.core.GeneFamily object>,)
kwargs = {}
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in _get_ks(family=<wgd.core.GeneFamily object>)
3216 node_averaged_dS_exc = df.groupby(["family", "node"])["dS"].mean()
3217 node_averaged_dS_exc = node_averaged_dS_exc.to_frame(name='node_averaged_dS_outlierexcluded')
3218 return node_averaged_dS_exc
3219
3220 def _get_ks(family):
-> 3221 family.get_ks()
family.get_ks = <bound method GeneFamily.get_ks of <wgd.core.GeneFamily object>>
3222 if family.codeml_results.shape[1] !=3:
3223 weight_inc = get_outlierincluded(family.codeml_results)
3224 weight_exc = get_outlierexcluded(family.codeml_results,cutoff = 5)
3225 node_averaged_dS_inc = get_nodeaverged_dS_outlierincluded(family.codeml_results)
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_ks(self=<wgd.core.GeneFamily object>)
3061 self.pairwise = pairwise
3062
3063 def get_ks(self):
3064 logging.info("Analysing family {}".format(self.id))
3065 self.align()
-> 3066 self.run_codeml()
self.run_codeml = <bound method GeneFamily.run_codeml of <wgd.core.GeneFamily object>>
3067 if self.codeml_results is not None:
3068 self.get_tree()
3069 self.compile_dataframe()
3070 self.combine_results()
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in run_codeml(self=<wgd.core.GeneFamily object>)
3140
3141 def run_codeml(self):
3142 codeml = Codeml(self.cds_aln, exe="codeml", tmp=self.tmp_path, prefix=self.id)
3143 # TODO, do something with `no_result`
3144 if self.pairwise:
-> 3145 result, no_result = codeml.run_codeml_pairwise(
result = undefined
no_result = undefined
codeml.run_codeml_pairwise = <bound method Codeml.run_codeml_pairwise of <wgd.codeml.Codeml object>>
self.codeml_iter = 1
3146 preserve=True, times=self.codeml_iter)
3147 else:
3148 result, no_result = codeml.run_codeml(
3149 preserve=True, times=self.codeml_iter)
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py in run_codeml_pairwise(self=<wgd.codeml.Codeml object>, **kwargs={'preserve': True, 'times': 1})
317 results.append(tmp_result)
318 if len(no_results) > 0:
319 logging.warning("Alignment length of 0 for {} pairs in {}".format(
320 len(no_results), self.prefix))
321 os.chdir(parentdir)
--> 322 return pd.concat(results), no_results
results = []
no_results = [['cds.faa_07813', 'cds.faa_00553'], ['cds.faa_07813', 'cds.faa_05156'], ['cds.faa_07813', 'cds.faa_04582'], ['cds.faa_07813', 'cds.faa_07258'], ['cds.faa_07813', 'cds.faa_05069'], ['cds.faa_07813', 'cds.faa_06318'], ['cds.faa_07813', 'cds.faa_01522'], ['cds.faa_07813', 'cds.faa_00015'], ['cds.faa_07813', 'cds.faa_08322'], ['cds.faa_07813', 'cds.faa_00130'], ['cds.faa_07813', 'cds.faa_04206'], ['cds.faa_07813', 'cds.faa_00708'], ['cds.faa_07813', 'cds.faa_02030'], ['cds.faa_07813', 'cds.faa_02258'], ['cds.faa_07813', 'cds.faa_06801'], ['cds.faa_07813', 'cds.faa_02732'], ['cds.faa_07813', 'cds.faa_00872'], ['cds.faa_07813', 'cds.faa_07008'], ['cds.faa_07813', 'cds.faa_05281'], ['cds.faa_07813', 'cds.faa_03598'], ...]
323
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args=([],), **kwargs={})
306 warnings.warn(
307 msg.format(arguments=arguments),
308 FutureWarning,
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
args = ([],)
kwargs = {}
312
313 return wrapper
314
315 return decorate
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs=[], axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
342 >>> pd.concat([df5, df6], verify_integrity=True)
343 Traceback (most recent call last):
344 ...
345 ValueError: Indexes have overlapping values: ['a']
346 """
--> 347 op = _Concatenator(
op = undefined
objs = []
axis = 0
ignore_index = False
join = 'outer'
keys = None
levels = None
names = None
verify_integrity = False
copy = True
sort = False
348 objs,
349 axis=axis,
350 ignore_index=ignore_index,
351 join=join,
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in __init__(self=<pandas.core.reshape.concat._Concatenator object>, objs=[], axis=0, join='outer', keys=None, levels=None, names=None, ignore_index=False, verify_integrity=False, copy=True, sort=False)
399 objs = [objs[k] for k in keys]
400 else:
401 objs = list(objs)
402
403 if len(objs) == 0:
--> 404 raise ValueError("No objects to concatenate")
405
406 if keys is None:
407 objs = list(com.not_none(*objs))
408 else:
ValueError: No objects to concatenate
___________________________________________________________________________
Any idea on how to solve this?
Hi, what is your full command? It seems that your gene family file might have some issues.
This is my wgd ksd command::
wgd ksd wgd_dmd/cds.faa.tsv cds.faa --cds --strip_gaps --outdir wgd_ksd --nthreads 8 --pairwise
and this was my wgd dmd command:
wgd dmd cds.faa --eval 1e-5 --inflation 1.5 --nthreads 50 --keepfasta --cscore 0.7 --orthoinfer --outdir wgd_dmd
I'm attaching my wgd dmd output for your reference: wgd_dmd.zip
Thank you for the help!
Hi, I suggest removing the flag option --cds
and --strip_gaps
and try again. It might be that there was no strict cds left for some gene families. To make sure as many NaN
as possible can be successfully turned into a meaningful Ks value for your anchor pairs, you should not strip gaps.
Thank you for the clarification. I ran the script without the --cds
and --strip_gaps
parameters and added the --pairwise
option, but I’m still not getting sufficient anchor Ks values. Let me elaborate on my objective: I am investigating WGD in 52 different molluscan species and hypothesize that WGD is highly unlikely. This hypothesis seems to be supported by your tool, as we observed a lack of collinearity between gene pairs, which likely suggests the absence of WGD.
Is my inference correct, or could I be overlooking something important? Additionally, is there any other data or analysis I could include to strengthen my hypothesis?
Hi, scarce collinearity is a golden evidence for the absence of WGDs. You can conclude that your molluscan genomes are lack of any recent WGDs. But the alternative hypothesis, that there might be very ancient WGDs, needs to be formally tested using phylogenomic methods such as Whale or at least rate-corrected Ks distribution. It depends on how deep in the phylogeny you want to claim the absence of WGDs. If you want to demonstrate that all molluscans haven't experienced any WGD event since their origin, you can compare the divergence time of molluscans and their closest relatives to the Ks peak of molluscan genomes if any in a mixed Ks plot or test the WGD occurence in the crown node of molluscans using the DL+WGD model implemented in Whale.
Thank you for your help and patience! I'm concluding my analysis and the issue. I can include the 52 molluscan Ks distribution plots for your reference if you want to go through them.
Hi, just one mixed Ks plot of a high-quality molluscan genome can be already informative. You may try to make such a figure following the Pipeline 5.
Hello,
I'm encountering an issue where the
anchorpoints.txt
file generated by thewgd syn
command is empty after execution. Here’s the specific command I ran:wgd syn --feature CDS --attribute ID wgd_dmd/cds.faa.tsv cds.gff --ks_distribution wgd_ksd/cds.faa.tsv.ks.tsv --outdir wgd_syn --nthreads 50 --gistrb --showrealtick --minseglen 1 --mingenenum 1 --maxsize 200
Problem: After running the above command, the files inside the directory
iadhore-out
:anchorpoints.txt
,alignment.txt
,baseclusters.txt
,list_elements.txt
,multiplicon_pairs.txt
andsegments.txt
are empty. Other output files includinggenes.txt
seem to be generated correctly, but the lack of data inanchorpoints.txt
is causing downstream errors in the commandwgd peak
when I try to use it for further analysis. This is mywgd peak
command:wgd peak --heuristic wgd_ksd/cds.faa.tsv.ks.tsv --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt --segments wgd_syn/iadhore-out/segments.txt --listelements wgd_syn/iadhore-out/list_elements.txt --multipliconpairs wgd_syn/iadhore-out/multiplicon_pairs.txt --outdir wgd_peak --weights_outliers_included
which gives the following error:
Is this a known issue or something related to the input data formatting? Any guidance on how to resolve this would be greatly appreciated.
Thanks for your help!