heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0
25 stars 0 forks source link

`anchorpoints.txt` is empty after running wgd syn command #44

Open RohanNathHERE opened 5 days ago

RohanNathHERE commented 5 days ago

Hello,

I'm encountering an issue where the anchorpoints.txt file generated by the wgd syn command is empty after execution. Here’s the specific command I ran:

wgd syn --feature CDS --attribute ID wgd_dmd/cds.faa.tsv cds.gff --ks_distribution wgd_ksd/cds.faa.tsv.ks.tsv --outdir wgd_syn --nthreads 50 --gistrb --showrealtick --minseglen 1 --mingenenum 1 --maxsize 200

Problem: After running the above command, the files inside the directory iadhore-out: anchorpoints.txt, alignment.txt, baseclusters.txt, list_elements.txt, multiplicon_pairs.txt and segments.txt are empty. Other output files including genes.txt seem to be generated correctly, but the lack of data in anchorpoints.txt is causing downstream errors in the command wgd peak when I try to use it for further analysis. This is my wgd peak command:

wgd peak --heuristic wgd_ksd/cds.faa.tsv.ks.tsv --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt --segments wgd_syn/iadhore-out/segments.txt --listelements wgd_syn/iadhore-out/list_elements.txt --multipliconpairs wgd_syn/iadhore-out/multiplicon_pairs.txt --outdir wgd_peak --weights_outliers_included

which gives the following error:

2024-09-25 14:17:32 INFO     This is wgd v2.0.38                                                                                                                                                                                                                 cli.py:34
                    INFO     Checking cores and threads...                                                                                                                                                                                                      core.py:35
                    INFO     The number of logical CPUs/Hyper Threading in the system: 80                                                                                                                                                                       core.py:36
                    INFO     The number of physical cores in the system: 20                                                                                                                                                                                     core.py:37
                    INFO     The number of actually usable CPUs in the system: 80                                                                                                                                                                               core.py:38
                    INFO     Checking memory...                                                                                                                                                                                                                 core.py:40
                    INFO     Total physical memory: 251.5101 GB                                                                                                                                                                                                 core.py:41
                    INFO     Available memory: 241.3698 GB                                                                                                                                                                                                      core.py:42
                    INFO     Free memory: 42.2427 GB                                                                                                                                                                                                            core.py:43
2024-09-25 14:17:33 INFO     Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data                                                                                                                                                                     peak.py:1754
Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3629, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'pair'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3799, in _set_item_mgr
    loc = self._info_axis.get_loc(key)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3631, in get_loc
    raise KeyError(key) from err
KeyError: 'pair'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
    _peak(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 372, in _peak
    df_ap_mp = fit_apgmm_guide(hdr,guide,anchorpoints,ksdf,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,segment=segments,multipliconpairs=multipliconpairs,listelement=listelements,cutoff = kscutoff,user_xlim=xlim,user_ylim=ylim,peak_threshold=prominence_cutoff,rel_height=rel_height,keeptmp=keeptmpfig)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1758, in fit_apgmm_guide
    df_ap = get_anchors(anchor)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1557, in get_anchors
    anchors["pair"] = anchors[["gene_x", "gene_y"]].apply(lambda x: "__".join(sorted([x[0], x[1]])), axis=1)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3645, in __setitem__
    self._set_item_frame_value(key, value)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3788, in _set_item_frame_value
    self._set_item_mgr(key, arraylike)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3802, in _set_item_mgr
    self._mgr.insert(len(self._info_axis), key, value)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1245, in insert
    raise ValueError(
ValueError: Expected a 1D array, got an array with shape (0, 2)

Is this a known issue or something related to the input data formatting? Any guidance on how to resolve this would be greatly appreciated.

Thanks for your help!

heche-psb commented 5 days ago

Hi, the fact that you got normal genes.txt output while empty anchorpoints.txt means, in my first thought, that there was no collinearity inferred by i-adhore. Did you try to use other collinearity inference algorithm for instance MCScan to inspect the colliearity? It could be that the intra-collinearity of your genome is scarce.

RohanNathHERE commented 5 days ago

Thank you, I understand the issue now. However, in a few species, I encounter this error:

2024-09-25 17:58:54 INFO     This is wgd v2.0.38                                                                                                                                                                                                                 cli.py:34
                    INFO     Checking cores and threads...                                                                                                                                                                                                      core.py:35
                    INFO     The number of logical CPUs/Hyper Threading in the system: 80                                                                                                                                                                       core.py:36
                    INFO     The number of physical cores in the system: 20                                                                                                                                                                                     core.py:37
                    INFO     The number of actually usable CPUs in the system: 80                                                                                                                                                                               core.py:38
                    INFO     Checking memory...                                                                                                                                                                                                                 core.py:40
                    INFO     Total physical memory: 251.5101 GB                                                                                                                                                                                                 core.py:41
                    INFO     Available memory: 227.2910 GB                                                                                                                                                                                                      core.py:42
                    INFO     Free memory: 47.3290 GB                                                                                                                                                                                                            core.py:43
2024-09-25 17:58:55 INFO     Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data                                                                                                                                                                     peak.py:1754
2024-09-25 17:58:56 INFO     Fitting GMM with 1 components                                                                                                                                                                                                     peak.py:760
                    INFO     Convergence reached                                                                                                                                                                                                               peak.py:763
                    INFO     Component 1 has mean 0.053, std 0.644, weight 1.000, precision 2.411                                                                                                                                                               peak.py:68
                    INFO     Fitting GMM with 2 components                                                                                                                                                                                                     peak.py:760
2024-09-25 17:58:57 INFO     Convergence reached                                                                                                                                                                                                               peak.py:763
                    INFO     Component 1 has mean 0.014, std 0.001, weight 0.067, precision 1000000.000                                                                                                                                                         peak.py:68
                    INFO     Component 2 has mean 0.058, std 0.550, weight 0.933, precision 3.307                                                                                                                                                               peak.py:68
                    INFO     Fitting GMM with 3 components                                                                                                                                                                                                     peak.py:760
                    INFO     Convergence reached                                                                                                                                                                                                               peak.py:763
                    INFO     Component 1 has mean 0.053, std 0.463, weight 0.867, precision 4.671                                                                                                                                                               peak.py:68
                    INFO     Component 2 has mean 0.014, std 0.001, weight 0.067, precision 1000000.000                                                                                                                                                         peak.py:68
                    INFO     Component 3 has mean 0.186, std 0.001, weight 0.067, precision 1000000.000                                                                                                                                                         peak.py:68
                    INFO     Fitting GMM with 4 components                                                                                                                                                                                                     peak.py:760
2024-09-25 17:58:58 INFO     Convergence reached                                                                                                                                                                                                               peak.py:763
                    INFO     Component 1 has mean 0.078, std 0.282, weight 0.416, precision 12.562                                                                                                                                                              peak.py:68
                    INFO     Component 2 has mean 0.014, std 0.001, weight 0.067, precision 1000000.000                                                                                                                                                         peak.py:68
                    INFO     Component 3 has mean 0.038, std 0.290, weight 0.451, precision 11.868                                                                                                                                                              peak.py:68
                    INFO     Component 4 has mean 0.186, std 0.001, weight 0.067, precision 1000000.000                                                                                                                                                         peak.py:68
                    INFO     The best fitted model via AIC is with 3 components                                                                                                                                                                                peak.py:769
                    INFO     Rules-of-thumb (Burnham & Anderson, 2002) compares the AIC-best model and remaining:                                                                                                                                               peak.py:79
                    INFO     model with 1 components gets few support comparing to the AIC-best model                                                                                                                                                           peak.py:95
                    INFO     model with 2 components gets considerably less support comparing to the AIC-best model                                                                                                                                             peak.py:93
                    INFO     model with 4 components gets considerably less support comparing to the AIC-best model                                                                                                                                             peak.py:93
                    INFO     The best fitted model via BIC is with 3 components                                                                                                                                                                                peak.py:771
                    INFO     Rules-of-thumb (Kass & Raftery, 1995) evaluates the outperformance of the BIC-best model over remaining:                                                                                                                          peak.py:101
                    INFO     Such outperformance is positively evidenced for model with 1 components                                                                                                                                                           peak.py:109
                    INFO     Such outperformance is positively evidenced for model with 2 components                                                                                                                                                           peak.py:109
                    INFO     Such outperformance is strongly evidenced for model with 4 components                                                                                                                                                             peak.py:111
2024-09-25 17:59:00 INFO     Detecting likely peaks from node-averaged data of 1-Components Model Peak 0                                                                                                                                                      peak.py:1323
2024-09-25 17:59:02 INFO     Detecting likely peaks from node-averaged data of 2-Components Model Peak 0                                                                                                                                                      peak.py:1323
                    INFO     Detecting likely peaks from node-averaged data of 2-Components Model Peak 1                                                                                                                                                      peak.py:1323
2024-09-25 17:59:05 INFO     Detected one component with less than 2 valid elements, will skip it                                                                                                                                                              peak.py:561
                    INFO     Detected one component with less than 2 valid elements, will skip it                                                                                                                                                              peak.py:561
                    INFO     Detecting likely peaks from node-averaged data of 3-Components Model Peak 0                                                                                                                                                      peak.py:1323
                    INFO     Detecting likely peaks from node-averaged data of 3-Components Model Peak 1                                                                                                                                                      peak.py:1323
mc2024-09-25 17:59:08 INFO     Detected one component with less than 2 valid elements, will skip it                                                                                                                                                              peak.py:561
2024-09-25 17:59:09 INFO     Detected one component with less than 2 valid elements, will skip it                                                                                                                                                              peak.py:561
                    INFO     Detecting likely peaks from node-averaged data of 4-Components Model Peak 0                                                                                                                                                      peak.py:1323
                    INFO     Detecting likely peaks from node-averaged data of 4-Components Model Peak 1                                                                                                                                                      peak.py:1323
                    INFO     Detecting likely peaks from node-averaged data of 4-Components Model Peak 2                                                                                                                                                      peak.py:1323
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py:1831: RuntimeWarning: divide by zero encountered in log
  X_log = np.log(X).reshape(-1, 1)
                    INFO     Gaussian Mixture Modeling (GMM) on Log-scale original anchor Ks data                                                                                                                                                             peak.py:1833
                    INFO     Fitting GMM with 1 components                                                                                                                                                                                                     peak.py:760
Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
    _peak(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 373, in _peak
    df_ap = fit_apgmm_ap(hdr,anchorpoints,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,cutoff = kscutoff,peak_threshold=prominence_cutoff,rel_height=rel_height,user_xlim=xlim,user_ylim=ylim)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1834, in fit_apgmm_ap
    if method == 'gmm': models, aic, bic, besta, bestb, N = fit_gmm(out_file, X_log, seed, components[0], components[1], em_iter=em_iter, n_init=n_init)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 761, in fit_gmm
    models[i-n1] = mixture.GaussianMixture(n_components = i, covariance_type='full', max_iter = em_iter, n_init = n_init, random_state = seed).fit(X)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 193, in fit
    self.fit_predict(X, y)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 220, in fit_predict
    X = _check_X(X, self.n_components, ensure_min_samples=2)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 52, in _check_X
    X = check_array(X, dtype=[np.float64, np.float32],
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 645, in check_array
    _assert_all_finite(array,
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 97, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Could you please explain why this occurs and how I can avoid it?

heche-psb commented 5 days ago

Hi, it's due to the scarce of available anchor Ks values. How many anchor pairs are there and how many of them have Ks values?

RohanNathHERE commented 5 days ago

I found around 76 rows in the anchorpoints.txt file; I guess is_real_anchorpoint = -1 denotes that it is not a real anchorpoint. I’ve attached my result zip file for your review. Did I make any mistakes or misunderstand anything?

wgd_result.zip

heche-psb commented 5 days ago

Hi, the number of anchor pairs is 76, meaning that the number of anchor pairs with Ks within the range from 0 to 5 is even smaller. It's because either the WGD event is very ancient (for instance older than 150 mya) or the WGD retention rate is very low (for instance lower than 0.01), or both. In that case, I will make sure every anchor pair has its Ks value, rather than NaN. You can specifically calculate the Ks for those anchor pairs with NaN Ks values using the pairwise mode in wgd ksd.

RohanNathHERE commented 5 days ago

Thanks! That makes sense. I'll give it a try. However, I'm encountering another error while running wgd peak for another species:

2024-09-25 20:09:01 INFO     This is wgd v2.0.38                                                                                                                                                                                                                 cli.py:34
                    INFO     Checking cores and threads...                                                                                                                                                                                                      core.py:35
                    INFO     The number of logical CPUs/Hyper Threading in the system: 80                                                                                                                                                                       core.py:36
                    INFO     The number of physical cores in the system: 20                                                                                                                                                                                     core.py:37
                    INFO     The number of actually usable CPUs in the system: 80                                                                                                                                                                               core.py:38
                    INFO     Checking memory...                                                                                                                                                                                                                 core.py:40
                    INFO     Total physical memory: 251.5101 GB                                                                                                                                                                                                 core.py:41
                    INFO     Available memory: 226.8054 GB                                                                                                                                                                                                      core.py:42
                    INFO     Free memory: 41.8755 GB                                                                                                                                                                                                            core.py:43
                    INFO     Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data                                                                                                                                                                     peak.py:1754
                    INFO     Fitting GMM with 1 components                                                                                                                                                                                                     peak.py:760
2024-09-25 20:09:02 INFO     Convergence reached                                                                                                                                                                                                               peak.py:763
                    INFO     Component 1 has mean 0.778, std 0.086, weight 1.000, precision 136.689                                                                                                                                                             peak.py:68
                    INFO     Fitting GMM with 2 components                                                                                                                                                                                                     peak.py:760
                    INFO     Convergence reached                                                                                                                                                                                                               peak.py:763
                    INFO     Component 1 has mean 0.714, std 0.001, weight 0.500, precision 1000000.000                                                                                                                                                         peak.py:68
                    INFO     Component 2 has mean 0.847, std 0.001, weight 0.500, precision 1000000.000                                                                                                                                                         peak.py:68
                    INFO     Fitting GMM with 3 components                                                                                                                                                                                                     peak.py:760
Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
    _peak(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 372, in _peak
    df_ap_mp = fit_apgmm_guide(hdr,guide,anchorpoints,ksdf,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,segment=segments,multipliconpairs=multipliconpairs,listelement=listelements,cutoff = kscutoff,user_xlim=xlim,user_ylim=ylim,peak_threshold=prominence_cutoff,rel_height=rel_height,keeptmp=keeptmpfig)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1773, in fit_apgmm_guide
    if method == 'gmm': models, aic, bic, besta, bestb, N = fit_gmm(out_file, X_log, seed, components[0], components[1], em_iter=em_iter, n_init=n_init)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 761, in fit_gmm
    models[i-n1] = mixture.GaussianMixture(n_components = i, covariance_type='full', max_iter = em_iter, n_init = n_init, random_state = seed).fit(X)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 193, in fit
    self.fit_predict(X, y)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 220, in fit_predict
    X = _check_X(X, self.n_components, ensure_min_samples=2)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 55, in _check_X
    raise ValueError('Expected n_samples >= n_components '
ValueError: Expected n_samples >= n_components but got n_components = 3, n_samples = 2

It seems that the number of samples is insufficient for the number of components in the Gaussian Mixture Model. Do you have any suggestions on how to resolve this issue?

heche-psb commented 5 days ago

Hi, it showed that there were only two segment-wise Ks values, which made it impossible to cluster them into 3 clusters. It's also linked with scarce collinearity.

RohanNathHERE commented 5 days ago

Thanks! I encountered another error while running wgd peak, which seems to be related to the Gaussian Mixture Model and silhouette score calculation:

Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
    _peak(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 372, in _peak
    df_ap_mp = fit_apgmm_guide(hdr, guide, anchorpoints, ksdf, ksdf_filtered, seed, components, em_iter, n_init, outdir, method, gamma, weighted, plot, segment=segments, multipliconpairs=multipliconpairs, listelement=listelements, cutoff=kscutoff, user_xlim=xlim, user_ylim=ylim, peak_threshold=prominence_cutoff, rel_height=rel_height, keeptmp=keeptmpfig)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1776, in fit_apgmm_guide
    plot_silhouette_score(X_log, components[0]+1, components[1], [m.predict(X_log) for m in models][1:], outdir, guide+'_Ks', 'GMM')
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 148, in plot_silhouette_score
    scores = [metrics.silhouette_score(X, label) for label in labels]
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 148, in <listcomp>
    scores = [metrics.silhouette_score(X, label) for label in labels]
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 229, in silhouette_samples
    check_number_of_labels(len(le.classes_), n_samples)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 34, in check_number_of_labels
    raise ValueError("Number of labels is %d. Valid values are 2 to n_samples - 1 (inclusive)" % n_labels)

ValueError: Number of labels is 4. Valid values are 2 to n_samples - 1 (inclusive).

This error indicates that there are four labels in the dataset, but the number of samples is insufficient for this configuration. Specifically, the number of valid labels must be between 2 and n_samples−1. Since the dataset may not have enough samples, I’m unsure how to proceed. Do you have any advice on resolving this?

heche-psb commented 5 days ago

Hi, I think the problem is also at the only two available segment-wise Ks. If you set the number of components as --components 1 1 or --components 1 2, it might be solved.

RohanNathHERE commented 4 days ago

Thanks for the clarification! I believe this error is due to the same reasons as listed above?

2024-09-25 20:23:55 INFO     This is wgd v2.0.38                                                                                                                                                                                                                 cli.py:34
                    INFO     Checking cores and threads...                                                                                                                                                                                                      core.py:35
                    INFO     The number of logical CPUs/Hyper Threading in the system: 80                                                                                                                                                                       core.py:36
                    INFO     The number of physical cores in the system: 20                                                                                                                                                                                     core.py:37
                    INFO     The number of actually usable CPUs in the system: 80                                                                                                                                                                               core.py:38
                    INFO     Checking memory...                                                                                                                                                                                                                 core.py:40
                    INFO     Total physical memory: 251.5101 GB                                                                                                                                                                                                 core.py:41
                    INFO     Available memory: 226.9499 GB                                                                                                                                                                                                      core.py:42
                    INFO     Free memory: 41.7877 GB                                                                                                                                                                                                            core.py:43
2024-09-25 20:23:56 INFO     Gaussian Mixture Modeling (GMM) on Log-scale Segment Ks data                                                                                                                                                                     peak.py:1754
                    INFO     Fitting GMM with 1 components                                                                                                                                                                                                     peak.py:760
Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 351, in peak
    _peak(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 372, in _peak
    df_ap_mp = fit_apgmm_guide(hdr,guide,anchorpoints,ksdf,ksdf_filtered,seed,components,em_iter,n_init,outdir,method,gamma,weighted,plot,segment=segments,multipliconpairs=multipliconpairs,listelement=listelements,cutoff = kscutoff,user_xlim=xlim,user_ylim=ylim,peak_threshold=prominence_cutoff,rel_height=rel_height,keeptmp=keeptmpfig)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 1773, in fit_apgmm_guide
    if method == 'gmm': models, aic, bic, besta, bestb, N = fit_gmm(out_file, X_log, seed, components[0], components[1], em_iter=em_iter, n_init=n_init)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/peak.py", line 761, in fit_gmm
    models[i-n1] = mixture.GaussianMixture(n_components = i, covariance_type='full', max_iter = em_iter, n_init = n_init, random_state = seed).fit(X)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 193, in fit
    self.fit_predict(X, y)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 220, in fit_predict
    X = _check_X(X, self.n_components, ensure_min_samples=2)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 52, in _check_X
    X = check_array(X, dtype=[np.float64, np.float32],
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/sklearn/utils/validation.py", line 651, in check_array
    raise ValueError("Found array with %d sample(s) (shape=%s) while a"
ValueError: Found array with 1 sample(s) (shape=(1, 1)) while a minimum of 2 is required.
heche-psb commented 4 days ago

Hi, the source of the bug is the 2 segment-wise Ks in your data, which is too few.

RohanNathHERE commented 4 days ago

Hi, so I was running wgd ksd as per your suggestion using the --pairwise parameter, but I'm encountering the following issue:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3221, in _get_ks
    family.get_ks()
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3066, in get_ks
    self.run_codeml()
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3145, in run_codeml
    result, no_result = codeml.run_codeml_pairwise(
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py", line 322, in run_codeml_pairwise
    return pd.concat(results), no_results
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 347, in concat
    op = _Concatenator(
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 404, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 359, in __call__
    raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
ValueError                                         Thu Sep 26 16:14:39 2024
PID: 174911Python 3.8.19: /home/workspace2/anaconda3/envs/new_wgd_env/bin/python3.8
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _get_ks>, (<wgd.core.GeneFamily object>,), {})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _get_ks>
        args = (<wgd.core.GeneFamily object>,)
        kwargs = {}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in _get_ks(family=<wgd.core.GeneFamily object>)
   3216     node_averaged_dS_exc = df.groupby(["family", "node"])["dS"].mean()
   3217     node_averaged_dS_exc = node_averaged_dS_exc.to_frame(name='node_averaged_dS_outlierexcluded')
   3218     return node_averaged_dS_exc
   3219 
   3220 def _get_ks(family):
-> 3221     family.get_ks()
        family.get_ks = <bound method GeneFamily.get_ks of <wgd.core.GeneFamily object>>
   3222     if family.codeml_results.shape[1] !=3:
   3223         weight_inc = get_outlierincluded(family.codeml_results)
   3224         weight_exc = get_outlierexcluded(family.codeml_results,cutoff = 5)
   3225         node_averaged_dS_inc = get_nodeaverged_dS_outlierincluded(family.codeml_results)

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_ks(self=<wgd.core.GeneFamily object>)
   3061         self.pairwise = pairwise
   3062 
   3063     def get_ks(self):
   3064         logging.info("Analysing family {}".format(self.id))
   3065         self.align()
-> 3066         self.run_codeml()
        self.run_codeml = <bound method GeneFamily.run_codeml of <wgd.core.GeneFamily object>>
   3067         if self.codeml_results is not None:
   3068             self.get_tree()
   3069             self.compile_dataframe()
   3070         self.combine_results()

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in run_codeml(self=<wgd.core.GeneFamily object>)
   3140 
   3141     def run_codeml(self):
   3142         codeml = Codeml(self.cds_aln, exe="codeml", tmp=self.tmp_path, prefix=self.id)
   3143         # TODO, do something with `no_result`
   3144         if self.pairwise:
-> 3145             result, no_result = codeml.run_codeml_pairwise(
        result = undefined
        no_result = undefined
        codeml.run_codeml_pairwise = <bound method Codeml.run_codeml_pairwise of <wgd.codeml.Codeml object>>
        self.codeml_iter = 1
   3146                     preserve=True, times=self.codeml_iter)
   3147         else:
   3148             result, no_result = codeml.run_codeml(
   3149                     preserve=True, times=self.codeml_iter)

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py in run_codeml_pairwise(self=<wgd.codeml.Codeml object>, **kwargs={'preserve': True, 'times': 1})
    317                         results.append(tmp_result)
    318         if len(no_results) > 0:
    319             logging.warning("Alignment length of 0 for {} pairs in {}".format(
    320                 len(no_results), self.prefix))
    321         os.chdir(parentdir)
--> 322         return pd.concat(results), no_results
        results = []
        no_results = [['cds.faa_07813', 'cds.faa_00553'], ['cds.faa_07813', 'cds.faa_05156'], ['cds.faa_07813', 'cds.faa_04582'], ['cds.faa_07813', 'cds.faa_07258'], ['cds.faa_07813', 'cds.faa_05069'], ['cds.faa_07813', 'cds.faa_06318'], ['cds.faa_07813', 'cds.faa_01522'], ['cds.faa_07813', 'cds.faa_00015'], ['cds.faa_07813', 'cds.faa_08322'], ['cds.faa_07813', 'cds.faa_00130'], ['cds.faa_07813', 'cds.faa_04206'], ['cds.faa_07813', 'cds.faa_00708'], ['cds.faa_07813', 'cds.faa_02030'], ['cds.faa_07813', 'cds.faa_02258'], ['cds.faa_07813', 'cds.faa_06801'], ['cds.faa_07813', 'cds.faa_02732'], ['cds.faa_07813', 'cds.faa_00872'], ['cds.faa_07813', 'cds.faa_07008'], ['cds.faa_07813', 'cds.faa_05281'], ['cds.faa_07813', 'cds.faa_03598'], ...]
    323 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args=([],), **kwargs={})
    306                 warnings.warn(
    307                     msg.format(arguments=arguments),
    308                     FutureWarning,
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
        args = ([],)
        kwargs = {}
    312 
    313         return wrapper
    314 
    315     return decorate

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs=[], axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
    342     >>> pd.concat([df5, df6], verify_integrity=True)
    343     Traceback (most recent call last):
    344         ...
    345     ValueError: Indexes have overlapping values: ['a']
    346     """
--> 347     op = _Concatenator(
        op = undefined
        objs = []
        axis = 0
        ignore_index = False
        join = 'outer'
        keys = None
        levels = None
        names = None
        verify_integrity = False
        copy = True
        sort = False
    348         objs,
    349         axis=axis,
    350         ignore_index=ignore_index,
    351         join=join,

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in __init__(self=<pandas.core.reshape.concat._Concatenator object>, objs=[], axis=0, join='outer', keys=None, levels=None, names=None, ignore_index=False, verify_integrity=False, copy=True, sort=False)
    399             objs = [objs[k] for k in keys]
    400         else:
    401             objs = list(objs)
    402 
    403         if len(objs) == 0:
--> 404             raise ValueError("No objects to concatenate")
    405 
    406         if keys is None:
    407             objs = list(com.not_none(*objs))
    408         else:

ValueError: No objects to concatenate
___________________________________________________________________________
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
ValueError                                         Thu Sep 26 16:14:39 2024
PID: 174911Python 3.8.19: /home/workspace2/anaconda3/envs/new_wgd_env/bin/python3.8
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _get_ks>, (<wgd.core.GeneFamily object>,), {})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _get_ks>
        args = (<wgd.core.GeneFamily object>,)
        kwargs = {}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in _get_ks(family=<wgd.core.GeneFamily object>)
   3216     node_averaged_dS_exc = df.groupby(["family", "node"])["dS"].mean()
   3217     node_averaged_dS_exc = node_averaged_dS_exc.to_frame(name='node_averaged_dS_outlierexcluded')
   3218     return node_averaged_dS_exc
   3219 
   3220 def _get_ks(family):
-> 3221     family.get_ks()
        family.get_ks = <bound method GeneFamily.get_ks of <wgd.core.GeneFamily object>>
   3222     if family.codeml_results.shape[1] !=3:
   3223         weight_inc = get_outlierincluded(family.codeml_results)
   3224         weight_exc = get_outlierexcluded(family.codeml_results,cutoff = 5)
   3225         node_averaged_dS_inc = get_nodeaverged_dS_outlierincluded(family.codeml_results)

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_ks(self=<wgd.core.GeneFamily object>)
   3061         self.pairwise = pairwise
   3062 
   3063     def get_ks(self):
   3064         logging.info("Analysing family {}".format(self.id))
   3065         self.align()
-> 3066         self.run_codeml()
        self.run_codeml = <bound method GeneFamily.run_codeml of <wgd.core.GeneFamily object>>
   3067         if self.codeml_results is not None:
   3068             self.get_tree()
   3069             self.compile_dataframe()
   3070         self.combine_results()

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in run_codeml(self=<wgd.core.GeneFamily object>)
   3140 
   3141     def run_codeml(self):
   3142         codeml = Codeml(self.cds_aln, exe="codeml", tmp=self.tmp_path, prefix=self.id)
   3143         # TODO, do something with `no_result`
   3144         if self.pairwise:
-> 3145             result, no_result = codeml.run_codeml_pairwise(
        result = undefined
        no_result = undefined
        codeml.run_codeml_pairwise = <bound method Codeml.run_codeml_pairwise of <wgd.codeml.Codeml object>>
        self.codeml_iter = 1
   3146                     preserve=True, times=self.codeml_iter)
   3147         else:
   3148             result, no_result = codeml.run_codeml(
   3149                     preserve=True, times=self.codeml_iter)

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py in run_codeml_pairwise(self=<wgd.codeml.Codeml object>, **kwargs={'preserve': True, 'times': 1})
    317                         results.append(tmp_result)
    318         if len(no_results) > 0:
    319             logging.warning("Alignment length of 0 for {} pairs in {}".format(
    320                 len(no_results), self.prefix))
    321         os.chdir(parentdir)
--> 322         return pd.concat(results), no_results
        results = []
        no_results = [['cds.faa_07813', 'cds.faa_00553'], ['cds.faa_07813', 'cds.faa_05156'], ['cds.faa_07813', 'cds.faa_04582'], ['cds.faa_07813', 'cds.faa_07258'], ['cds.faa_07813', 'cds.faa_05069'], ['cds.faa_07813', 'cds.faa_06318'], ['cds.faa_07813', 'cds.faa_01522'], ['cds.faa_07813', 'cds.faa_00015'], ['cds.faa_07813', 'cds.faa_08322'], ['cds.faa_07813', 'cds.faa_00130'], ['cds.faa_07813', 'cds.faa_04206'], ['cds.faa_07813', 'cds.faa_00708'], ['cds.faa_07813', 'cds.faa_02030'], ['cds.faa_07813', 'cds.faa_02258'], ['cds.faa_07813', 'cds.faa_06801'], ['cds.faa_07813', 'cds.faa_02732'], ['cds.faa_07813', 'cds.faa_00872'], ['cds.faa_07813', 'cds.faa_07008'], ['cds.faa_07813', 'cds.faa_05281'], ['cds.faa_07813', 'cds.faa_03598'], ...]
    323 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args=([],), **kwargs={})
    306                 warnings.warn(
    307                     msg.format(arguments=arguments),
    308                     FutureWarning,
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
        args = ([],)
        kwargs = {}
    312 
    313         return wrapper
    314 
    315     return decorate

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs=[], axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
    342     >>> pd.concat([df5, df6], verify_integrity=True)
    343     Traceback (most recent call last):
    344         ...
    345     ValueError: Indexes have overlapping values: ['a']
    346     """
--> 347     op = _Concatenator(
        op = undefined
        objs = []
        axis = 0
        ignore_index = False
        join = 'outer'
        keys = None
        levels = None
        names = None
        verify_integrity = False
        copy = True
        sort = False
    348         objs,
    349         axis=axis,
    350         ignore_index=ignore_index,
    351         join=join,

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in __init__(self=<pandas.core.reshape.concat._Concatenator object>, objs=[], axis=0, join='outer', keys=None, levels=None, names=None, ignore_index=False, verify_integrity=False, copy=True, sort=False)
    399             objs = [objs[k] for k in keys]
    400         else:
    401             objs = list(objs)
    402 
    403         if len(objs) == 0:
--> 404             raise ValueError("No objects to concatenate")
    405 
    406         if keys is None:
    407             objs = list(com.not_none(*objs))
    408         else:

ValueError: No objects to concatenate
___________________________________________________________________________

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 469, in ksd
    _ksd(**kwargs)
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py", line 507, in _ksd
    ksdb.get_distribution()
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py", line 3242, in get_distribution
    Parallel(n_jobs=self.n_threads,backend='multiprocessing')(
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py", line 740, in retrieve
    raise exception
joblib.my_exceptions.JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/bin/wgd in <module>()
      3 import re
      4 import sys
      5 from cli import cli
      6 if __name__ == '__main__':
      7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
----> 8     sys.exit(cli())

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in __call__(self=<Group cli>, *args=(), **kwargs={})
    824             echo("Aborted!", file=sys.stderr)
    825             sys.exit(1)
    826 
    827     def __call__(self, *args, **kwargs):
    828         """Alias for :meth:`main`."""
--> 829         return self.main(*args, **kwargs)
        self.main = <bound method BaseCommand.main of <Group cli>>
        args = ()
        kwargs = {}
    830 
    831 
    832 class Command(BaseCommand):
    833     """Commands are the basic building block of command line interfaces in

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in main(self=<Group cli>, args=['ksd', 'wgd_dmd/cds.faa.tsv', 'cds.faa', '--cds', '--strip_gaps', '--outdir', 'wgd_ksd', '--nthreads', '8', '--pairwise'], prog_name='wgd', complete_var=None, standalone_mode=True, **extra={})
    777         _bashcomplete(self, prog_name, complete_var)
    778 
    779         try:
    780             try:
    781                 with self.make_context(prog_name, args, **extra) as ctx:
--> 782                     rv = self.invoke(ctx)
        rv = undefined
        self.invoke = <bound method MultiCommand.invoke of <Group cli>>
        ctx = <click.core.Context object>
    783                     if not standalone_mode:
    784                         return rv
    785                     # it's not safe to `ctx.exit(rv)` here!
    786                     # note that `rv` may actually contain data like "1" which

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in invoke(self=<Group cli>, ctx=<click.core.Context object>)
   1254                 cmd_name, cmd, args = self.resolve_command(ctx, args)
   1255                 ctx.invoked_subcommand = cmd_name
   1256                 Command.invoke(self, ctx)
   1257                 sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
   1258                 with sub_ctx:
-> 1259                     return _process_result(sub_ctx.command.invoke(sub_ctx))
        _process_result = <function MultiCommand.invoke.<locals>._process_result>
        sub_ctx.command.invoke = <bound method Command.invoke of <Command ksd>>
        sub_ctx = <click.core.Context object>
   1260 
   1261         # In chain mode we create the contexts step by step, but after the
   1262         # base command has been invoked.  Because at that point we do not
   1263         # know the subcommands yet, the invoked subcommand attribute is

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in invoke(self=<Command ksd>, ctx=<click.core.Context object>)
   1061         """Given a context, this invokes the attached callback (if it exists)
   1062         in the right way.
   1063         """
   1064         _maybe_show_deprecated_notice(self)
   1065         if self.callback is not None:
-> 1066             return ctx.invoke(self.callback, **ctx.params)
        ctx.invoke = <bound method Context.invoke of <click.core.Context object>>
        self.callback = <function ksd>
        ctx.params = {'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...}
   1067 
   1068 
   1069 class MultiCommand(Command):
   1070     """A multi command is the basic implementation of a command that

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/click/core.py in invoke(*args=(), **kwargs={'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...})
    605                     kwargs[param.name] = param.get_default(ctx)
    606 
    607         args = args[2:]
    608         with augment_usage_errors(self):
    609             with ctx:
--> 610                 return callback(*args, **kwargs)
        callback = <function ksd>
        args = ()
        kwargs = {'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...}
    611 
    612     def forward(*args, **kwargs):  # noqa: B902
    613         """Similar to :meth:`invoke` but fills in default keyword
    614         arguments from the current context if the other command expects

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py in ksd(**kwargs={'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...})
    464         wgd ksd orthologs.rbh cds1.fasta cds2.fasta
    465 
    466     If you want to keep intermediate (temporary) files, please provide a directory
    467     name for the `--tmpdir` option.
    468     """
--> 469     _ksd(**kwargs)
        kwargs = {'adjustfactor': 1.0, 'adjustortho': False, 'aligner': 'mafft', 'aln_options': '--auto', 'anchorpoints': None, 'bootstrap': 200, 'cds': True, 'classic': False, 'components': (1, 4), 'extraparanomeks': None, ...}
    470 
    471 def _ksd(families, sequences, outdir, tmpdir, nthreads, to_stop, cds, pairwise,
    472         strip_gaps, aligner, aln_options, tree_method, tree_options, node_average, spair, speciestree, reweight, onlyrootout, extraparanomeks, anchorpoints, plotkde, plotapgmm, plotelmm, components,xlim,ylim,adjustortho,adjustfactor,okalpha,focus2all,kstree,onlyconcatkstree,classic,toparrow, bootstrap):
    473     from wgd.core import get_gene_families, SequenceData, KsDistributionBuilder

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/cli.py in _ksd(families='wgd_dmd/cds.faa.tsv', sequences=('cds.faa',), outdir='wgd_ksd', tmpdir=None, nthreads=8, to_stop=False, cds=True, pairwise=True, strip_gaps=True, aligner='mafft', aln_options='--auto', tree_method='fasttree', tree_options=None, node_average=False, spair=(), speciestree=None, reweight=False, onlyrootout=False, extraparanomeks=None, anchorpoints=None, plotkde=False, plotapgmm=False, plotelmm=False, components=(1, 4), xlim=(None, None), ylim=(None, None), adjustortho=False, adjustfactor=1.0, okalpha=0.5, focus2all=None, kstree=False, onlyconcatkstree=False, classic=False, toparrow=False, bootstrap=200)
    502         #getconcataln(s, families, nthreads, outdir, option="--auto")
    503         #cds_alns, pro_alns, tree_famsf, calnfs, palnfs, calnfs_length, cds_fastaf, tree_fams = get_MultipRBH_gene_families(seqs,fams,tree_method,treeset,outdir,nthreads,option="--auto",runtree=False)
    504         #calculatekstree(fams,s,spgenemap)
    505         #exit()
    506     ksdb = KsDistributionBuilder(fams, s, n_threads=nthreads)
--> 507     ksdb.get_distribution()
        ksdb.get_distribution = <bound method KsDistributionBuilder.get_distribution of <wgd.core.KsDistributionBuilder object>>
    508     prefix = os.path.basename(families)
    509     outfile = os.path.join(outdir, "{}.ks.tsv".format(prefix))
    510     logging.info("Saving to {}".format(outfile))
    511     ksdb.df.fillna("NaN").to_csv(outfile,sep="\t")

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_distribution(self=<wgd.core.KsDistributionBuilder object>)
   3237         self.seqs = seqs
   3238         self.n_threads = n_threads
   3239 
   3240     def get_distribution(self):
   3241         if self.n_threads < len(self.families): logging.info("{} threads are used for {} gene families\nNote that adding threads can significantly accelerate the Ks estimation process".format(int(self.n_threads),int(len(self.families))))
-> 3242         Parallel(n_jobs=self.n_threads,backend='multiprocessing')(
        self.n_threads = 8
        self.families = [<wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, <wgd.core.GeneFamily object>, ...]
   3243             delayed(_get_ks)(family) for family in self.families)
   3244         df = pd.concat([pd.read_csv(x.out, index_col=None) 
   3245             for x in self.families], sort=True)
   3246         self.df = add_original_ids(df, self.seqs)

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=Parallel(n_jobs=8), iterable=<generator object KsDistributionBuilder.get_distribution.<locals>.<genexpr>>)
    784             if pre_dispatch == "all" or n_jobs == 1:
    785                 # The iterable was consumed all at once by the above for loop.
    786                 # No need to wait for async callbacks to trigger to
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=8)>
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time
    792             self._print('Done %3i out of %3i | elapsed: %s finished',
    793                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Thu Sep 26 16:14:39 2024
PID: 174911Python 3.8.19: /home/workspace2/anaconda3/envs/new_wgd_env/bin/python3.8
...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _get_ks>, (<wgd.core.GeneFamily object>,), {})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _get_ks>
        args = (<wgd.core.GeneFamily object>,)
        kwargs = {}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in _get_ks(family=<wgd.core.GeneFamily object>)
   3216     node_averaged_dS_exc = df.groupby(["family", "node"])["dS"].mean()
   3217     node_averaged_dS_exc = node_averaged_dS_exc.to_frame(name='node_averaged_dS_outlierexcluded')
   3218     return node_averaged_dS_exc
   3219 
   3220 def _get_ks(family):
-> 3221     family.get_ks()
        family.get_ks = <bound method GeneFamily.get_ks of <wgd.core.GeneFamily object>>
   3222     if family.codeml_results.shape[1] !=3:
   3223         weight_inc = get_outlierincluded(family.codeml_results)
   3224         weight_exc = get_outlierexcluded(family.codeml_results,cutoff = 5)
   3225         node_averaged_dS_inc = get_nodeaverged_dS_outlierincluded(family.codeml_results)

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in get_ks(self=<wgd.core.GeneFamily object>)
   3061         self.pairwise = pairwise
   3062 
   3063     def get_ks(self):
   3064         logging.info("Analysing family {}".format(self.id))
   3065         self.align()
-> 3066         self.run_codeml()
        self.run_codeml = <bound method GeneFamily.run_codeml of <wgd.core.GeneFamily object>>
   3067         if self.codeml_results is not None:
   3068             self.get_tree()
   3069             self.compile_dataframe()
   3070         self.combine_results()

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/core.py in run_codeml(self=<wgd.core.GeneFamily object>)
   3140 
   3141     def run_codeml(self):
   3142         codeml = Codeml(self.cds_aln, exe="codeml", tmp=self.tmp_path, prefix=self.id)
   3143         # TODO, do something with `no_result`
   3144         if self.pairwise:
-> 3145             result, no_result = codeml.run_codeml_pairwise(
        result = undefined
        no_result = undefined
        codeml.run_codeml_pairwise = <bound method Codeml.run_codeml_pairwise of <wgd.codeml.Codeml object>>
        self.codeml_iter = 1
   3146                     preserve=True, times=self.codeml_iter)
   3147         else:
   3148             result, no_result = codeml.run_codeml(
   3149                     preserve=True, times=self.codeml_iter)

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/wgd/codeml.py in run_codeml_pairwise(self=<wgd.codeml.Codeml object>, **kwargs={'preserve': True, 'times': 1})
    317                         results.append(tmp_result)
    318         if len(no_results) > 0:
    319             logging.warning("Alignment length of 0 for {} pairs in {}".format(
    320                 len(no_results), self.prefix))
    321         os.chdir(parentdir)
--> 322         return pd.concat(results), no_results
        results = []
        no_results = [['cds.faa_07813', 'cds.faa_00553'], ['cds.faa_07813', 'cds.faa_05156'], ['cds.faa_07813', 'cds.faa_04582'], ['cds.faa_07813', 'cds.faa_07258'], ['cds.faa_07813', 'cds.faa_05069'], ['cds.faa_07813', 'cds.faa_06318'], ['cds.faa_07813', 'cds.faa_01522'], ['cds.faa_07813', 'cds.faa_00015'], ['cds.faa_07813', 'cds.faa_08322'], ['cds.faa_07813', 'cds.faa_00130'], ['cds.faa_07813', 'cds.faa_04206'], ['cds.faa_07813', 'cds.faa_00708'], ['cds.faa_07813', 'cds.faa_02030'], ['cds.faa_07813', 'cds.faa_02258'], ['cds.faa_07813', 'cds.faa_06801'], ['cds.faa_07813', 'cds.faa_02732'], ['cds.faa_07813', 'cds.faa_00872'], ['cds.faa_07813', 'cds.faa_07008'], ['cds.faa_07813', 'cds.faa_05281'], ['cds.faa_07813', 'cds.faa_03598'], ...]
    323 

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args=([],), **kwargs={})
    306                 warnings.warn(
    307                     msg.format(arguments=arguments),
    308                     FutureWarning,
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
        args = ([],)
        kwargs = {}
    312 
    313         return wrapper
    314 
    315     return decorate

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs=[], axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
    342     >>> pd.concat([df5, df6], verify_integrity=True)
    343     Traceback (most recent call last):
    344         ...
    345     ValueError: Indexes have overlapping values: ['a']
    346     """
--> 347     op = _Concatenator(
        op = undefined
        objs = []
        axis = 0
        ignore_index = False
        join = 'outer'
        keys = None
        levels = None
        names = None
        verify_integrity = False
        copy = True
        sort = False
    348         objs,
    349         axis=axis,
    350         ignore_index=ignore_index,
    351         join=join,

...........................................................................
/home/workspace2/anaconda3/envs/new_wgd_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py in __init__(self=<pandas.core.reshape.concat._Concatenator object>, objs=[], axis=0, join='outer', keys=None, levels=None, names=None, ignore_index=False, verify_integrity=False, copy=True, sort=False)
    399             objs = [objs[k] for k in keys]
    400         else:
    401             objs = list(objs)
    402 
    403         if len(objs) == 0:
--> 404             raise ValueError("No objects to concatenate")
    405 
    406         if keys is None:
    407             objs = list(com.not_none(*objs))
    408         else:

ValueError: No objects to concatenate
___________________________________________________________________________

Any idea on how to solve this?

heche-psb commented 4 days ago

Hi, what is your full command? It seems that your gene family file might have some issues.

RohanNathHERE commented 3 days ago

This is my wgd ksd command:: wgd ksd wgd_dmd/cds.faa.tsv cds.faa --cds --strip_gaps --outdir wgd_ksd --nthreads 8 --pairwise

and this was my wgd dmd command: wgd dmd cds.faa --eval 1e-5 --inflation 1.5 --nthreads 50 --keepfasta --cscore 0.7 --orthoinfer --outdir wgd_dmd

I'm attaching my wgd dmd output for your reference: wgd_dmd.zip

Thank you for the help!

heche-psb commented 3 days ago

Hi, I suggest removing the flag option --cds and --strip_gaps and try again. It might be that there was no strict cds left for some gene families. To make sure as many NaN as possible can be successfully turned into a meaningful Ks value for your anchor pairs, you should not strip gaps.

RohanNathHERE commented 3 days ago

Thank you for the clarification. I ran the script without the --cds and --strip_gaps parameters and added the --pairwise option, but I’m still not getting sufficient anchor Ks values. Let me elaborate on my objective: I am investigating WGD in 52 different molluscan species and hypothesize that WGD is highly unlikely. This hypothesis seems to be supported by your tool, as we observed a lack of collinearity between gene pairs, which likely suggests the absence of WGD.

Is my inference correct, or could I be overlooking something important? Additionally, is there any other data or analysis I could include to strengthen my hypothesis?

heche-psb commented 3 days ago

Hi, scarce collinearity is a golden evidence for the absence of WGDs. You can conclude that your molluscan genomes are lack of any recent WGDs. But the alternative hypothesis, that there might be very ancient WGDs, needs to be formally tested using phylogenomic methods such as Whale or at least rate-corrected Ks distribution. It depends on how deep in the phylogeny you want to claim the absence of WGDs. If you want to demonstrate that all molluscans haven't experienced any WGD event since their origin, you can compare the divergence time of molluscans and their closest relatives to the Ks peak of molluscan genomes if any in a mixed Ks plot or test the WGD occurence in the crown node of molluscans using the DL+WGD model implemented in Whale.

RohanNathHERE commented 15 hours ago

Thank you for your help and patience! I'm concluding my analysis and the issue. I can include the 52 molluscan Ks distribution plots for your reference if you want to go through them.

heche-psb commented 14 hours ago

Hi, just one mixed Ks plot of a high-quality molluscan genome can be already informative. You may try to make such a figure following the Pipeline 5.