Wgd_syn error (File not found anchorpoints.txt)

manoharbisht1998 commented 12 months ago

Hi, thanks for this commendable tool for detecting whole-genome duplication. I installed the tool, and wgd ksd ran successfully for the whole-paranome analysis. I then tried to run the wgd syn command to obtain the anchor Ks distribution, but I encountered the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/HDD1/WGD/wgd_syn/iadhore-out/anchorpoints.txt'

A snippet of the whole error is here:

Write statistics = false
Alignment method = GreedyGraphbased4
Multiple hypothesis correction = FDR
Number of threads = 1
Compare aligners = false
Collinear searches only
Visualize GHM.png = false
Visualize Alignment = false
Verbose output = true
**** END i-AdDHoRe parameters *****

              Creating dataset...                                           
     INFO     Processing I-ADHoRe output                          cli.py:652

Traceback (most recent call last):
  File "/home/samuelG/.conda/envs/samuel/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/cli.py", line 613, in syn
    _syn(**kwargs)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/cli.py", line 654, in _syn
    anchors,orig_anchors = get_anchors(out_path)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/wgd/syn.py", line 181, in get_anchors
    else: anchors = pd.read_csv(os.path.join(out_path, "anchorpoints.txt"), sep="\t", index_col=0)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py", line 229, in _open_handles
    errors=kwds.get("encoding_errors", "strict"),
  File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/common.py", line 707, in get_handle
    newline="",
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/HDD1/WGD/wgd_syn/iadhore-out/anchorpoints.txt'

heche-psb commented 12 months ago

Hi, thanks for your interest in wgd v2. The error message indicates that the output file anchorpoints.txt was not properly produced by the external software i-adhore. It might be that no anchor was found given the gene family, feature, attribute and gff3 file. Could you please share the full log of your command with me? Also, I suggest using the latest version, v2.0.19.
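
One quick way to check this before rerunning wgd syn is to look at the i-adhore output directory directly. The following is a minimal sketch, assuming the directory layout seen in the traceback (iadhore-out/anchorpoints.txt) and assuming the file starts with a single header line; the helper name check_iadhore_output is hypothetical and not part of wgd.

```python
from pathlib import Path


def check_iadhore_output(out_dir, filename="anchorpoints.txt"):
    """Return "missing", "empty" or "ok" for an i-adhore output file.

    Hypothetical helper for illustration; it is not part of wgd.
    """
    path = Path(out_dir) / filename
    if not path.is_file():
        # i-adhore never wrote the file, e.g. because parsing its input failed.
        return "missing"
    with path.open() as handle:
        n_lines = sum(1 for line in handle if line.strip())
    # Assumption: the first non-empty line is a header, so a file with at
    # most one line contains no anchor pairs.
    return "empty" if n_lines <= 1 else "ok"
```

For the case reported here, check_iadhore_output("/mnt/HDD1/WGD/wgd_syn/iadhore-out") would be expected to return "missing", which points to a problem upstream of wgd syn itself.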

manoharbisht1998 commented 11 months ago

Hi, thanks for the quick reply. Yes, I am using the latest version (v2.0.19). Here is the log file: log.txt

Kindly help. Thanks.

heche-psb commented 11 months ago

Hi, it appears that there are some errors when parsing your gff3 file. I previously ran into the same error, which turned out to be caused by a misformatted gff3 file. Apart from comment lines (starting with "#") and empty lines, is every line of your gff3 file strictly tab-separated ('\t'), with no consecutive tabs ("\t\t")? The message No orientation +/- in gene list. indicates that the number of tabs per line is not exactly 8, or that consecutive tabs exist. A normal line in a gff3 file should look like this:

LG7 EVM gene 5775617 5778078 . + . ID=EVM0040957;

Each line has 9 columns separated by single tabs (with no consecutive tabs "\t\t").
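
The two rules above can be checked automatically before running wgd syn. This is a minimal sketch that tests only what is described here (9 tab-separated columns, no consecutive tabs, comment and empty lines ignored); the function name find_gff3_format_problems is hypothetical and not part of wgd.

```python
def find_gff3_format_problems(lines):
    """Yield (line_number, message) for lines that break the 9-column tab rule.

    Hypothetical helper for illustration; it is not part of wgd.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Comment lines and empty lines are allowed and skipped.
        if not line.strip() or line.startswith("#"):
            continue
        if "\t\t" in line:
            yield number, "consecutive tabs"
            continue
        n_columns = len(line.split("\t"))
        if n_columns != 9:
            yield number, f"{n_columns} columns instead of 9"
```

Passing an open file handle, for example find_gff3_format_problems(open("genome.gff3")) where genome.gff3 is a placeholder file name, lists the offending line numbers so they can be fixed by hand or with a script.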

manoharbisht1998 commented 11 months ago

Thanks a ton for the constructive suggestion, it worked! I have now successfully run the wgd syn analysis, which produced the following results: Syndepth.pdf and final_cds.fa.tsv.ksd.pdf. I also ran wgd peak for mixture-model clustering of the anchor Ks and got many files. However, I am not sure which wgd peak output file to use for making inferences about the Ks peak, or what the exact Ks value is. I am attaching one of the results here: Original_AnchorKs_GMM_Component3_node_weighted_Lognormal.pdf. Also, is there any way to change the intervals of the x-axis and y-axis? The figure looks very informal. Kindly help!

heche-psb commented 11 months ago

Hi, the wgd peak function is mainly intended for mixture modeling and WGD dating.

The files labelled "Original Anchor Ks" are the mixture modeling results for the original anchor Ks (from the anchorpoints.txt file of i-adhore), while the files labelled "Segment Ks" are the mixture modeling results for the collinear segment Ks, calculated as the median Ks value of all syntelogs residing in a segment (from the multiplicon_pairs.txt file of i-adhore). If you are not doing WGD dating, you might only need the log-scale GMM results for the original anchor Ks.
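
To illustrate how a segment Ks differs from the original anchor Ks, here is a minimal sketch of the median step only. The input layout, an iterable of (segment_pair_id, ks) tuples, and the function name segment_ks are assumptions made for illustration; this is not how wgd reads the i-adhore files.

```python
from collections import defaultdict
from statistics import median


def segment_ks(anchor_pairs):
    """Collapse anchor-pair Ks values into one median Ks per segment pair.

    `anchor_pairs` is an iterable of (segment_pair_id, ks) tuples; this
    layout is an illustrative assumption, not the format wgd uses.
    """
    grouped = defaultdict(list)
    for segment_pair_id, ks in anchor_pairs:
        grouped[segment_pair_id].append(ks)
    # One value per segment pair: the median of its anchor-pair Ks values.
    return {pair_id: median(values) for pair_id, values in grouped.items()}
```

Because each segment pair contributes a single median, a few outlier anchor pairs inside a long collinear segment have little effect on the segment Ks distribution.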

It seems that you're using the version on PyPI, which is not up to date: in the GitHub source code I have since modified some plotting functions in wgd peak to improve the figures, for instance to prevent the y-limit from overflowing. I will soon push a new version, v2.0.20, to PyPI. I also think that adding options to let users define the x-axis, y-axis and legend is a good idea, and I will integrate them into v2.0.20. For now, if you want to change the axes manually, you can use the data file Original_AnchorKs_GMM_3components_prediction.tsv to remake the plot with a library such as matplotlib or seaborn.
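
As a starting point for remaking the plot by hand, here is a minimal matplotlib sketch with user-defined axis limits. It assumes the TSV has a header row and one numeric column holding Ks values; the column name is passed in by the caller because the actual header of the prediction file is not shown in this thread. The function names are hypothetical, and the plot is a plain histogram, not a reproduction of the wgd peak figure.

```python
import csv
import math


def read_ks_values(tsv_path, column):
    """Read one numeric column from a tab-separated file, skipping blanks and NaN."""
    values = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            cell = (row.get(column) or "").strip()
            if not cell:
                continue
            value = float(cell)
            if not math.isnan(value):
                values.append(value)
    return values


def plot_ks_histogram(values, out_path, xlim=(0, 3), ylim=None, bin_width=0.1):
    """Save a Ks histogram with user-chosen axis limits."""
    # Imported here so read_ks_values also works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    n_bins = max(1, round((xlim[1] - xlim[0]) / bin_width))
    fig, ax = plt.subplots()
    ax.hist(values, bins=n_bins, range=xlim)
    ax.set_xlim(*xlim)
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel("Ks")
    ax.set_ylabel("Number of anchor pairs")  # label text is an assumption
    fig.savefig(out_path)
    plt.close(fig)
```

A call such as plot_ks_histogram(read_ks_values("Original_AnchorKs_GMM_3components_prediction.tsv", "dS"), "anchor_ks.pdf", xlim=(0, 3), ylim=(0, 500)) would then write the figure; the column name "dS" is only a guess and should be replaced with whatever the file's header actually contains.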

manoharbisht1998 commented 11 months ago

Thanks for the great insights. For your information, I am using wgd v2.0.19

So, the plot obtained from wgd viz -d wgd_ksd/final_cds.fa.tsv.ks.tsv is elmm_final_cds.fa.tsv.ks.tsv_best_models_node_averaged.pdf. From it, I infer that the Ks peak for the whole paranome is at 0.46. Am I correct? And is this the right file to use for determining the Ks value? Thanks.

heche-psb commented 11 months ago

For detailed information about the elmm plot, I suggest reading the paper describing the software ksrates. As for the results from your data: yes, they do show evidence of a whole-genome duplication event at a Ks age of 0.46. Next, you can do the relative and absolute timing of this identified WGD :-) as I demonstrated in the README and Doc.

Thanks for the information. I know that you're using v2.0.19, but I have kept updating the source code since the original v2.0.19 was pushed to PyPI, so your version may be a little behind the source code here. Thanks for your interest in wgd v2!

manoharbisht1998 commented 11 months ago

Hi, thanks for the confirmation. I realized that I had installed the tool from the source code, and its viz function looks like this. Could you please tell me which parameter lets me shorten the y-axis interval?

wgd peak -h

20:32:23 INFO This is wgd v2.0.19 cli.py:32
Usage: wgd peak [OPTIONS] KS_DISTRIBUTION

  Infer peak and CI of Ks distribution.

Options:
  -ap, --anchorpoints TEXT  anchor pair infomation
  -sm, --segments TEXT  segments information
  -le, --listelements TEXT  listelements information
  -mp, --multipliconpairs TEXT  multipliconpairs information
  -o, --outdir TEXT  output directory [default: wgd_peak]
  -af, --alignfilter FLOAT...  filter alignment identity, length and coverage [default: 0.0, 0, 0.0]
  -r, --ksrange FLOAT...  range of Ks to be analyzed [default: 0, 5]
  -bw, --bin_width FLOAT  bandwidth of distribution [default: 0.1]
  -ic, --weights_outliers_included  include Ks outliers
  -m, --method [gmm|bgmm]  mixture modeling method [default: gmm]
  --seed INTEGER  random seed given to initialize parameters [default: 2352890]
  -ei, --em_iter INTEGER  number of EM iterations to perform [default: 200]
  -ni, --n_init INTEGER  number of initializations to perform [default: 200]
  -n, --components ...  range of number of components to fit [default: 1, 4]
  -g, --gamma FLOAT  gamma parameter for bgmm models [default: 0.001]
  --boots INTEGER  number of bootstrap replicates of kde [default: 200]
  --weighted  node-weighted instead of node-averaged method
  -p, --plot [stacked|identical]  plotting method [default: identical]
  -bm, --bw_method [silverman|ISJ]  bandwidth method [default: silverman]
  --n_medoids INTEGER  number of medoids to generate [default: 2]
  -km, --kdemethod [scipy|naivekde|treekde|fftkde]  kde method [default: scipy]
  --n_clusters INTEGER  number of clusters to plot Elbow loss function [default: 5]
  --kmedoids  K-Medoids clustering method
  -gd, --guide [multiplicon|basecluster|segment]  regime residing anchors [default: segment]
  -prct, --prominence_cutoff FLOAT  prominence cutoff of acceptable peaks [default: 0.1]
  -kd, --kstodate FLOAT...  range of Ks to be dated [default: 0.5, 1.5]
  -f, --family TEXT  family to filter Ks upon
  --manualset  Manually set Ks range of anchor pairs or multiplicons as CI
  -rh, --rel_height FLOAT  relative height at which the peak width is measured [default: 0.4]
  --ci INTEGER  confidence level of log-normal distribution to date [default: 95]
  --hdr INTEGER  highest density region (HDR) in a given distribution to date [default: 95]
  --heuristic  heuristic CI for dating
  -kc, --kscutoff FLOAT  Ks Saturation cutoff for genes in Dating [default: 5]
  --showci  show CI for original anchor Ks gmm analysis
  -h, --help  Show this message and exit.

heche-psb commented 11 months ago

Hi, I just updated wgd to v2.0.20, in which you can set your own x-axis and y-axis limits, for instance --xlim 0 3 and --ylim 0 500. The y-ticks still follow matplotlib's defaults; only the limits can be changed at the moment. I'm still thinking about the best way to implement this and will improve it in the future.

manoharbisht1998 commented 11 months ago

Hey, thanks for the upgrade. Could you please add that option to the wgd viz function too?

heche-psb commented 11 months ago

Hi, I just added that option to wgd viz, but only in the source code here. If you want to use it, you can install wgd v2 from here. Thanks for your feedback! If you have other suggestions, feel free to post them here :-)

heche-psb / wgd

Wgd_syn error (File not found anchorpoints.txt) #3