heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0
21 stars 0 forks source link

Error running dmd #20

Closed hyyuu closed 4 months ago

hyyuu commented 5 months ago

Hello , thank you very much for the program ! However , I faced an error at the first step running wgd dmd.

Traceback (most recent call last):
  File "/home/hiuyan/.pyenv/versions/3.6.12/bin/wgd", line 11, in <module>
    load_entry_point('wgd==2.0.24', 'console_scripts', 'wgd')()
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 113, in dmd
    _dmd(**kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 140, in _dmd
    s[0].get_paranome(inflation=inflation, eval=eval)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgd/core.py", line 410, in get_paranome
    mcl_out = gf.run_mcl(inflation=inflation)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgd/core.py", line 495, in run_mcl
    out = sp.run(command, stdout=sp.PIPE, stderr=sp.PIPE)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mcxload': 'mcxload'

I tried with two different cds files, they all returned the same error messages. Could you please help me to fix this error? I installed wgd v2 via pip.

Thank you very much for your help!

(P.S. Sorry that I posted this question on wgdv1 github page. Sorry for the inconvenience. )

Regards, Alex

heche-psb commented 5 months ago

Hi, thanks for using wgd v2! It's required to pre-install the program mcl to run the wgd dmd program. Could you please install the mcl and try again?

hyyuu commented 5 months ago

Thank you so much for your prompt reply! It works perfectly now. But I have one more question about the choice of the mixture model. May I ask why is elmm modeling used on paralogous KS values while gmm modeling is used on anchor KS values? I am sorry that I cannot find relevant answers online. Your answer would be greatly appreciated, or any recommendation of relevant papers would be greatly helpful as well.

Thank you very much for your help again!

Best, Alex

heche-psb commented 5 months ago

Hi, I would suggest reading the paper of ksrates which first implemented the ELMM algorithm on whole-paranome Ks data. In wgd v2, ELMM analysis in both node-averaged and node-weighted manner is available by the wgd viz program. The ELMM method on whole-paranome Ks data is based on the assumption that small-scale gene duplication (and loss, SSDL) is the main source of gene duplicates in a genome without WGDs while an additional WGD will lead to the burst of gene duplicates and thus leave a Ks peak on the Ks distribution. Such that the exponential component is to model the background Ks distribution of SSDL while the normal components are to model the further WGDs. The Ks distribution of anchor pairs, on the other hand, is assumed to be only impacted by WGDs, upon which only normal components are thus needed in the mixture model to delineate WGDs. Thanks for your suggestions and I will update doc later to make it clearer!

hyyuu commented 5 months ago

Hello, thank you for your detailed reply! I will check out the paper later. I am sorry that I have one more question about interpreting the Ks distributions of paralogs. I am not sure whether the inferred components are WGD or just over-fitting.

Below the component b ( red line) has a peak near K=0. Should it be considered a potential WGD event or just a burst of recent gene duplication ( or the recent gene duplication should be represented by the blue line, component a)? As the Ks distribution is L-shape.

image

Below is another slightly different scenario, the number of duplicates is slightly higher at K=0.1 than K=0.0. Should the red line (component b, Ks= 0.11/0.13) be interpreted as WGD here? lignic parsp

image

And I observed an interesting result below. Two components, b and c, record peaks at 0.03 and 0.16. Does it mean there are two potential WGDs here or should be considered as over-fitting?

image

Sorry for the lengthy question. I greatly appreciate your kind help! Thank you.

Regards, Alex

heche-psb commented 5 months ago

Hi, interesting results. Could you please share or tell me that what is the GMM result of the anchor Ks distribution? If the anchor Ks distribution also has the very recent Ks peak, then you'd better further check the collinear dot plot (with Ks values denoted) to see if there are really massive collinear genes of very recent age. On the other side, young Ks peaks (near zero) are frequently present in transcriptome-derived Ks distributions due to redundancy and alternative splicing. Cautions need to be thus taken for the interpretation of the mixture modeling result of transcriptome-derived Ks distributions.

hyyuu commented 5 months ago

Hello ! Thank you for your explanation! That's very helpful ! For all these four species, they are tanscriptome-derived Ks distributions. Yet, I ran CD-HIT on the cds file to remove redundancy before running wgd. Are these near-zero peaks still caused by the redundancy? For these graphs, I should consider there is no WGD for now?

I did several more species with transcriptomes. Around 80% of these Ks-distributions exhibit a normal L-shape. Is it because the level of redundancy differs between transcriptomes?

I only have four interested species that have annotations. I tried to the anchor Ks distribution. But no Ks plot is generated after wgd ksd. In the log file, it said :

 WARNING  No codeml result for GF00008501 due to no        codeml.py:291
                  resolved nucleotides                                          
         WARNING  No codeml result for GF00008499 due to no        codeml.py:291
                  resolved nucleotides                                          
         WARNING  No codeml result for GF00008502 due to no        codeml.py:291
                  resolved nucleotides                                          
16:26:58 INFO     Saving to                                           cli.py:519
                  wgd_ksd/Amphibalanus_amphitrite_GCF_019059575.1_NRL           
                  GWU_Aamphi_draft_cds_from_genomic.fna.tsv.ks.tsv              
16:27:02 INFO     Making plots                                        cli.py:521
         INFO     No valid Ks values for plotting                     cli.py:523

I got the same message for all four species with the data downloaded from NCBI RefSeq. Could you please tell me how to fix this? The ftp link of one of my species is below. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/764/305/GCF_000764305.2_Hazt_2.0.2/

Thank you very much again for answering my questions! It really helps me to interpret the result accurately!

Warm regards, Alex

heche-psb commented 5 months ago

I tried with GCF_000764305.2_Hazt_2.0.2_cds_from_genomic.fna in this directory which gave me some Ks results. Just to make sure we're using the same commit, I updated the version 2.0.25 just now. Could you please install the latest version and try again? On the other hand, the peaks at Ks= 0.11/0.13/0.16 seem to be indicative of WGDs, despite that it's impossible to claim that the redundancy or alternative splicing is completely eliminated due to the nature of transcriptome. Maybe you can share me with the dotplot? and let's see how much percentages of genomic regions are duplicated (as we would expect many-to-most genomic regions being duplicated due to the young WGD age).

hyyuu commented 5 months ago

Hello! Thank you for checking the data! I upgraded wgd, now it can output Ks value. For three species. However, it still said INFO No valid Ks values for plotting cli.py:523 So, there is no output of Ks plot after wgd ksd.

On the other hand , for this species, I have no Ks value output again. Is it because of the quality of assembly? https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/059/575/GCF_019059575.1_NRLGWU_Aamphi_draft/

Then I continue to run wgd syn for anchor Ks distribution. It failed again. I tried -f with CDS and mRNA. Both have the same error. Should I parse the gff file for this step? Or did I do something wrong ?

         ERROR    No genes from families file                         cli.py:701
                  `wgd_dmd/GCF_011947565.3_Ppol_2.1_cds_from_genomic.           
                  fna.tsv` found in the GFF file for `feature=CDS`              
                  and `attribute=Name`, please double check command             
                  settings.        

For the transcriptomic KS distribution, I am sorry that they only have transcriptomes available. There is no genomic data for making dot plots. I only have their BUSCO scores of the transcriptomes ( Duplicate score ranges from 15 to 30, for those Ks= 0.11/0.13/0.16). Is there other way to check these duplicated genomic regions ?

Thank you for your kind help again!

Warm regards, Alex

hyyuu commented 5 months ago

Hello, I just tried again with wgd syn for the anchor distribution. I parse the sequence to be same as the "Name" as stated in gff. i.e., >XP_xxxx.X, and I ran a CD-Hit on the NCBI Refseq cds file. This time, wgd ksd ran smoothly and Ks plot is generated. However, I got another error at wgd syn, unlike what I replied in the last comment.

nohup: ignoring input
23:44:37 INFO     This is wgd v2.0.25                                  cli.py:32
23:44:40 INFO     Note: NumExpr detected 64 cores but               utils.py:147
                  "NUMEXPR_MAX_THREADS" not set, so enforcing safe              
                  limit of 8.                                                   
         INFO     NumExpr defaulting to 8 threads.                  utils.py:159
         INFO     Configuring I-ADHoRe co-linearity search            cli.py:708
         INFO     Writing families file                                syn.py:96
         INFO     Writing gene lists                                   syn.py:98
         ERROR    There are duplicated gene IDs for given feature and syn.py:121
                  attribute                                

Sorry for throwing two comments at the same time. Thank you very much for your kind help again !

Warms regards, Alex

hyyuu commented 5 months ago

Hello, I am sorry for bothering you. I still could not solve the bugs above after additional runs. Do you have any idea about this? Your help will be greatly appreciated!

Best, Alex

heche-psb commented 5 months ago

Hi, there seems to be duplicated gene IDs in your cds file, which is not allowed. It's mandatory to keep each gene ID unique and findable in the gff3 file.

hyyuu commented 4 months ago

Hello ! Thank you very much for your help ! I successfully ran the anchor pair analyses on a few species now !

However, when I tried on a non-chromosome level assembly, it returned an error at wgd syn

15:59:52 INFO     This is wgd v2.0.25                                  cli.py:32
15:59:54 INFO     Note: NumExpr detected 64 cores but               utils.py:147
                  "NUMEXPR_MAX_THREADS" not set, so enforcing safe              
                  limit of 8.                                                   
         INFO     NumExpr defaulting to 8 threads.                  utils.py:159
         INFO     Configuring I-ADHoRe co-linearity search            cli.py:708
         INFO     Writing families file                                syn.py:96
         INFO     Writing gene lists                                   syn.py:98
15:59:58 INFO     Writing config file                                 syn.py:100
16:00:03 INFO     Running I-ADHoRe                                    cli.py:712
16:00:10 WARNING  WARNING: Maximum allowed number of gaps in the      syn.py:186
                  alignment not specified.  Setting to cluster_gap.             
                  WARNING: Tandem gap size not correct in settings              
                  file. Using default (gap_size / 2)                            

         INFO                                                         syn.py:187
                  This is i-ADHoRe v3.0.                                        
                  Copyright (c) 2002-2010, Flanders Interuniversity             
                  Institute for Biotechnology, VIB.                             
                  Algorithm designed by Klaas Vandepoele, Cedric                
                  Simillion, Jan Fostier, Dieter De Witte,                      
                  Koen Janssens, Sebastian Proost, Yvan Saeys and               
                  Yves Van de Peer.                                             

                  Process 1/1 is alive on localhost.                            

                  ************* i-ADHoRe parameters *************               
                          Number of genelists = 1954                            
                          Blast table =                                         
                  /home/hiuyan/synteny/wgd/wgd_syn/families.tsv                 
                          Output path =                                         
                  /home/hiuyan/synteny/wgd/wgd_syn/iadhore-out/                 
                          Gap size = 30                                         
                          Cluster gap size = 35                                 
                          Cloud gap size = 0                                    
                          Cloud cluster gap size = 0                            
                          Max gaps in alignment = 35                            
                          Tandem gap = 15                                       
                          Flush output = 1000                                   
                          Q-value = 0.75                                        
                          Anchorpoints = 3                                      
                          Probability cutoff = 0.01                             
                          Cloud filtering method = Binomial                     
                          Level 2 only = false                                  
                          Use family = true                                     
                          Write statistics = false                              
                          Alignment method = GreedyGraphbased4                  
                          Multiple hypothesis correction = FDR                  
                          Number of threads = 1                                 
                          Compare aligners = false                              
                          Collinear searches only                               
                          Visualize GHM.png = false                             
                          Visualize Alignment = false                           
                          Verbose output = true                                 
                  ************ END i-AdDHoRe parameters *********               

                  Creating dataset...                     done.                 
                  (time: 0.033169s)                                             
                  Mapping gene families...                done.                 
                  (time: 0.0233521s)                                            
                  Remapping tandem duplicates...  done. (time:                  
                  0.005831s)                                                    
                  Writing genelists file...               done.                 
                  (time: 0.0541091s)                                            
                  Collinear Search                                              
                  Level 2 multiplicon detection...        done.                 
                  (time: 6.67852s)                                              
                  Profile detection...                                          
                  1 multiplicons to evaluate - evaluating level 2               
                  multiplicon... 0 new multiplicons found.                      
                  Flushing output files...done.                                 
                  Time for Higher Level Detection: 0.00259781s.                 

                  All Done!  Bye...                                             

         INFO     Processing I-ADHoRe output                          cli.py:716
         INFO     Dropped 524 scaffolds in hya_1.cds.fa because they viz.py:2701
                  are on scaffolds shorter than 5000                            
Traceback (most recent call last):
  File "/home/hiuyan/.pyenv/versions/3.6.12/bin/wgd", line 11, in <module>
    load_entry_point('wgd==2.0.25', 'console_scripts', 'wgd')()
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 677, in syn
    _syn(**kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 730, in _syn
    segs = filter_mingenumber(segs_gene_unit,mingenenum,outdir,len(gene_order_dict_allsp))
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgd/viz.py", line 2740, in filter_mingenumber
    profile = counted.unstack(level=-1).fillna(0)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/pandas/core/series.py", line 3903, in unstack
    return unstack(self, level, fill_value)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 425, in unstack
    obj.index, level=level, constructor=obj._constructor_expanddim,
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 92, in __init__
    self.index = index.remove_unused_levels()
AttributeError: 'Index' object has no attribute 'remove_unused_levels'

Is it because of the relatively low quality of the used genome here ? This is the genome assembly that I used. https://www.ncbi.nlm.nih.gov/assembly/GCF_000764305.2/?shouldredirect=false

I also have another question about the result from the original anchor pair and the segment-guided Ks. For more ancient WGD ( around Ks =2.0), can I say it is also acceptable to propose the incidence of WGDs using the segment-guided Ks result?

Thank you very much for your help again ! Have a nice weekend.

Best, Alex

heche-psb commented 4 months ago

Hi, could you please share me the gff3 file and the gene family file that you used exactly? Besides, the segment Ks (the median of all residing syntelog Ks values) and anchor pair Ks are supposed to evolve at different rates, while the former of which reflects a more conserved estimate of the age of duplication event. Both are authentic remnants indicative of (ancient) WGD events. To answer your question, my opinion is that it is credible to speculate an ancient WGD event upon both Ks data, but the interpretation has to be made with caution of Ks saturation (when it's over 2).