jtlovell / GENESPACE

Other
184 stars 24 forks source link

Error during `convert_input2v1` and downstream errors without clear error message #76

Closed pascalangst closed 1 year ago

pascalangst commented 1 year ago

Hi @jtlovell,

Switching from v0.9.3, I encountered some issues and hope to solve them with your help. Your latest release as well as today's master branch devtools::install_github("jtlovell/GENESPACE") did not entirely work for me.

Here my sessionInfo

R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/pascal/miniconda3/envs/mamba-genespace-116/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=C                 LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Biostrings_2.66.0   GenomeInfoDb_1.34.8 XVector_0.38.0     
[4] IRanges_2.32.0      S4Vectors_0.36.0    BiocGenerics_0.44.0
[7] GENESPACE_1.1.6    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10            pillar_1.8.1           compiler_4.2.2        
 [4] R.methodsS3_1.8.2      R.utils_2.12.2         bitops_1.0-7          
 [7] tools_4.2.2            zlibbioc_1.44.0        lifecycle_1.0.3       
[10] tibble_3.2.0           gtable_0.3.1           pkgconfig_2.0.3       
[13] rlang_1.1.0            igraph_1.4.1           cli_3.6.0             
[16] parallel_4.2.2         GenomeInfoDbData_1.2.9 vctrs_0.6.0           
[19] grid_4.2.2             glue_1.6.2             data.table_1.14.8     
[22] R6_2.5.1               fansi_1.0.4            ggplot2_3.4.1         
[25] magrittr_2.0.3         scales_1.2.1           colorspace_2.1-0      
[28] utf8_1.2.3             RCurl_1.98-1.10        munsell_0.5.0         
[31] crayon_1.5.2           dbscan_1.1-11          R.oo_1.25.0  

First, I tried to convert files from a previous run with the below code.

library(GENESPACE)
GENESPACE v1.1.6: synteny and orthology constrained comparative
        genomics
wd <- "genespace_116_liftover"
convert_input2v1(
  v1Dir = wd,
  existingDir = "genespace")
Error in if (!dir.exists(ofWd)) { : argument is of length zero

However, this gave me only the bed and peptide folder. The results folder was empty. I therefore decided to redo the orthology detection.

unlink("genespace_116_liftover/results/", recursive=T)
setwd("genespace_116_liftover/")

runwd = "."
gpar <- init_genespace(
  wd = runwd,
  path2mcscanx = "~/bioinformatics/MCScanX")

Checking Working Directory ... PASS: `.`
Checking user-defined parameters ...
    Genome IDs & ploidy ... 
        Cinc: 1
        Mani: 1
    Outgroup ... NONE
    n. parallel processes ... 12
    collinear block size ... 5
    collinear block search radius ... 25
    n gaps in collinear block ... 5
    synteny buffer size... 100
    only orthogroups hits as anchors ... TRUE
    n secondary hits ... 0
Checking annotation files (.bed and peptide .fa):
    Cinc: 49131 / 49131 geneIDs exactly match (PASS)
    Mani: 44695 / 44695 geneIDs exactly match (PASS)
Checking dependencies ...
    Found valid path to OrthoFinder v2.54: `orthofinder`
    Found valid path to DIAMOND2 v2.14: `diamond`
    Found valid MCScanX_h executable: `/home/pascal/bioinformatics/MCScanX/MCScanX_h`
out <- run_genespace(gpar)

############################
1. Running orthofinder (or parsing existing results)
    Checking for existing orthofinder results ...
        Copying files over to the temporary directory: ./tmp
        Running the following command in the shell: `orthofinder -f
                ./tmp -t 12 -a 1 -X -o ./orthofinder`.This can take a
                while. To check the progress, look in the
                `WorkingDirectory` in the output (-o) directory

    OrthoFinder version 2.5.4 Copyright (C) 2014 David Emms

    2023-03-16 07:49:18 : Starting OrthoFinder 2.5.4
    12 thread(s) for highly parallel tasks (BLAST searches etc.)
    1 thread(s) for OrthoFinder algorithm

    Checking required programs are installed
    ----------------------------------------
    Test can run "mcl -h" - ok
    Test can run "fastme -i ./orthofinder/Results_Mar16/WorkingDirectory/SimpleTest.phy -o ./orthofinder/Results_Mar16/WorkingDirectory/SimpleTest.tre" - ok

    Dividing up work for BLAST for parallel processing
    --------------------------------------------------
    2023-03-16 07:49:19 : Creating diamond database 1 of 2
    2023-03-16 07:49:20 : Creating diamond database 2 of 2

    Running diamond all-versus-all
    ------------------------------
    Using 12 thread(s)
    2023-03-16 07:49:20 : This may take some time....
    2023-03-16 07:59:14 : Done all-versus-all sequence search

    Running OrthoFinder algorithm
    -----------------------------
    2023-03-16 07:59:14 : Initial processing of each species
    2023-03-16 07:59:25 : Initial processing of species 0 complete
    2023-03-16 07:59:34 : Initial processing of species 1 complete
    2023-03-16 07:59:38 : Connected putative homologues
    2023-03-16 07:59:40 : Written final scores for species 0 to graph file
    2023-03-16 07:59:41 : Written final scores for species 1 to graph file

    WARNING: program called by OrthoFinder produced output to stderr

    Command: mcl ./orthofinder/Results_Mar16/WorkingDirectory/OrthoFinder_graph.txt -I 1.5 -o ./orthofinder/Results_Mar16/WorkingDirectory/clusters_OrthoFinder_I1.5.txt -te 1 -V all

    stdout
    ------
    b''
    stderr
    ------
    b'[mcl] cut <10> instances of overlap\n'
    2023-03-16 08:00:13 : Ran MCL

    Writing orthogroups to file
    ---------------------------
    OrthoFinder assigned 67139 genes (71.6% of total) to 20145 orthogroups. Fifty percent of all genes were in orthogroups with 2 or more genes (G50 was 2) and were contained in the largest 10032 orthogroups (O50 was 10032). There were 15856 orthogroups with all species present and 10289 of these consisted entirely of single-copy genes.

    2023-03-16 08:00:19 : Done orthogroups

    Analysing Orthogroups
    =====================

    Calculating gene distances
    --------------------------
    2023-03-16 08:00:40 : Done
    2023-03-16 08:00:41 : Done 0 of 3798
    2023-03-16 08:00:42 : Done 1000 of 3798
    2023-03-16 08:00:42 : Done 2000 of 3798
    2023-03-16 08:00:42 : Done 3000 of 3798

    Inferring gene and species trees
    --------------------------------

    Reconciling gene trees and species tree
    ---------------------------------------
    2023-03-16 08:00:44 : Starting Recon and orthologues
    2023-03-16 08:00:44 : Starting OF Orthologues
    2023-03-16 08:00:44 : Done 0 of 3798
    2023-03-16 08:00:48 : Done 1000 of 3798
    2023-03-16 08:00:49 : Done 2000 of 3798
    2023-03-16 08:00:50 : Done 3000 of 3798
    2023-03-16 08:00:51 : Done OF Orthologues

    Writing results files
    =====================
    2023-03-16 08:00:53 : Done orthologues

    Results:
        orthofinder/Results_Mar16/

    CITATION:
     When publishing work that uses OrthoFinder please cite:
     Emms D.M. & Kelly S. (2019), Genome Biology 20:238

     If you use the species tree in your work then please also cite:
     Emms D.M. & Kelly S. (2017), MBE 34(12): 3267-3278
     Emms D.M. & Kelly S. (2018), bioRxiv https://doi.org/10.1101/267914
############################
2. Combining and annotating bed files w/ OGs and tandem array info ...
    ##############
    Flagging chrs. w/ < 10 unique orthogroups
    ...Cinc: 26193 genes on 7227 small chrs. ***
    ...Mani: 20928 genes on 5450 small chrs. ***
        NOTE! Genomes flagged *** have > 5% of genes on small chrs.
                These are likely not great assemblies and should be
                examined carefully
    ##############
    Flagging over-dispered OGs
    ...Cinc: 6734 genes in 304 OGs hit > 8 unique places ***
    ...Mani: 5117 genes in 248 OGs hit > 8 unique places ***
        NOTE! Genomes flagged *** have > 5% of genes in over-dispersed
                orthogroups. These are likely not great annotations, or
                the synteny run contains un-specified WGDs. Regardless,
                these should be examined carefully
    ##############
    Annotation summaries (after exclusions):
    ...Cinc: 18868 genes in 13789 OGs || 3340 genes in 1373 arrays
    ...Mani: 20703 genes in 16097 OGs || 3708 genes in 1523 arrays

############################
3. Combining and annotating the blast files with orthogroup info ...
    # Chunk 1 / 1 (08:01:04 AM) ... 
    ...Cinc v. Cinc: total hits = 499035, same og = 208698
    ...Mani v. Mani: total hits = 418608, same og = 167693
    ...Cinc v. Mani: total hits = 499960, same og = 31706
    ##############
    Generating dotplots for all hits ... Done!

############################
4. Flagging synteny for each pair of genomes ...
    # Chunk 1 / 1 (08:01:20 AM) ... 
    ...Cinc v. Mani: 10535 hits (4580 anchors) in 484 blocks (0 SVs, 484 regions)
    ...Cinc v. Cinc: 79280 hits (48127 anchors) in 8254 blocks (0 SVs, 0 regions)
    ...Mani v. Mani: 72363 hits (43006 anchors) in 6456 blocks (0 SVs, 0 regions)

############################
5. Building synteny-constrained orthogroups ...
    Done!

############################
6. Integrating syntenic positions across genomes ...
    ##############
    Generating syntenic dotplots ... Done!
    ##############
    Interpolating syntenic positions of genes ... 
    Cinc:  (0 / 1 / 2 / >2 syntenic positions)
        Cinc:    95 / 48341 /     0 /     0
        Mani: 12895 /  5232 /   391 /     0
    Mani:  (0 / 1 / 2 / >2 syntenic positions)
        Cinc: 10536 /  6184 /   181 /     0
        Mani:   137 / 43513 /     0 /     0
    Done!

############################
7. Final block coordinate calculation and riparian plotting ...
    ##############
    Calculating syntenic blocks by reference chromosomes ... 
        n regions (aggregated by 25 gene radius): 6945
        n blocks (collinear sets of > 5 genes): 6945
    ##############
    Building ref.-phased blks and riparian plots for haploid genomes:
Error in `[.data.table`(bedm, , `:=`(ord, (1:.N)/.N), by = "refchr") : 
  Supplied 2 items to be assigned to group 1 of size 0 in column 'ord'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
In addition: Warning messages:
1: In as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names,  :
  Item 2 has 0 rows but longest item has 1; filled with NA
2: In as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names,  :
  Item 3 has 0 rows but longest item has 1; filled with NA
3: In as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names,  :
  Item 4 has 0 rows but longest item has 1; filled with NA

I'm using files from the species "Cinc" and "Mani":

head genespace_116_liftover/bed/Cinc.bed 
ptg000001l  24187   25403   Pcinc_v1.9_g25899.t1
ptg000001l  143877  150762  Pcinc_v1.9_g25900.t1
ptg000003l  20635   21058   Pcinc_v1.9_g35208.t1
ptg000003l  35787   36390   Pcinc_v1.9_g35209.t1
ptg000003l  49728   57066   Pcinc_v1.9_g35210.t1
ptg000003l  78791   79460   Pcinc_v1.9_g35211.t1
ptg000004l  65830   66945   Pcinc_v1.9_g9715.t1
ptg000004l  97739   100413  Pcinc_v1.9_g9716.t1
ptg000004l  127814  129696  Pcinc_v1.9_g9717.t1
ptg000004l  137352  137709  Pcinc_v1.9_g9718.t1

head genespace_116_liftover/bed/Mani.bed 
ptg000002l  20282   20828   Pmani_v1.7_g2216.t1
ptg000002l  22787   24635   Pmani_v1.7_g2217.t1
ptg000002l  25401   27259   Pmani_v1.7_g2218.t1
ptg000002l  27745   29683   Pmani_v1.7_g2219.t1
ptg000002l  30856   35510   Pmani_v1.7_g2220.t1
ptg000002l  38407   44831   Pmani_v1.7_g2221.t1
ptg000002l  39732   44831   Pmani_v1.7_6_t
ptg000002l  46226   56581   Pmani_v1.7_g2222.t1
ptg000002l  56859   58428   Pmani_v1.7_g2223.t1
ptg000002l  69838   73666   Pmani_v1.7_g2224.t1

head genespace_116_liftover/peptide/Cinc.fa 
>Pcinc_v1.9_g25899.t1
MVRGEQKKRGHNNERRKNTQPTRERENTQPTKERTHNQREKREREHAHNERRRRENTHNQRERRGNREEREHTTNERRER
EHAHNKREKKREHSTNERRRENTHNERRRENPHNERRRENTHNQ
>Pcinc_v1.9_g25900.t1
KGQRIPLPLCISGQHKSVFHDLFCSIANSPPAESDLLYPSLYLSVGDLVCLGDCVRRSKHLTRLKLPNLKSIESYMPILL
ALGGSNAIQTLRIDSNTVSVWDKAFQLAVTSLRHNSTLCTLSLAHWTFDLQVSESGRERECESVYKLCEKVKKV
>Pcinc_v1.9_g35208.t1
MPRYVIMLYHTTTLRQHTHHTKSRHTTTPYTHHYTTTHLIPPHYNTPHHNTPHHTTPQHTTPQHTISHHTTTHYTTTHHN
KSHHNSLHHTTPQHTTLQHTTPQHTTTHHTTTHYTTPRYALYMKTRVNGKWMQEERESIE
>Pcinc_v1.9_g35209.t1

head genespace_116_liftover/peptide/Mani.fa 
>Pmani_v1.7_g2216.t1
MLAFCTFLGFVGLWFKNDEPEIDSSEMGLRKDGHKDGPNHIVGGLDKSRTNHPVAQVAEPDGQDHNNQIIHVNDEADRLP
LPHLPPNPQPFQPPQQNQQQRAGPPRLPSNEDVREKMKAERWFQEDDMKVVPGLGEGGKAVRLTGEDGRRAQEIIKKEAF
NLIASDKISLNRSVPDSRDSL
>Pmani_v1.7_g2217.t1
MEAHTNWNPENEVDIIDVLQESANFPTGEMEDLLDEEKCLSLTRPNLTEEDTIEECIIKLQQQTKTLPRSIWPDKQLLET
NYKCTRPKLFLGVVLLSEARVLEVCSGPHQEYLKTVHGELVGDFGDSRMYKCELRVDKPKDALWLKLKSTYHADSVWIYG
LHVVTTLDTKSTEMESRFSHAHLSSILQEKDVQMSKDATKFRNMLEAFNSAKSQSDEINGNNPMAFMSMLLGPAISGKSF
SPHPSMQGRNQGSCKKKSVHMERDNECQSGKRDYSSEGYEQKIENPNNSSASVCSDKEMVSDILRKNMTSLSIKIDSAES
QGYKEKREVGEPSAKYITEISNLKQQQDNIAKQFFTLMDNRGGSQMEGEMLEHVMKMAAKMNNGKVERLANGLESDITSS

The resulting dotplots and riparian plot are all 3.6 KB and cannot be displayed because they "don't have pages". Let me know what is needed to troubleshoot this error. I'm happy to share any file you need.

Cheers, Pascal

jtlovell commented 1 year ago

Hi there. I'm sure this issue comes from the contiguity of the genomes ... for example, your 'Cinc' genome has 49,131 genes. Of these, 26,193 are on scaffolds with < 10 unique orthogroups. In total, GENESPACE can only "see" about 36% of the genes in that genome.

Nonetheless, GENESPACE should report a more informative error than just breaking during syntenic block calculation. Would you mind sharing your /bed and /peptide directories? If so, shoot me an email jlovell[at]hudsonalpha[dot]org.

Thanks, John

pascalangst commented 1 year ago

Hi John,

I've sent you the folders. It is true that the genomes are not very contiguous (HiFi, but no further scaffolding). Is it possible to get the dotplots/riparian plot from our data, as it was possible with the previous version of GENESPACE?

Thanks, Pascal

jtlovell commented 1 year ago

Thanks! I have the files and will get this ironed out. I should have a fix pushed to master today.

I will make the following changes to the source code: 1) If there are more than 100 "chromosomes" with synteny, only plot the 100 largest (currently if there are more than 10,000 chr-chr combinations it doesn't make a dotplot). Riparian plot will still push through but it will look bad. 2) If there are more than 100 chromosomes with synteny, do not try to phase by reference genome chromosomes, and do not color the riparian plot (this is what is causing the error that you observe). 3) report warnings and context for 1-2, telling the user to use query_hits and/or ggdotplot to make the dotplots and set plot_riparian(..., minChrLen2plot = 0) to see all scaffolds. But note that you'll need to make very large graphics windows to see these since you have so many gaps between small scaffolds.

Note that GENESPACE really is designed for chromosome-scale genomes. Synteny becomes less useful of a tool when you don't have long runs of genes on single chromosomes. I will add a note regarding this in the readme and in the warnings returned during genome QC.

jtlovell commented 1 year ago

OK - I have a fix pushed to master now as v1.1.7.

Now, if you give genespace very broken up genomes, it will still make dotplots (but simple base plots so that facetting doesn't obscure the actual points) and pan-gene sets (but will return a warning for the latter). It will not make riparian plots if the reference genome has > 100 chromosomes in the synteny network, and returns a warning saying why.

In this particular case, I am not sure that these genomes are good candidates for GENESPACE ... I can't seen any real synteny in the dotplots, likely because there aren't any chromosomes with enough genes on them.

pascalangst commented 1 year ago

This all sounds reasonable to me. Especially the messages, which tell the user what is going on under the hood.

For our case, we have used a combined approach for obtaining candidate contigs (large contig-to-contig sequence homology, relatively high number of shared number of single copy orthologues, ...) and used genespace mainly for visualization. We have used the options onlyTheseRegions = regions_of_interest, excludeNoRegChr=T. I suspect there are options like this in versions v1.x (?). Your plots are very appealing.

I will give 1.1.7 a shot later today and share my experience. Thank you!

pascalangst commented 1 year ago

v1.1.7 works for me. Thank you!

The options highlightBed = regions_of_interest_bed, backgroundColor = NULL result in the same sort of plot I described above. In the description, it is specified that useRegions = FALSE will result in usage of "collinear syntenic blocks". Does that mean the genes are in the same order and orientation in all species? Generally, what is the difference between "aggregated syntenic regions" and "collinear syntenic blocks"?

jtlovell commented 1 year ago

Ok. good to hear! You can always include a related genome that is chromosome level, then do everything anchored to that. It would let you make plots and a reasonable pan-gene set.