jtlovell / GENESPACE

Other
192 stars 27 forks source link

Script stuck at step 3. Combining and annotating the blast files with orthogroup info ... #81

Closed bbista closed 1 year ago

bbista commented 1 year ago

Hello, I was wondering what the run time for the sample data is. My script has been stuck in one of the steps and it usually times out after that. I am trying to run the sample data just as the readme states. I have not modified the scripts in anyway.

human: n unique sequences = 20678, n matched to gff = 20678
chicken: n unique sequences = 18093, n matched to gff = 18093
Checking Working Directory ... PASS: `/work/LAS/nvalenzu-lab/bbista/genespace`
Checking user-defined parameters ...
    Genome IDs & ploidy ... 
        chicken: 1
        human  : 1
    Outgroup ... NONE
    n. parallel processes ... 16
    collinear block size ... 5
    collinear block search radius ... 25
    n gaps in collinear block ... 5
    synteny buffer size... 100
    only orthogroups hits as anchors ... TRUE
    n secondary hits ... 0
Checking annotation files (.bed and peptide .fa):
    chicken: 18093 / 18093 geneIDs exactly match (PASS)
    human  : 20678 / 20678 geneIDs exactly match (PASS)
Checking dependencies ...
    Found valid path to OrthoFinder v2.54: `orthofinder`
    Found valid path to DIAMOND2 v2.015: `diamond`
    Found valid MCScanX_h executable: `/work/LAS/nvalenzu-lab/bbista/Software/MCScanX//MCScanX_h`

############################
1. Running orthofinder (or parsing existing results)
    Checking for existing orthofinder results ...
        Copying files over to the temporary directory:
                /work/LAS/nvalenzu-lab/bbista/genespace/tmp
        Running the following command in the shell: `orthofinder -f
                /work/LAS/nvalenzu-lab/bbista/genespace/tmp -t 16 -a 1
                -X -o
                /work/LAS/nvalenzu-lab/bbista/genespace/orthofinder`.This
                can take a while. To check the progress, look in the
                `WorkingDirectory` in the output (-o) directory

    OrthoFinder version 2.5.4 Copyright (C) 2014 David Emms

    2023-03-27 03:13:14 : Starting OrthoFinder 2.5.4
    16 thread(s) for highly parallel tasks (BLAST searches etc.)
    1 thread(s) for OrthoFinder algorithm

    Checking required programs are installed
    ----------------------------------------
    Test can run "mcl -h" - ok
    Test can run "fastme -i /work/LAS/nvalenzu-lab/bbista/genespace/orthofinder/Results_Mar27/WorkingDirectory/SimpleTest.phy -o /work/LAS/nvalenzu-lab/bbista/genespace/orthofinder/Results_Mar27/WorkingDirectory/SimpleTest.tre" - ok

    Dividing up work for BLAST for parallel processing
    --------------------------------------------------
    2023-03-27 03:13:15 : Creating diamond database 1 of 2
    2023-03-27 03:13:15 : Creating diamond database 2 of 2

    Running diamond all-versus-all
    ------------------------------
    Using 16 thread(s)
    2023-03-27 03:13:15 : This may take some time....
    2023-03-27 03:16:42 : Done all-versus-all sequence search

    Running OrthoFinder algorithm
    -----------------------------
    2023-03-27 03:16:42 : Initial processing of each species
    2023-03-27 03:16:47 : Initial processing of species 0 complete
    2023-03-27 03:16:51 : Initial processing of species 1 complete
    2023-03-27 03:16:54 : Connected putative homologues
    2023-03-27 03:16:55 : Written final scores for species 0 to graph file
    2023-03-27 03:16:56 : Written final scores for species 1 to graph file

    WARNING: program called by OrthoFinder produced output to stderr

    Command: mcl /work/LAS/nvalenzu-lab/bbista/genespace/orthofinder/Results_Mar27/WorkingDirectory/OrthoFinder_graph.txt -I 1.5 -o /work/LAS/nvalenzu-lab/bbista/genespace/orthofinder/Results_Mar27/WorkingDirectory/clusters_OrthoFinder_I1.5.txt -te 1 -V all

    stdout
    ------
    b''
    stderr
    ------
    b'[mcl] cut <1> instances of overlap\n'
    2023-03-27 03:17:02 : Ran MCL

    Writing orthogroups to file
    ---------------------------
    OrthoFinder assigned 35152 genes (90.7% of total) to 13672 orthogroups. Fifty percent of all genes were in orthogroups with 2 or more genes (G50 was 2) and were contained in the largest 5789 orthogroups (O50 was 5789). There were 13053 orthogroups with all species present and 10896 of these consisted entirely of single-copy genes.

    2023-03-27 03:17:43 : Done orthogroups

    Analysing Orthogroups
    =====================

    Calculating gene distances
    --------------------------
    2023-03-27 03:17:54 : Done
    2023-03-27 03:17:54 : Done 0 of 1105
    2023-03-27 03:17:54 : Done 100 of 1105
    2023-03-27 03:17:55 : Done 200 of 1105
    2023-03-27 03:17:55 : Done 300 of 1105
    2023-03-27 03:17:55 : Done 400 of 1105
    2023-03-27 03:17:55 : Done 500 of 1105
    2023-03-27 03:17:56 : Done 600 of 1105
    2023-03-27 03:17:56 : Done 700 of 1105
    2023-03-27 03:17:56 : Done 800 of 1105
    2023-03-27 03:17:56 : Done 900 of 1105
    2023-03-27 03:17:57 : Done 1000 of 1105

    Inferring gene and species trees
    --------------------------------

    Reconciling gene trees and species tree
    ---------------------------------------
    2023-03-27 03:17:58 : Starting Recon and orthologues
    2023-03-27 03:17:58 : Starting OF Orthologues
    2023-03-27 03:17:58 : Done 0 of 1105
    2023-03-27 03:17:59 : Done 100 of 1105
    2023-03-27 03:18:00 : Done 200 of 1105
    2023-03-27 03:18:00 : Done 300 of 1105
    2023-03-27 03:18:01 : Done 400 of 1105
    2023-03-27 03:18:02 : Done 500 of 1105
    2023-03-27 03:18:02 : Done 600 of 1105
    2023-03-27 03:18:03 : Done 700 of 1105
    2023-03-27 03:18:03 : Done 800 of 1105
    2023-03-27 03:18:04 : Done 900 of 1105
    2023-03-27 03:18:04 : Done 1000 of 1105
    2023-03-27 03:18:05 : Done 1100 of 1105
    2023-03-27 03:18:05 : Done OF Orthologues

    Writing results files
    =====================
    2023-03-27 03:18:06 : Done orthologues

    Results:
        /work/LAS/nvalenzu-lab/bbista/genespace/orthofinder/Results_Mar27/

    CITATION:
     When publishing work that uses OrthoFinder please cite:
     Emms D.M. & Kelly S. (2019), Genome Biology 20:238

     If you use the species tree in your work then please also cite:
     Emms D.M. & Kelly S. (2017), MBE 34(12): 3267-3278
     Emms D.M. & Kelly S. (2018), bioRxiv https://doi.org/10.1101/267914
############################
2. Combining and annotating bed files w/ OGs and tandem array info ...
    ##############
    Flagging chrs. w/ < 10 unique orthogroups
    ...chicken: 475 genes on 55 small chrs. 
    ...human  :   7 genes on  5 small chrs. 
    ##############
    Flagging over-dispered OGs
    ...chicken: 224 genes in  7 OGs hit > 8 unique places 
    ...human  : 470 genes in 15 OGs hit > 8 unique places 
    ##############
    Annotation summaries (after exclusions):
    ...chicken: 17433 genes in 15260 OGs || 2151 genes in 419 arrays
    ...human  : 20202 genes in 15984 OGs || 3449 genes in 851 arrays

############################
3. Combining and annotating the blast files with orthogroup info ...
    # Chunk 1 / 1 (03:18:10 AM) ... 
    ...human v. human:     total hits = 221075, same og = 64737
    ...chicken v. chicken: total hits = 190677, same og = 58525
    ...human v. chicken:   total hits = 224950, same og = 16935
    ##############
    Generating dotplots for all hits ... 

This is the furthest I have gotten. It gets stuck in this for day until the script times out.

Any help is appreciated.

Best, Basanta

jtlovell commented 1 year ago

huh - did you get any dotplots in /dotplots? also, what GENESPACE version are you using?

bbista commented 1 year ago

I am using GENESPACE v1.1.4. The /dotplots folder is empty,

Best, B

jtlovell commented 1 year ago

Please update to the most recent release (1.1.8) and try run_genespace() again ... let me know how it goes.

detach("package:GENESPACE", unload = TRUE)
devtools::install_github("jtlovell/GENESPACE@v1.1.8", upgrade = F)
library(GENESPACE)
bbista commented 1 year ago

Hello, I have started the run but I am getting unusual error messaged with this version:

GENESPACE v1.1.8: synteny and orthology constrained comparative
        genomics
Error in match_fasta2gff(path2fasta = fa, path2gff = gf, genespaceWd = genespaceWd,  : 
  some of the peptides have '.' or '-' in the sequence. Orthofinder can't handle this.
Calls: parse_annotations -> rbindlist -> lapply -> FUN -> match_fasta2gff
Execution halted

Again, this is using the sample data with the provided script.

jtlovell commented 1 year ago

ok! the new diagnostics work! OrthoFinder cannot take peptide fastas with "." or "-" symbols. This likely causes a quiet error out. Remove those characters from your peptides and try again.

bbista commented 1 year ago

These are not my peptide fastas. I am using the sample data.

jtlovell commented 1 year ago

really? hmm. ok, one sec.

jtlovell commented 1 year ago

I see the error. will correct shortly.

jtlovell commented 1 year ago

Its not an error ... the chicken genome has a "-"

>RHOA
-NRKHPREVDPGSEAFLSQRAYHLGRKQEGPEE*RAHKTRAGQNEA...PHRCIWIYGVFGKDQRRCEGGF*NGH*SCFASPAWQEKVRVPSLI

I guess thats standard for NCBI sometimes. I think I'll need to add a method to remove these (even though I don't want to) ... should have an update pushed later today.

bbista commented 1 year ago

Thank you.

jtlovell commented 1 year ago

OK, this bug should be fixed. It is now the default to remove . and - characters from the peptide files when parsing. Then number of these is reported. Please give it a try and let me know if it works. Update your install with:

detach("package:GENESPACE", unload = TRUE)
devtools::install_github("jtlovell/GENESPACE", upgrade = F)
library(GENESPACE)

This will install v1.1.9

jtlovell commented 1 year ago

v1.1.10 is pushed to master and built as the latest release. I'm gonna close this issue, since the new release should address it. If this isn't the case, please re-open. Thanks!

bbista commented 1 year ago

Hello,

It seems the original issue has not been resolved. I am still stuck at step 3.

`

############################

  1. Combining and annotating the blast files with orthogroup info ...

    Chunk 1 / 1 (12:14:13 AM) ...

    ...human v. human: total hits = 221061, same og = 64747 ...chicken v. chicken: total hits = 190673, same og = 58479 ...human v. chicken: total hits = 224945, same og = 16863 ############## Generating dotplots for all hits ... `

jtlovell commented 1 year ago

what's your system and are you seeing any /dotplots generated? this should just take a minute or two at most.

bbista commented 1 year ago

The /dotplots directory is empty. I am running this on a server running Enterprise Linux 9

jtlovell commented 1 year ago

hmm. can you post your full script? I'm not able to re-create this issue on my linux hpc or my mac.

bbista commented 1 year ago

library(GENESPACE)

###############################################

-- change paths to those valid on your system

genomeRepo <- "/work/LAS/nvalenzu-lab/bbista/Example_genespace" wd <- "/work/LAS/nvalenzu-lab/bbista/genespace" path2mcscanx <- "/work/LAS/nvalenzu-lab/bbista/Software/MCScanX/" ###############################################

-- download raw data from NCBI for human and chicken genomes

dir.create(genomeRepo) rawFiles <- download_exampleData(filepath = genomeRepo)

-- parse the annotations to fastas with headers that match a gene bed file

parsedPaths <- parse_annotations( rawGenomeRepo = genomeRepo, genomeDirs = c("human", "chicken"), genomeIDs = c("human", "chicken"), presets = "ncbi", genespaceWd = wd)

-- initalize the run and QC the inputs

gpar <- init_genespace( wd = wd, path2mcscanx = path2mcscanx)

-- accomplish the run

out <- run_genespace(gpar)

bbista commented 1 year ago

I should be identical to the example one in the readme.

jtlovell commented 1 year ago

I'm not sure ... I just ran this exact code on both machines from scratch and it ran through without any trouble. Maybe reinstall your conda env and GENESPACE? run_test13Apr.pdf

kaede0e commented 1 month ago

Hi @jtlovell,

I am encountering the exact same issue as seen here, and I was wondering if you could help me troubleshoot. Due to my Orthofinder v2.5.4 installed on my Linux cluster about 2 years ago or so with the StdEnv/2020, I am using R v4.1, and installed other dependent R packages to be compatible with that version. MCScanX was installed from the Github master branch just this month. GENESPACE version is 1.3.1. My other installed program versions are: gcc/9.3.0 python/3.8 java/17.0.2 diamond/2.0.15. Do you know if this issue of being stuck at "generating dotplots" is some kind of installation issue? Is there a solution to it?

Thank you, Kaede

Update: I downloaded everything in R v4.3 instead, and it now works with no problems. I don't know what the issue was in the previous installation...