nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
320 stars 85 forks source link

strain level phylogeny #160

Closed AnotherSimon closed 5 years ago

AnotherSimon commented 6 years ago

For my current project I'm working on a comparison of various strains within the same species. I used the --strain option to annotate each of the genomes to be compared.

When running funannotate compare most of the outputs are unusable because the naming stops at the species level and I suspect it is also messing with some of the output files due to their naming scheme (for example: ./WorkingDir/funannotate_compare/go_terms/Genus_species.txt)

Update: In a likely related matter, I'm getting the following error for ProteinOrtho5 step:

... [Mar 27 04:36 PM]: Summarizing fungal transcription factors [Mar 27 04:36 PM]: Running GO enrichment for each genome [Mar 27 04:36 PM]: Running orthologous clustering tool, ProteinOrtho5. This may take awhile... Traceback (most recent call last): File "/home/simon/software/funannotate/bin/funannotate-compare.py", line 814, in SeqTranscripts = SeqIO.index(AllTrans, 'fasta') File "/home/simon/.local/lib/python2.7/site-packages/Bio/SeqIO/init.py", line 885, in index key_function, repr, "SeqRecord") File "/home/simon/.local/lib/python2.7/site-packages/Bio/File.py", line 306, in init raise ValueError("Duplicate key '%s'" % key) ValueError: Duplicate key 'MyBugStrainNum_000001-T1'

However a search in ./WorkingDir/funannotate_compare/protortho did not yield any duplicate results for "MyBugStrainNum_000001-T1" within any of the human-readable files.

Update2: Also found in funannotate-compare.log:

Step 1 Generating indices Building database for 'My_bug.faa' (16332 sequences) Error: NCBI C++ Exception: "/home/simon/software/ncbi-rmblastn-2.2.28-src/c++/src/objtools/blast/seqdb_writer/build_db.cpp", line 304: Error: ncbi::s_FixBioseqDeltas() - Protein delta sequences are not supported.

I strongly suspect this is because the different strains are re-added as the same species.

nextgenusfs commented 6 years ago

Okay, I think I can add an arbitrary number to genomes having same name - but if you added --strain info it should be using that as well so probably a bug. The other issue might be if the genomes were all annotated with default settings under --name in funannotate predict, i.e. every model has FUN_ as the locus tag. But I don't think that is happening as there should be a warning to the user.

It seems that perhaps proteinortho is trying to use rmbastn instead of blastp, not sure how that is happening? Can you try to run the proteinortho test data (I think it is packaged with some test data but might depend on how you installed) and make sure it is functioning?

Sorry about all the bugs - I'm grateful you keep finding them - and I will be a little bit slower than usual for next week or so on getting things fixed.

Jon

AnotherSimon commented 6 years ago

Glad to help you idiot-proof the code and good to hear you're not tiered of fixing them.

I reran funannotate annotate again and specified the --strain option. Seems to have fixed the naming issue for funannotate compare but I still had to manually rename the locus tags, for some reason --rename didn't work. (I also did the manual edit the first time around.)

Previously I just assumed that the strain info would be inherited from funannotate train and/or funannotate predict just like --species is. (If I remember correctly the suggested command generated at the end did not include this info.)

As for proteinortho5, test seems successful:

$ make test g++ -Wall -O2 -std=gnu++11 -Wno-unused-result -o proteinortho5_clustering proteinortho5_clustering.cpp ./proteinortho5.pl -project=test -synteny -singles test/*.faa


Proteinortho with PoFF version 5.16b - An orthology detection tool


Detected 20 available CPU threads, Detected NCBI BLAST version 2.2.27+ Checking input files Checking test/C.faa... test/C.faa 109 genes ok Checking test/E.faa... test/E.faa 72 genes ok Checking test/L.faa... test/L.faa 40 genes ok Checking test/M.faa... test/M.faa 40 genes ok

Step 1 Generating indices Building database for 'test/C.faa' (109 sequences) Building database for 'test/E.faa' (72 sequences) ... Cleaning edge list... Removed 0 / 104 edges Done. [OUTPUT] -> written to test.poff-graph

All finished. rm test.* Test okay

AnotherSimon commented 6 years ago

Was debating listing the following issues as a separate thread but the title of this one seems to appropriate not to use. Alright what I've found so far, in order of folder names:

AnotherSimon commented 6 years ago

Hi John,

Any progress on the proteinortho bug? Is it reproducible on your end? Any info you need from me?

nextgenusfs commented 6 years ago

I’m away from my computer for a few days but I did add the explicit path to blastp to the proteinortho command, is it still trying to use rmblast? I thought I fixed the naming issue as well but can’t remember for sure off the top of my head.

AnotherSimon commented 6 years ago

ok, I'll update my local copy to current master branch and try again (or is there a specific commit that you were thinking of?)

nextgenusfs commented 6 years ago

I think the blast path was awhile ago https://github.com/nextgenusfs/funannotate/commit/f0b37df5edadfd4980114c451f17f02c80be9d2d and I may not have fixed the incremental naming of files in compare (I’m on my phone so hard to look through the commits). I guess if you can get the latest version and let me know if problem still exists I can try to address this week.

AnotherSimon commented 6 years ago

When the bug occurred I was running 1.3.0-beta but that branch seems to have been pulled.

Edit: nvmnd, the current version is also 1.3.0 but this is not reflected in the README.md

nextgenusfs commented 6 years ago

I think the more recent version should display the version and the latest commit if you are running a git clone version. The master repo should be stable as I was going to cut a new release 1.3.0 shortly.

AnotherSimon commented 6 years ago

yup, it's showing 1.3.0-c0615fb now

AnotherSimon commented 6 years ago

Results look the same unfortunately, could it be a consequence of order in PATH? My system has an independent installation of the blast suit of tools (v.2.3.0+, can be located with the which blastn command). However, after I add the paths of all the dependencies to PATH (as such: PATH=/list/of:/paths/to/be/added:$PATH), the same command points to an alternative build of blastn/blastp that came packaged with rmblast:


Proteinortho with PoFF version 5.16b - An orthology detection tool


Using 20 CPU threads, Detected NCBI BLAST version 2.2.27+ Checking input files Checking Bug1.faa... Bug1.faa 9432 genes ok Checking Bug2.faa... Bug2.faa 7793 genes ok ... Step 1 Generating indices Building database for 'Bug1.faa' (9432 sequences) Error: NCBI C++ Exception: "/home/simon/software/ncbi-rmblastn-2.2.28-src/c++/src/objtools/blast/seqdb_writer/build_db.cpp", line 304: Error: ncbi::s_FixBioseqDeltas() - Protein delta sequences are not supported. ...

nextgenusfs commented 6 years ago

I don’t think there is suppose to be a blastp executable distributes with rmblast? There should just be the rmblastn executable? I would just delete the rmblast associated blastp symlink/executable.

AnotherSimon commented 6 years ago

Ok, moving rmblast to the end of $PATH did the trick for me, funannotate will go back to using the "normal" blastp executable. Not sure how to prevent other users from experiencing this bug.

I have hit a new snag though after protortho does it's work. From StdOut:

... [01:57 PM]: Running orthologous clustering tool, ProteinOrtho5. This may take awhile... [06:13 PM]: Calculating dN/dS ratios for each ortholog group, 5438 orthologous groups Progress: 0.00% Progress: 0.72% ... Progress: 99.01% Progress: 99.78% [06:18 PM]: Compiling all annotations for each genome [06:19 PM]: Inferring phylogeny using RAxML [06:19 PM]: Found 97 single copy BUSCO orthologs, will use all to infer phylogeny load obo file /my/FUNANNOTATE_DB/path/funannotate_db/go.obo /my/FUNANNOTATE_DB/path/funannotate_db/go.obo: fmt(1.2) rel(2018-02-20) 47,177 GO Terms Traceback (most recent call last): File "/home/simon/software/funannotate/bin/funannotate-compare.py", line 1105, in lib.ortho2phylogeny(folder, sco_final, args.num_orthos, busco, args.cpus, args.bootstrap, phylogeny, outgroup, outgroup_species, outgroup_name, sc_buscos) File "/home/simon/software/funannotate/lib/library.py", line 5127, in ortho2phylogeny runSubprocess2(cmd, '.', log, os.path.join(tmpdir,'phylogeny.mafft.fa')) File "/home/simon/software/funannotate/lib/library.py", line 407, in runSubprocess2 proc = subprocess.Popen(cmd, cwd=dir, stdout=out, stderr=subprocess.PIPE) File "/home/ppa/software/lib/python2.7/subprocess.py", line 710, in init errread, errwrite) File "/home/ppa/software/lib/python2.7/subprocess.py", line 1335, in _execute_child raise child_exception OSError: [Errno 8] Exec format error

I checked the contents of the funannotate_compare/phylogeny folder and indeed the file phylogeny.mafft.fa is empty (0 kb). The two other files (phylogeny.buscos.used.txt and phylogeny.concat.fa) appear to be populated with the kind of data you would expect here.

A few other tidbits found in funannotate-compare.log:

Step 2

Running blast analysis: 0% (0/105)CFastaReader: Hyphens are invalid and will be ignored around line 318 CFastaReader: Hyphens are invalid and will be ignored around line 346 CFastaReader: Hyphens are invalid and will be ignored around line 364 CFastaReader: Hyphens are invalid and will be ignored around line 378 ...

Step 3 Clustering by similarity (Proteinortho mode) Reading funannotate.blast-graph 14 species 66305 paired proteins 393476 bidirectional edges

Clustering: 0.00%

Clustering: 0.01%

Clustering: 99.33% Done
Adding singles... [OUTPUT] -> written to funannotate.poff Writing graph... Cleaning edge list... Removed 0 / 425214 edges Done. [OUTPUT] -> written to funannotate.poff-graph

All finished.

[05/07/18 18:13:50]: There are 10967 entries in the proteinortho output [05/07/18 18:13:51]: There seem to be 97 single copy orthologs [05/07/18 18:13:51]: There are a total of 5438 orthologs groups [05/07/18 18:13:53]: Calculating dN/dS ratios for each ortholog group, 5438 orthologous groups [05/07/18 18:16:19]: Running simple dN/dS estimate [05/07/18 18:18:39]: Compiling all annotations for each genome [05/07/18 18:19:05]: Inferring phylogeny using RAxML [05/07/18 18:19:06]: Found 97 single copy BUSCO orthologs, will use all to infer phylogeny [05/07/18 18:19:07]: mafft --quiet funannotate_compare/phylogeny/phylogeny.concat.fa

Update: It appears to be an issue with mafft. Did you ever encounter any error messages like this?:

$ /home/simon/software/mafft-7.397-without-extensions/binaries/mafft funannotate_compare/phylogeny/phylogeny.concat.fa /home/simon/software/mafft-7.397-without-extensions/binaries/mafft: line 1: .": command not found /home/simon/software/mafft-7.397-without-extensions/binaries/mafft: line 2: syntax error near unexpected token 'newline' /home/simon/software/mafft-7.397-without-extensions/binaries/mafft: line 2: '.\" Author: Kazutaka Katoh kazutaka.katoh@aist.go.jp'

Update2: This bug is a consequence of a faulty mafft binary built without root permission. System-wide install of mafft fixed this part of the issue. (currently rerunning funannotate compare)

AnotherSimon commented 6 years ago

Regarding the GO term issue I mentioned earlier in this thread, any idea what could be going wrong there?

Additionally, could the search for GO enrichment be parallelized? Currently it's rather time consuming and only seems to use one CPU and minimal amounts of memory.

nextgenusfs commented 6 years ago

If there isn't any enrichment, then that file is just the header - so I don't think it is an error. The associations.txt file contains all of the GOs for every genome, while then the population is all of the genes, and it runs each genome in comparison to the population -- you can read more here https://github.com/tanghaibao/goatools. I will look into running the searches in parallel - I think that shouldn't be too difficult.

nextgenusfs commented 6 years ago

Should be spread across multiple CPUs now: https://github.com/nextgenusfs/funannotate/commit/8e2c9842cb871514006765c0e79297824de3574a

AnotherSimon commented 6 years ago

currently on 1.3.2-c32453b but still not running parallel in my environment: one thread runs at 100% of a single CPU and then there are a bunch of idle python threads listed as daughter processes

nextgenusfs commented 6 years ago

Thanks -- sorry about that. Was actually a "tab" error in that script where the files weren't getting added the the multiprocessor list correctly -- amazing that it ran without error on my test set (which is composed of only a handful of gene models so couldn't measure whether it was working or not). Latest commit seems to fix this.

AnotherSimon commented 6 years ago

Yup, seems to be running properly parallel now (results still null for my particular bugs though).

I'm now having some issues getting the combined phylogeny for the detected ortho groups. The log file reads:

... [10:59 AM]: Compiling all annotations for each genome [10:59 AM]: Inferring phylogeny using RAxML [10:59 AM]: 0 single copy BUSCO orthologs found, skipping phylogeny [10:59 AM]: Compressing results to output file: funannotate_compare_noHybrids.tar.gz [11:03 AM]: Funannotate compare completed successfully! load obo file /path/to/funannotate_db/go.obo /path/to/funannotate_db/go.obo: fmt(1.2) rel(2018-02-20) 47,177 GO Terms ...

For my current study I'm working on some aneuploid yeasts so they sometimes have duplicate genes which I figured might be messing with the selection of the ortho groups. However when I compare only the haploid yeasts, the error is the same. I wrote a script to check the gene locus tags in each ortho group and there are plenty that have exactly one gene for each yeast. Further, the summary table shows that BUSCO groups were assigned to at least some of the ortho groups. The only other thing I can think of is that the BUSCO groups were not correctly labeled in my self-added outgroup for this comparison. Any suggetions for debugging?

On a related note, I noticed iqtree is a new option for phylogeny but the log file reads raxml regardless of the method chosen.

nextgenusfs commented 6 years ago

The single copy orthologs are somewhat simply chosen by parsing the funannotate.poff output of proteinortho5 -- basically it just searches for all orthologous groups that have a single protein from each genome -- it then cross lists those results with the busco results trying to make sure these proteins are part of the core eukaryote set and useful for phylogeny. I didn't really intend for this to be a full-out whole genome phylogeny method - more like a preliminary tree - I could relax the requirement that the orthologs have to be busco members.

What are you using for outgroups? The outgroups need to be run with the same busco_db - otherwise they won't match up.

$ funannotate outgroups --show_outgroups
-----------------------------
BUSCO Outgroups:
-----------------------------
laccaria_bicolor.dikarya             aspergillus_nidulans.dikarya         coprinopsis_cinerea.dikarya          
schizosaccharomyces_pombe.dikarya    botrytis_cinerea.dikarya             saccharomyces_cerevisiae.dikarya    

So you probably want to be using the basidiomycota busco_db? That also means you would need to specify the --busco_db to funannotate annotate for each of your genomes as well as run the desired outgroup through funannoate outgroups script using the same db, i.e. funannotate outgroups -i my_outgroup.fasta --species "Awesome species" --busco_db basidiomycota. This will generate the necessary files and should update your DB to now have awesome_species.basidiomycota to the outgroup list.

$ funannotate outgroups --show_buscos
-----------------------------
BUSCO DB tree: (# of models)
-----------------------------
eukaryota (303)
    metazoa (978)
        nematoda (982)
        arthropoda (1066)
            insecta (1658)
            endopterygota (2442)
            hymenoptera (4415)
            diptera (2799)
        vertebrata (2586)
            actinopterygii (4584)
            tetrapoda (3950)
            aves (4915)
            mammalia (4104)
        euarchontoglires (6192)
            laurasiatheria (6253)
    fungi (290)
        dikarya (1312)
            ascomycota (1315)
                pezizomycotina (3156)
                    eurotiomycetes (4046)
                    sordariomycetes (3725)
                    saccharomycetes (1759)
                        saccharomycetales (1711)
            basidiomycota (1335)
        microsporidia (518)
    embryophyta (1440)
    protists (215)
        alveolata_stramenophiles (234)

I have worked with Malassezia before so I did write some diploid/aneuploid aware scripts for phylogeny -- which are at https://github.com/nextgenusfs/phyloma -- the scripts all have menus but there is virtually no docs yet (sorry). Essentially the phyloma busco takes multiple proteomes (or GBK) files and runs busco on them, then parses out the results and builds a concatenated fasta for each genome based on those data. You can then use that to draw a phylogeny using phyloma raxml.

AnotherSimon commented 6 years ago

Thanks for the feedback. It was not clear to me (from the documentation) that you need to specify the desired BUSCO group at the annotation stage already. Makes a lot more sense now. (Instructions on adding outgroups are already good imho.)

A complete and "final" phylogeny may prove illusive but indeed a flag to relax the BUSCO requirement might be worth it from a cost-benefit point of view.

I'll give phyloma a try and let you know how it worked out for my use case.