Closed AnotherSimon closed 5 years ago
Okay, I think I can add an arbitrary number to genomes having same name - but if you added --strain
info it should be using that as well so probably a bug. The other issue might be if the genomes were all annotated with default settings under --name
in funannotate predict, i.e. every model has FUN_
as the locus tag. But I don't think that is happening as there should be a warning to the user.
It seems that perhaps proteinortho is trying to use rmbastn instead of blastp, not sure how that is happening? Can you try to run the proteinortho test data (I think it is packaged with some test data but might depend on how you installed) and make sure it is functioning?
Sorry about all the bugs - I'm grateful you keep finding them - and I will be a little bit slower than usual for next week or so on getting things fixed.
Jon
Glad to help you idiot-proof the code and good to hear you're not tiered of fixing them.
I reran funannotate annotate
again and specified the --strain
option. Seems to have fixed the naming issue for funannotate compare
but I still had to manually rename the locus tags, for some reason --rename
didn't work. (I also did the manual edit the first time around.)
Previously I just assumed that the strain info would be inherited from funannotate train
and/or funannotate predict
just like --species
is. (If I remember correctly the suggested command generated at the end did not include this info.)
As for proteinortho5, test seems successful:
$ make test g++ -Wall -O2 -std=gnu++11 -Wno-unused-result -o proteinortho5_clustering proteinortho5_clustering.cpp ./proteinortho5.pl -project=test -synteny -singles test/*.faa
Proteinortho with PoFF version 5.16b - An orthology detection tool
Detected 20 available CPU threads, Detected NCBI BLAST version 2.2.27+ Checking input files Checking test/C.faa... test/C.faa 109 genes ok Checking test/E.faa... test/E.faa 72 genes ok Checking test/L.faa... test/L.faa 40 genes ok Checking test/M.faa... test/M.faa 40 genes ok
Step 1 Generating indices Building database for 'test/C.faa' (109 sequences) Building database for 'test/E.faa' (72 sequences) ... Cleaning edge list... Removed 0 / 104 edges Done. [OUTPUT] -> written to test.poff-graph
All finished. rm test.* Test okay
Was debating listing the following issues as a separate thread but the title of this one seems to appropriate not to use. Alright what I've found so far, in order of folder names:
MyBug.go.enrichment.txt
files beyond the 11 header lines (see next point); it also appears that the enrichment is not calculated in parallel, is this by necessity or oversightassociations.txt
looks fine (several GO terms found for most genes and locus tags differ for each strain), however funannotate_compare/go_terms/MyBug_strainNum.txt
files only hold gene names, no GO terms; same for funannotate_compare/go_terms/population.txt
Hi John,
Any progress on the proteinortho bug? Is it reproducible on your end? Any info you need from me?
I’m away from my computer for a few days but I did add the explicit path to blastp to the proteinortho command, is it still trying to use rmblast? I thought I fixed the naming issue as well but can’t remember for sure off the top of my head.
ok, I'll update my local copy to current master branch and try again (or is there a specific commit that you were thinking of?)
I think the blast path was awhile ago https://github.com/nextgenusfs/funannotate/commit/f0b37df5edadfd4980114c451f17f02c80be9d2d and I may not have fixed the incremental naming of files in compare (I’m on my phone so hard to look through the commits). I guess if you can get the latest version and let me know if problem still exists I can try to address this week.
When the bug occurred I was running 1.3.0-beta but that branch seems to have been pulled.
Edit: nvmnd, the current version is also 1.3.0 but this is not reflected in the README.md
I think the more recent version should display the version and the latest commit if you are running a git clone version. The master repo should be stable as I was going to cut a new release 1.3.0 shortly.
yup, it's showing 1.3.0-c0615fb
now
Results look the same unfortunately, could it be a consequence of order in PATH? My system has an independent installation of the blast suit of tools (v.2.3.0+, can be located with the which blastn
command). However, after I add the paths of all the dependencies to PATH (as such: PATH=/list/of:/paths/to/be/added:$PATH
), the same command points to an alternative build of blastn/blastp that came packaged with rmblast:
Proteinortho with PoFF version 5.16b - An orthology detection tool
Using 20 CPU threads, Detected NCBI BLAST version 2.2.27+ Checking input files Checking Bug1.faa... Bug1.faa 9432 genes ok Checking Bug2.faa... Bug2.faa 7793 genes ok ... Step 1 Generating indices Building database for 'Bug1.faa' (9432 sequences) Error: NCBI C++ Exception: "/home/simon/software/ncbi-rmblastn-2.2.28-src/c++/src/objtools/blast/seqdb_writer/build_db.cpp", line 304: Error: ncbi::s_FixBioseqDeltas() - Protein delta sequences are not supported. ...
I don’t think there is suppose to be a blastp executable distributes with rmblast? There should just be the rmblastn executable? I would just delete the rmblast associated blastp symlink/executable.
Ok, moving rmblast to the end of $PATH did the trick for me, funannotate will go back to using the "normal" blastp executable. Not sure how to prevent other users from experiencing this bug.
I have hit a new snag though after protortho does it's work. From StdOut:
... [01:57 PM]: Running orthologous clustering tool, ProteinOrtho5. This may take awhile... [06:13 PM]: Calculating dN/dS ratios for each ortholog group, 5438 orthologous groups Progress: 0.00% Progress: 0.72% ... Progress: 99.01% Progress: 99.78% [06:18 PM]: Compiling all annotations for each genome [06:19 PM]: Inferring phylogeny using RAxML [06:19 PM]: Found 97 single copy BUSCO orthologs, will use all to infer phylogeny load obo file /my/FUNANNOTATE_DB/path/funannotate_db/go.obo /my/FUNANNOTATE_DB/path/funannotate_db/go.obo: fmt(1.2) rel(2018-02-20) 47,177 GO Terms Traceback (most recent call last): File "/home/simon/software/funannotate/bin/funannotate-compare.py", line 1105, in
lib.ortho2phylogeny(folder, sco_final, args.num_orthos, busco, args.cpus, args.bootstrap, phylogeny, outgroup, outgroup_species, outgroup_name, sc_buscos) File "/home/simon/software/funannotate/lib/library.py", line 5127, in ortho2phylogeny runSubprocess2(cmd, '.', log, os.path.join(tmpdir,'phylogeny.mafft.fa')) File "/home/simon/software/funannotate/lib/library.py", line 407, in runSubprocess2 proc = subprocess.Popen(cmd, cwd=dir, stdout=out, stderr=subprocess.PIPE) File "/home/ppa/software/lib/python2.7/subprocess.py", line 710, in init errread, errwrite) File "/home/ppa/software/lib/python2.7/subprocess.py", line 1335, in _execute_child raise child_exception OSError: [Errno 8] Exec format error
I checked the contents of the funannotate_compare/phylogeny
folder and indeed the file phylogeny.mafft.fa
is empty (0 kb). The two other files (phylogeny.buscos.used.txt
and phylogeny.concat.fa
) appear to be populated with the kind of data you would expect here.
A few other tidbits found in funannotate-compare.log
:
Step 2
Running blast analysis: 0% (0/105)CFastaReader: Hyphens are invalid and will be ignored around line 318 CFastaReader: Hyphens are invalid and will be ignored around line 346 CFastaReader: Hyphens are invalid and will be ignored around line 364 CFastaReader: Hyphens are invalid and will be ignored around line 378 ...
Step 3 Clustering by similarity (Proteinortho mode) Reading funannotate.blast-graph 14 species 66305 paired proteins 393476 bidirectional edges
Clustering: 0.00%
Clustering: 0.01%
Clustering: 99.33% Done
Adding singles... [OUTPUT] -> written to funannotate.poff Writing graph... Cleaning edge list... Removed 0 / 425214 edges Done. [OUTPUT] -> written to funannotate.poff-graphAll finished.
[05/07/18 18:13:50]: There are 10967 entries in the proteinortho output [05/07/18 18:13:51]: There seem to be 97 single copy orthologs [05/07/18 18:13:51]: There are a total of 5438 orthologs groups [05/07/18 18:13:53]: Calculating dN/dS ratios for each ortholog group, 5438 orthologous groups [05/07/18 18:16:19]: Running simple dN/dS estimate [05/07/18 18:18:39]: Compiling all annotations for each genome [05/07/18 18:19:05]: Inferring phylogeny using RAxML [05/07/18 18:19:06]: Found 97 single copy BUSCO orthologs, will use all to infer phylogeny [05/07/18 18:19:07]: mafft --quiet funannotate_compare/phylogeny/phylogeny.concat.fa
Update: It appears to be an issue with mafft. Did you ever encounter any error messages like this?:
$ /home/simon/software/mafft-7.397-without-extensions/binaries/mafft funannotate_compare/phylogeny/phylogeny.concat.fa /home/simon/software/mafft-7.397-without-extensions/binaries/mafft: line 1: .": command not found /home/simon/software/mafft-7.397-without-extensions/binaries/mafft: line 2: syntax error near unexpected token 'newline' /home/simon/software/mafft-7.397-without-extensions/binaries/mafft: line 2: '.\" Author: Kazutaka Katoh kazutaka.katoh@aist.go.jp'
Update2:
This bug is a consequence of a faulty mafft binary built without root permission. System-wide install of mafft fixed this part of the issue. (currently rerunning funannotate compare
)
Regarding the GO term issue I mentioned earlier in this thread, any idea what could be going wrong there?
Additionally, could the search for GO enrichment be parallelized? Currently it's rather time consuming and only seems to use one CPU and minimal amounts of memory.
If there isn't any enrichment, then that file is just the header - so I don't think it is an error. The associations.txt file contains all of the GOs for every genome, while then the population is all of the genes, and it runs each genome in comparison to the population -- you can read more here https://github.com/tanghaibao/goatools. I will look into running the searches in parallel - I think that shouldn't be too difficult.
Should be spread across multiple CPUs now: https://github.com/nextgenusfs/funannotate/commit/8e2c9842cb871514006765c0e79297824de3574a
currently on 1.3.2-c32453b but still not running parallel in my environment: one thread runs at 100% of a single CPU and then there are a bunch of idle python threads listed as daughter processes
Thanks -- sorry about that. Was actually a "tab" error in that script where the files weren't getting added the the multiprocessor list correctly -- amazing that it ran without error on my test set (which is composed of only a handful of gene models so couldn't measure whether it was working or not). Latest commit seems to fix this.
Yup, seems to be running properly parallel now (results still null for my particular bugs though).
I'm now having some issues getting the combined phylogeny for the detected ortho groups. The log file reads:
... [10:59 AM]: Compiling all annotations for each genome [10:59 AM]: Inferring phylogeny using RAxML [10:59 AM]: 0 single copy BUSCO orthologs found, skipping phylogeny [10:59 AM]: Compressing results to output file: funannotate_compare_noHybrids.tar.gz [11:03 AM]: Funannotate compare completed successfully! load obo file /path/to/funannotate_db/go.obo /path/to/funannotate_db/go.obo: fmt(1.2) rel(2018-02-20) 47,177 GO Terms ...
For my current study I'm working on some aneuploid yeasts so they sometimes have duplicate genes which I figured might be messing with the selection of the ortho groups. However when I compare only the haploid yeasts, the error is the same. I wrote a script to check the gene locus tags in each ortho group and there are plenty that have exactly one gene for each yeast. Further, the summary table shows that BUSCO groups were assigned to at least some of the ortho groups. The only other thing I can think of is that the BUSCO groups were not correctly labeled in my self-added outgroup for this comparison. Any suggetions for debugging?
On a related note, I noticed iqtree is a new option for phylogeny but the log file reads raxml regardless of the method chosen.
The single copy orthologs are somewhat simply chosen by parsing the funannotate.poff output of proteinortho5 -- basically it just searches for all orthologous groups that have a single protein from each genome -- it then cross lists those results with the busco results trying to make sure these proteins are part of the core eukaryote set and useful for phylogeny. I didn't really intend for this to be a full-out whole genome phylogeny method - more like a preliminary tree - I could relax the requirement that the orthologs have to be busco members.
What are you using for outgroups? The outgroups need to be run with the same busco_db - otherwise they won't match up.
$ funannotate outgroups --show_outgroups
-----------------------------
BUSCO Outgroups:
-----------------------------
laccaria_bicolor.dikarya aspergillus_nidulans.dikarya coprinopsis_cinerea.dikarya
schizosaccharomyces_pombe.dikarya botrytis_cinerea.dikarya saccharomyces_cerevisiae.dikarya
So you probably want to be using the basidiomycota busco_db? That also means you would need to specify the --busco_db
to funannotate annotate
for each of your genomes as well as run the desired outgroup through funannoate outgroups
script using the same db, i.e. funannotate outgroups -i my_outgroup.fasta --species "Awesome species" --busco_db basidiomycota
. This will generate the necessary files and should update your DB to now have awesome_species.basidiomycota to the outgroup list.
$ funannotate outgroups --show_buscos
-----------------------------
BUSCO DB tree: (# of models)
-----------------------------
eukaryota (303)
metazoa (978)
nematoda (982)
arthropoda (1066)
insecta (1658)
endopterygota (2442)
hymenoptera (4415)
diptera (2799)
vertebrata (2586)
actinopterygii (4584)
tetrapoda (3950)
aves (4915)
mammalia (4104)
euarchontoglires (6192)
laurasiatheria (6253)
fungi (290)
dikarya (1312)
ascomycota (1315)
pezizomycotina (3156)
eurotiomycetes (4046)
sordariomycetes (3725)
saccharomycetes (1759)
saccharomycetales (1711)
basidiomycota (1335)
microsporidia (518)
embryophyta (1440)
protists (215)
alveolata_stramenophiles (234)
I have worked with Malassezia before so I did write some diploid/aneuploid aware scripts for phylogeny -- which are at https://github.com/nextgenusfs/phyloma -- the scripts all have menus but there is virtually no docs yet (sorry). Essentially the phyloma busco
takes multiple proteomes (or GBK) files and runs busco on them, then parses out the results and builds a concatenated fasta for each genome based on those data. You can then use that to draw a phylogeny using phyloma raxml
.
Thanks for the feedback. It was not clear to me (from the documentation) that you need to specify the desired BUSCO group at the annotation stage already. Makes a lot more sense now. (Instructions on adding outgroups are already good imho.)
A complete and "final" phylogeny may prove illusive but indeed a flag to relax the BUSCO requirement might be worth it from a cost-benefit point of view.
I'll give phyloma a try and let you know how it worked out for my use case.
For my current project I'm working on a comparison of various strains within the same species. I used the
--strain
option to annotate each of the genomes to be compared.When running
funannotate compare
most of the outputs are unusable because the naming stops at the species level and I suspect it is also messing with some of the output files due to their naming scheme (for example:./WorkingDir/funannotate_compare/go_terms/Genus_species.txt
)Update: In a likely related matter, I'm getting the following error for ProteinOrtho5 step:
However a search in
./WorkingDir/funannotate_compare/protortho
did not yield any duplicate results for "MyBugStrainNum_000001-T1" within any of the human-readable files.Update2: Also found in
funannotate-compare.log
:I strongly suspect this is because the different strains are re-added as the same species.