nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

funannotate Bioconda recipe #194

Closed nextgenusfs closed 5 years ago

nextgenusfs commented 6 years ago

I would like to make a bioconda package for funannotate, as it is still difficult to install. It would be great if a conda expert were willing to help out. So far, some dependencies are still missing and would need to be sorted out first:

1. Evidence Modeler is not in bioconda
2. Trinity and PASA are currently Linux only (not available for macOS) -- seems to be a compiler incompatibility?
3. Augustus via bioconda is not fully functional on macOS -- I use a slightly modified version of augustus v3.2.1
4. funannotate code is currently python2 only (needs to be migrated to py2/3 compatible) -- note this is perhaps not required for a recipe but would help with future compatibility.

Any help/guidance would be appreciated!

hyphaltip commented 6 years ago

We started something similar for FGMP, and the problems with Augustus 3.3 on bioconda are an issue there too.

Maybe we can tweet about this; some of the bioconda devs may be able to contribute suggestions.

Jason Stajich, PhD (email reply to Jon Palmer's post of Jul 6, 2018, 12:11 PM -0700)

kastman commented 6 years ago

Hi @nextgenusfs @hyphaltip ,

I'm just discovering funannotate now and am impressed; thanks for the work!

I helped commit the OSX augustus bioconda build (about 5 months ago) and wasn't aware it was breaking - do you have an issue to point me to for more of a failure log?

I see in reading the notes from your custom macOS augustus build that it appears to be a compiler compatibility problem with bamtools, causing the proteinprofile BUSCO search to fail; is that it? Maybe we could add a test directly to the augustus recipe to catch that if handling both bamtools and augustus installs with conda doesn't catch it.

As for trinity and pasa, I'm not sure what's going on. Bioconda has been in the process of rebuilding all their recipes for the last 3 weeks to use the newer conda build system, and things are breaking a bit - for example there is a pasa recipe (that appears to support OSX) but it doesn't show up on the bioconda search page. Might be worth focusing on other development for a few weeks to let the dust settle on this one, and then swing back around to it.

I'm happy to help out, as a widespread euk annotation pipeline is definitely needed. Thanks!

kastman commented 6 years ago

PS - As I'm literally brand new to funannotate, I'm currently stepping through the tutorial and will use the bioconda bamtools and augustus installs; most likely I'll find the error on my own. :) But just in case I don't, send a link if you have it. Thanks,

nextgenusfs commented 6 years ago

Hi @kastman, thanks for the help. Yes, I could only ever get the proteinprofile to work on OSX with v3.2.1; more recent versions seem to compile, but fail at runtime. This means that it will fail during BUSCO runs (as that is what uses --proteinprofile). The other problem that you highlighted was getting Bamtools properly linked in the compilation of filterBam and bam2hints, which are both used by BRAKER for training Augustus. It seems the bioconda version is correctly compiled/linked to bamtools, as the BRAKER check passes. The check that funannotate runs right now is here:

import os
import subprocess
import sys

# `which`, `log` and `parentdir` are defined elsewhere in the funannotate library module


def checkAugustusFunc(base):
    '''
    function to try to test Augustus installation is working, note segmentation fault still results in a pass
    '''
    brakerpass = 0
    buscopass = 0
    # universal_newlines=True returns text rather than bytes on both python2 and python3
    version = subprocess.Popen(['augustus', '--version'], stderr=subprocess.STDOUT, stdout=subprocess.PIPE, universal_newlines=True).communicate()[0].rstrip()
    version = version.split(' is ')[0]
    # BRAKER training needs both bamtools-linked helper programs
    bam2hints = which(os.path.join(base, 'bin', 'bam2hints'))
    filterBam = which(os.path.join(base, 'bin', 'filterBam'))
    if bam2hints and filterBam:
        brakerpass = 1
    # BUSCO relies on --proteinprofile, so run a small profile search as a test
    model = os.path.join(parentdir, 'lib', 'EOG092C0B3U.prfl')
    if not os.path.isfile(model):
        log.error("Testing Augustus Error: installation seems wrong, can't find prfl model")
        sys.exit(1)
    profile = '--proteinprofile='+model
    # strip() both ends here; the original called strip() separately and discarded the result
    proteinprofile = subprocess.Popen(['augustus', '--species=anidulans', profile, os.path.join(parentdir, 'lib', 'busco_test.fa')], stderr=subprocess.STDOUT, stdout=subprocess.PIPE, universal_newlines=True).communicate()[0].strip()
    if proteinprofile == '':
        buscopass = 0
    elif 'augustus: ERROR' not in proteinprofile:
        buscopass = 1
    return (version, brakerpass, buscopass)

So here is what happens with bioconda augustus v3.3:

$ augustus --species=anidulans --proteinprofile=../lib/EOG092C0B3U.prfl ../lib/busco_test.fa 
# This output was generated with AUGUSTUS (version 3.3).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer and L. Romoth.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Initialising the parameters using config directory /Users/jon/miniconda2/config/ ...

augustus: ERROR
    PP::Profile: Error parsing pattern file"../lib/EOG092C0B3U.prfl", line 8.

This is what should happen if it is compiled correctly:

$ augustus --species=anidulans --proteinprofile=../lib/EOG092C0B3U.prfl ../lib/busco_test.fa 
# This output was generated with AUGUSTUS (version 3.2.1).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer and L. Romoth.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Initialising the parameters using config directory /Users/jon/software/augustus/config/ ...
Warning: Block unknown_E is not significant enough, removed from profile.
Warning: Block unknown_F is not significant enough, removed from profile.
Warning: Block unknown_H is not significant enough, removed from profile.
Warning: Block unknown_AC is not significant enough, removed from profile.
# Using protein profile unknown
# --[0..117]--> unknown_A (9) <--[2..25]--> unknown_B (27) <--[1..16]--> unknown_C (8) <--[0..1]--> unknown_D (15) <--[18..100]--> unknown_G (19) <--[8..25]--> unknown_I (32) <--[0..1]--> unknown_J (33) <--[1..16]--> unknown_K (38) <--[1..3]--> unknown_L (14) <--[0..5]--> unknown_M (59) <--[0..19]--> unknown_N (23) <--[0..145]--> unknown_O (23) <--[3..18]--> unknown_P (27) <--[1..44]--> unknown_Q (12) <--[10..82]--> unknown_R (13) <--[10..106]--> unknown_S (18) <--[1..11]--> unknown_T (32) <--[2..5]--> unknown_U (12) <--[0..1]--> unknown_V (32) <--[7..18]--> unknown_W (13) <--[3..8]--> unknown_X (87) <--[0..1]--> unknown_Y (12) <--[2..33]--> unknown_Z (40) <--[0..11]--> unknown_AA (16) <--[3..30]--> unknown_AB (19) <--[8..47]--> unknown_AD (23) <--[0..1]--> unknown_AE (13) <--[0..38]--
# anidulans version. Using default transition matrix.
# Looks like ../lib/busco_test.fa is in fasta format.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 3801, name = example) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 1 on both strands
# start gene g1
example AUGUSTUS    gene    788 3077    0.81    +   .   g1
example AUGUSTUS    transcript  788 3077    0.81    +   .   g1.t1
example AUGUSTUS    start_codon 788 790 .   +   0   transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS    CDS 788 996 1   +   0   transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS    CDS 1049    3077    0.81    +   1   transcript_id "g1.t1"; gene_id "g1";
example AUGUSTUS    stop_codon  3075    3077    .   +   0   transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MDISDLIEPPQKRLKTEDISSADEVVLPAGGITPQTDNEIDEQLSKEIEVGITEFVSADNEGFAGILKKRYTDFLVNE
# ILPSGKVLHLTNTTAPNTNDEATPVQADKKPAEDKPKEPETPAEKLPAPVEFQLAEEDEALLDTLFGTQNTKKIVALHKKALANPKTKPSDLGRLNTV
# VVNDRDQRIKMHQAIRRIFNSQIESSTDSEGMMVISVAANRNKKNPQGGGGGRERPRVNWDELGGQYLHFTIYKENKDTMEVISFIARQLKMNPKSFQ
# FAGTKDRRGVTVQRACAYRLQADRLAKLNRTLRNAVVGDFEYQPHGLELGDLYGNEFVVTLRECEVPGINIQDPASAVAKTKELVNTSLKNLYQRGYF
# NYYGLQRFGSFATRTDTVGVKILQDDFKGACDAILDYSPHILAAAQAELGQGEGEGATPTNISSEDKARALAIHIFRTTDRVTDALEKMPRKFSAESN
# IIRHLGRSKNDYLGALQTIPRNLRLMYVHAYQSLVWNLAVGERWRLYGDRVVEGDLVLIHEHRDKDGNSSYTTPAPGAGASGETTTIDADGEIIIVPQ
# EHDSAFAVEDTFTRARALTAAEANSGLYSIFDIVLPLPGFDVLYPPNKMTDFYKEFMGSSRGGGLDPFNMRRKWKDASLSGSYRKVLSRMGRDYSVDV
# VLYSRDEEQFVRTDLENLTLKTRDGGDVDLEKKEGKSEGDKLAVVLKFQLGSSQYATMALRELMRGKVKAYKPDFGGGR]
# end gene g1
###
# command line:
# augustus --species=anidulans --proteinprofile=../lib/EOG092C0B3U.prfl ../lib/busco_test.fa

Those test files are located in funannotate distribution here: https://github.com/nextgenusfs/funannotate/tree/master/lib
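
The two runs above differ in whether `augustus: ERROR` appears in the output, and the docstring of `checkAugustusFunc` notes that a segmentation fault still counts as a pass. A minimal Python 3 sketch of a stricter check, which also looks at the exit status, might look like the following. The function name `classify_augustus_run` and its return labels are made up for illustration and are not part of funannotate; the sketch relies on `subprocess` reporting a process killed by a signal (such as SIGSEGV) as a negative return code, which holds on POSIX systems only.

```python
import subprocess


def classify_augustus_run(cmd):
    """Run `cmd` (a list of arguments) and classify the outcome.

    Returns one of 'crash', 'error', 'empty' or 'ok' (hypothetical labels).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    output = proc.communicate()[0].strip()
    # On POSIX, a negative return code means the process was killed by a
    # signal, e.g. -11 for a segmentation fault.
    if proc.returncode < 0:
        return 'crash'
    # Same string test as the existing funannotate check.
    if 'augustus: ERROR' in output:
        return 'error'
    if output == '':
        return 'empty'
    return 'ok'
```

Called with the same augustus command line as in the runs above, any result other than `'ok'` would mean the BUSCO check should be treated as failed.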

kastman commented 6 years ago

I verified your error with bioconda augustus 3.2.3 and 3.3, and read through what you've tried already in #3. Do you know what it is about the newer gcc that fixes it? If we knew what changed, we could possibly add a patch option for the older gcc. Compiling with different versions of gcc is one of the few limitations of conda-forge/bioconda, which is quite strict about this (to make sure everything is compatible), so just bumping the gcc version isn't easy or even possible, though I think the recent rebuild/overhaul is doing just that. I somehow doubt that adding 3.2.1 to bioconda would help if it's also still compiled with the older gcc.

I know you haven't had much luck getting feedback, but maybe @mariostanke would be able to weigh in (not sure if he's getting github notifications though)?

Sorry to overtake this ticket - maybe we should move this conversation back to #3?

nextgenusfs commented 6 years ago

Yes, it would certainly be best if this were fixed upstream. It's possible that adding 3.2.1 might actually help, as the --proteinprofile part of Augustus seems to compile properly on OSX using GCC (I've tried gcc-5 through gcc-7, I think; I don't remember whether gcc-4.8 had errors or not). The changes that I made in https://github.com/nextgenusfs/augustus were just in the Makefiles, as they largely fix the bamtools compilation problems with filterbam and bam2hints -- quite likely there is/was a different/better way to fix those errors. But if there were a functional version on bioconda, it seems like it would be simple enough for tools that require Augustus to specify that version when the OS is macOS?

Is there a way to test a local bioconda build with 3.2.1 using the existing patches/recipes for 3.2.3 and 3.3 on OSX? I also saw there is a new release at http://bioinf.uni-greifswald.de/augustus/binaries/, but I have not tested it yet.

kastman commented 6 years ago

Augustus was on the bioconda blacklist because the author moved the tarball into an "old" directory and broke the link -- I'm fixing that and will test 3.2.1 as well as update for the recent 3.3.1 release.

You can test a local build by forking bioconda/bioconda-recipes, updating the version and sha256 hash in recipes/augustus, and running circleci build. I'm going to give that a shot and let you know. I suspect it's a compiler problem (still running 4.8) so this likely won't work, but worth a shot?
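
For the "update the version and sha256 hash" step, the hash of the downloaded tarball can be computed with Python's standard `hashlib`. This is only a sketch; the helper name `sha256_of_file` and the tarball filename in the usage comment are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # Read fixed-size chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage (the filename is an assumption):
# print(sha256_of_file('augustus-3.2.1.tar.gz'))
```

The printed value is what would go into the recipe's sha256 field alongside the new version number.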

nextgenusfs commented 6 years ago

So after you build with circleci build, can you then test those packages? I've gotten one of my other software packages, AMPtk, onto bioconda, but I'm still very much a noob when it comes to the details of the build and the testing. Either way, if you can update this thread with the results of the bioconda builds for 3.2.1 and 3.3.1, that would be helpful.

kastman commented 6 years ago

If you want to test locally, I usually build using conda build directly, and then conda install --use-local, which uses the new package you just built (using circleci actually builds in an image which is then tossed).
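
The two local steps described above (conda build, then conda install --use-local) could be scripted roughly as follows. This is a sketch only: the helper names `local_build_commands` and `run_local_build` are made up, and the recipe path and package name passed in are assumptions to adjust for your checkout. With `dry_run=True` nothing is executed and the commands are just printed.

```python
import subprocess


def local_build_commands(recipe_dir, package):
    """Return the two conda command lines for a local build and install."""
    return [
        ['conda', 'build', recipe_dir],
        ['conda', 'install', '--use-local', package],
    ]


def run_local_build(recipe_dir, package, dry_run=True):
    """Print the commands (dry run) or execute them in order, stopping on failure."""
    commands = local_build_commands(recipe_dir, package)
    for cmd in commands:
        if dry_run:
            print(' '.join(cmd))
        else:
            # check_call raises CalledProcessError if a step exits non-zero.
            subprocess.check_call(cmd)
    return commands


# Example: run_local_build('recipes/augustus', 'augustus')
```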

I'm sure the new compiler changes in conda-forge/bioconda will be useful, but there are still some kinks: now it seems there's a problem finding some of the perl modules (yaml). I'll keep you posted, but I'm still working on getting it off the blacklist.

I also noticed the specialized 3.2.1 PR from @camillescott - pulling from Jon's fixed tarball isn't the way bioconda is intended to be used, but may not be the worst idea? Ideally I'd pull in your specific patches and apply them to the upstream tarball.

Regardless, we have to clear it off the blacklist first. I'll let you know how that goes or you can follow along with the PR.

Thanks!

(also adding @lizlandis to this conversation as well so she can see the progress).

hmontenegro commented 6 years ago

How is this moving along?

I think it will be difficult to have a bioconda recipe for funannotate and all its dependencies: when I tried to install as many dependencies as possible from bioconda, there were some packages which conflicted with each other. For example, Augustus installed, but then did not run due to wrong Boost version. Depending on the combination of packages, a different Numpy would be installed, and then loading Numpy segfaulted. And some other similar small problems.

Maybe a better solution would be to create a funannotate Anaconda channel, where developers / contributors would have more control over versions installed, and over applied patches.

I just started learning how to build conda recipes, and I don't have access to Macs, only Linux. That said, I could help creating / testing a conda recipe.

nextgenusfs commented 6 years ago

Difficult = yes. However, that is the whole reason to get a conda recipe together - to avoid the dependency nightmare. I don't have much experience with conda and no experience with setting up "channels". Shouldn't we be okay if we specify package version numbers?

I know that since bioconda updated recently, a lot of packages have had problems and need to be rebuilt.

We could forge ahead with a linux-only recipe -- Augustus on Mac is still not solved (still issues with compilation, I think). I wrote an EvidenceModeler recipe here https://github.com/bioconda/bioconda-recipes/pull/10389 but it hasn't been merged yet.

kastman commented 6 years ago

Sorry to have dropped this - I didn't make any progress with augustus myself, but it was taken off the blacklist and fixed in August in this PR. Looking now to see if a backported 3.2.1 will fix the compilation.

Certainly a linux-only recipe is better than nothing and the mac problems shouldn't hold it up, though that's where most of my time is spent and where I'm most motivated to keep up the good fight. :)

nextgenusfs commented 6 years ago

Hi @kastman, I was just running Augustus v3.3 on Linux and noticed that the bioconda install "breaks" funannotate. In funannotate, I use $AUGUSTUS_CONFIG_PATH to get the augustus "base directory", which I then use to locate the scripts folder, so the necessary accessory scripts are called by full path and the user does not also have to put the scripts folder in their PATH. The bioconda install copies these scripts over to /bin so they are in $PATH, but then funannotate cannot find them and the pipeline crashes. The workaround I used after the bioconda augustus install was to symlink a /scripts/ directory to the /bin/ directory, so that the augustus directory tree is intact. I totally understand why the scripts would be copied over to bin; however, that isn't how most manual augustus installs "look". For other packages that also use a lot of "accessory" scripts, i.e. pasa and trinity, the bioconda recipes put the entire install folder into /conda/opt, which keeps the directory tree intact -- that would be preferable to me, because if I change the way funannotate looks for these scripts, it will likely break pipelines where augustus is manually installed.
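
To make the lookup described above concrete, here is a minimal Python 3 sketch, not the actual funannotate code. It derives the base directory from the config path, looks in the sibling scripts folder first, and only then falls back to searching the PATH, so both a manual install and a bin-only conda layout would resolve. The function name `find_augustus_script` and the `search_path` parameter are hypothetical.

```python
import os
import shutil


def find_augustus_script(name, config_path, search_path=None):
    """Locate an Augustus accessory script.

    `config_path` is the value of $AUGUSTUS_CONFIG_PATH, e.g. '/some/augustus/config/'.
    `search_path` is an optional PATH-style string; None means use $PATH.
    Returns the full path, or None if the script cannot be found.
    """
    # normpath drops a trailing slash so dirname yields the base directory.
    base = os.path.dirname(os.path.normpath(config_path))
    candidate = os.path.join(base, 'scripts', name)
    if os.path.isfile(candidate):
        return candidate
    # Fallback for layouts where the scripts were copied into a bin directory.
    return shutil.which(name, path=search_path)


# Example (script name is one shipped with Augustus):
# find_augustus_script('gff2gbSmallDNA.pl', os.environ['AUGUSTUS_CONFIG_PATH'])
```

Checking the scripts folder first keeps the behaviour unchanged for manual installs; the PATH fallback only matters when that folder is missing.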

kastman commented 6 years ago

Hi @nextgenusfs -- yep, copying the install to /conda/opt sounds like a reasonable step, and it sounds like there's precedent; I'll take a deeper look when I get a second. Is it possible to adjust the $AUGUSTUS_CONFIG_PATH to point to the tree correctly too?

Over the weekend I tried to rebuild 3.2.1 now that augustus is off the blacklist, but the patch didn't apply properly and I ran out of time. However, I noticed that the big conda update / rebuild is using GCC7, which might fix the compilation problem that 3.2.3 had if it gets rebuilt.

I'll let you know as I figure out more, but that certainly sounds reasonable.

nextgenusfs commented 6 years ago

Yes, I think we can set the ENV variables the way it is done for pasa/trinity, i.e. in the build.sh, see https://github.com/bioconda/bioconda-recipes/blob/master/recipes/pasa/build.sh

readonly PASAHOME=${PREFIX}/opt/${PKG_NAME}-${PKG_VERSION}

mkdir -p ${PASAHOME}
cp -Rp bin Launch_PASA_pipeline.pl misc_utilities pasa_conf PasaWeb PasaWeb.conf PerlLib PyLib run_PasaWeb.pl SAMPLE_HOOKS schema scripts ${PASAHOME}

mkdir -p ${PREFIX}/etc/conda/activate.d/
echo "export PASAHOME=${PASAHOME}" > ${PREFIX}/etc/conda/activate.d/${PKG_NAME}-${PKG_VERSION}.sh

mkdir -p ${PREFIX}/etc/conda/deactivate.d/
echo "unset PASAHOME" > ${PREFIX}/etc/conda/deactivate.d/${PKG_NAME}-${PKG_VERSION}.sh

So this basically copies the PASA directory tree over to /opt/pasa-2.3.3. For augustus it would be similar, I think: do the same install as before, copy the necessary files over to /opt, and then link augustus into /bin, so that the entire folder structure is the same as a manual install, with the executable symlinked into the conda $PATH.

Here's a quick stab at it (untested):

#!/bin/bash

set -x -e

export INCLUDE_PATH="${PREFIX}/include"
export LIBRARY_PATH="${PREFIX}/lib"
export LD_LIBRARY_PATH="${PREFIX}/lib"
export BOOST_INCLUDE_DIR=${PREFIX}/include
export BOOST_LIBRARY_DIR=${PREFIX}/lib

#export CXXFLAGS=" -std=c++11 -stdlib=libstdc++ -stdlib=libc++ -DUSE_BOOST -I${BOOST_INCLUDE_DIR} -L${BOOST_LIBRARY_DIR}"
export CXXFLAGS=" -std=c++11  -DUSE_BOOST -I${BOOST_INCLUDE_DIR} -L${BOOST_LIBRARY_DIR}"
export LDFLAGS="-L${BOOST_LIBRARY_DIR}"

#setup directories
AUG_HOME=$PREFIX/opt/augustus-$PKG_VERSION
mkdir -p $PREFIX/bin
mkdir -p $AUG_HOME

## Make the software

sed -i.bak -e 's/^CC *=/CXX=/' -e 's/\$(CC)/$(CXX)/g' auxprogs/homGeneMapping/src/Makefile
sed -i.bak -e 's/^CC *=/CXX=/' -e 's/\$(CC)/$(CXX)/g' auxprogs/joingenes/Makefile
# TODO: don't set CC/CXX here when switching to newer compilers
CC=gcc
CXX=g++
if [ "$(uname)" == Darwin ] ; then
  # SQLITE disabled due to compile issue, see: https://svn.boost.org/trac10/ticket/13501
  make CC="${CC}" CXX="${CXX}" COMPGENPRED=true
else
  make CC="${CC}" CXX="${CXX}" COMPGENPRED=true SQLITE=true
fi

## Build Perl

mkdir perl-build
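# Copy (rather than move) the Perl scripts so scripts/ is still complete when it is copied to ${AUG_HOME} below
find scripts -name "*.pl" | xargs -I {} cp {} perl-build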
cd perl-build
cp ${RECIPE_DIR}/Build.PL ./
perl ./Build.PL
perl ./Build manifest
perl ./Build install --installdirs site

cd ..

## End build perl

cp -Rp scripts config ${AUG_HOME}
mv bin/* $PREFIX/bin/ 

#Add some options to activate

mkdir -p $PREFIX/etc/conda/activate.d/
echo "export AUGUSTUS_CONFIG_PATH=${AUG_HOME}/config/" > $PREFIX/etc/conda/activate.d/augustus-confdir.sh
chmod a+x $PREFIX/etc/conda/activate.d/augustus-confdir.sh

mkdir -p $PREFIX/etc/conda/deactivate.d/
echo "unset AUGUSTUS_CONFIG_PATH" > $PREFIX/etc/conda/deactivate.d/augustus-confdir.sh
chmod a+x $PREFIX/etc/conda/deactivate.d/augustus-confdir.sh

chmod u+rwx $PREFIX/bin/*