nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
320 stars 85 forks source link

braker error when set up AUGUSTUS_CONFIG_PATH in a writable directory different from the central localization installed in HPC #118

Closed Jennie17 closed 6 years ago

Jennie17 commented 6 years ago

Dear Jon,

Thanks for your suggestion for setting up the funannotate in our HPC environment.

It works well with the parameters of --transcript_evidence and --protein_evidence.

But when --rna_bam is added, I got a problem.

Here are the details: braker.log:

NEXT STEP: check files and settings
NEXT STEP: check options
... options check complete.
NEXT STEP: check fasta headers
fasta headers check complete.
NEXT STEP: create SAM header file /scratch/RDS-FSC-WLR53-RW/braker/gnw64r17/RNA2_LGM_header.sam.
SAM file /scratch/RDS-FSC-WLR53-RW/braker/gnw64r17/RNA2_LGM_header.sam complete.
NEXT STEP: check BAM headers
headers check for BAM file /scratch/WLR53/w64r17Pcont_prediction_tryBam/RNA2_LGM.bam complete.
NEXT STEP: make hints from BAM file /scratch/WLR53/w64r17Pcont_prediction_tryBam/RNA2_LGM.bam
Wait a moment, calculating maximum block size that needs to be allocated... .. done

hints from BAM file /scratch/WLR53/w64r17Pcont_prediction_tryBam/RNA2_LGM.bam added.
NEXT STEP: sort hints
hints sorted.
NEXT STEP: summarize multiple identical hints to one
hints joined.
NEXT STEP: filter introns, find strand and change score to 'mult' entry
strands found and score changed.
hints file complete.
NEXT STEP: execute GeneMark-ET
GeneMark-ET finished.
NEXT STEP: convert GeneMark-ET to real gtf format
GeneMark-ET conversion.
NEXT STEP: create new species gnw64r17
Will create parameters for a EUKARYOTIC species!
creating directory /usr/local/augustus/3.2.1/config/species/gnw64r17/ ...
creating /usr/local/augustus/3.2.1/config/species/gnw64r17/gnw64r17_parameters.cfg ...
failed to execute: No such file or directory
------------------------------------------

The problem looks like arising from that I don't have write access to /usr/local/augustus/3.2.1/config/species

We have centrally installed Augustus in the HPC system and in my PBS script I have done "module load augustus", so "/usr/local/augustus/3.2.1/bin" is already in the $PATH.

Since the problem is the central directory of /usr/local/augustus/3.2.1/config cannot be written, so I copy the whole augustus config folder to my local directory, and add: export AUGUSTUS_CONFIG_PATH="/scratch/WLR53/config"

Then braker seems to assume AUGUSTUS_SCRIPTS_PATH is also in /scratch/WLR53/scripts, of course it can't find it and reported error again.

Then I added all the augustus env vir in my PBS script:

module load augustus export AUGUSTUS_BIN_PATH="/usr/local/augustus/3.2.1/bin" export AUGUSTUS_SCRIPTS_PATH="/usr/local/augustus/3.2.1/scripts" export AUGUSTUS_CONFIG_PATH="/scratch/WLR53/config"

Still it tried to find scripts in scratch/WLR53 rather than in /usr/local/augustus/3.2.1/scripts and failed.

Could you please suggest how to configure augustus config path properly so that it can write in the local directory and still able to find all the scripts centrally installed in the system?

Looking forward to your reply

Best wishes

Jingqin

nextgenusfs commented 6 years ago

Hi @Jennie17. Yes you need to have write access to the Augustus config directory, which is mediated by the AUGUSTUS_CONFIG_PATH environmental variable. In funannotate, I use that variable to find some other accessory scripts Augustus scripts, I'm not sure exactly what BRAKER is doing with the several different environmental variables. Funannotate is passing the --AUGUSTUS_CONFIG_PATH variable to Braker but is not passing the --AUGUSTUS_BIN_PATH or the --AUGUSTUS_SCRIPTS_PATH to Braker, since those are different for you that perhaps is the problem. It seems at least from Braker docs that if the --AUGUSTUS_CONFIG_PATH is passed that it will get the BIN and SCRIPTS directory from that information (I don't think that it is searching the environment for those variables, I think they have to be passed at runtime). So I think you have perhaps 2 options at the moment:

1) if the local directory of Augustus is executable, then you should be able to just pass --AUGUSTUS_CONFIG_PATH /path/to/config to funannotate predict and it should work. I think that likely those scripts/binaries are not executable on your system and that is why it isn't working? Or perhaps you haven't tried that exactly? When you pass the --AUGUSTUS_CONFIG_PATH to funannotate, it should over-rule the environmental variable, but I think this was not happening in older versions of funannotate. I'm releasing a large update likely today which should fix this issue and many others.

2) You can run BRAKER outside of funannotate to train Augustus, like so:

braker.pl --genome=genome.softmasked.fa --bam=RNAseq.bam --softmasking 1 \
   --gff3 --AUGUSTUS_CONFIG_PATH=/scratch/WLR53/config \
   --AUGUSTUS_BIN_PATH=/usr/local/augustus/3.2.1/bin \
   --AUGUSTUS_SCRIPTS_PATH=/usr/local/augustus/3.2.1/scripts \
   --species=your_species_name --cores XX

You can then pass the results to funannotate like:

funannotate predict -i genome.fa -o test -s "Your species" --augustus_species your_species_name \
   --genemark_gtf braker/your_species_name/GeneMark-ET/genemark.gtf \
   --transcript_evidence transcripts.fa --protein_evidence proteins.fa --cpus XX

Note the above will re-run Augustus, which you could avoid by passing the --augustus_gff option and finding the result in the braker folder, however, this won't let funannotate parse the augustus result properly and find high quality models (HiQ). It shouldn't take a huge amount of time to re-run as it is executed in parallel.

Jennie17 commented 6 years ago

Dear Jon,

Thank you very much for your suggestion! It works well!

However, when I compared my two runs from funannotate, I am a bit puzzled by the differences.

Surprisingly, the run using --rna_bam get 4,000 gene models less than the run without --rna_bam using existing species.

It looks like the EVM filtered out a lot of more models when I used --rna_bam as compare to no bam file, just Augustus existing species with pr. and transcript evidence .

There is output saying: Found 0 high quality predictions from Augustus (>90% exon evidence). As I don't supply any PASA or maker output files, I doubt those predictions from bam will be removed?

Could you please suggest what to do in such a situation? Is that the results from no bam looks even better?

The command and log details are as below:

For the run without --rnabam, my command is: funannotate predict -i genome.fasta -o runNobam -s "GNw64R17" \ --isolate w6411 --name GN64Pv3 --augustus_species ustilago_maydis \ --protein_evidence proteins.fasta --transcript_evidence ESTtrinity.fasta --cpus 8 The output log: ... 2017-12-06 11:17:39,900: 82,918 transcripts aligned with GMAP ... 2017-12-06 11:18:01,266: Found 12157 high quality predictions from Augustus (>90% exon evidence) ... 2017-12-06 11:18:39,590: 74,064 total gene models from all sources 2017-12-06 11:40:39,046: 35,067 total gene models from EVM

For the run with --rnabam My command: funannotate-predict.py -i genome.fasta -o runBam -s GNw64R17 --isolate w6411 --name GN64 --rna_bam RNA2_LGM.bam --protein_evidence proteins.fasta --transcript_evidence ESTtrinity.fasta --cpus 8 The output log:

16:24:55,505: Pulling out high quality Augustus predictions 16:24:56,399: Found 0 high quality predictions from Augustus (>90% exon evidence) 16:25:01,441: 80,695 total gene models from all sources <----------------------- more models after adding bam file 16:45:24,222: 31,812 total gene models from EVM <----------------------------- fewer models were kept however

Jennie17 commented 6 years ago

Dear Jon,

Another related issue is that when I use --rna_bam and --augustus_species ustilago_maydis together, it will always run into error.

I am not sure about the reason.

Is funannotate assume that if we use bam file, we won't use any existing trained species for Augustus?

Details are:

My command: funannotate-predict.py -i genome.fasta -o runBam -s GNw64R17 --isolate w6411 --name GN64_ --rna_bam RNA2_LGM.bam --augustus_species ustilago_maydis --protein_evidence proteins.fasta --transcript_evidence ESTtrinity.fasta --cpus 8

Log output info: 2017-12-07 04:10:17,195: perl /usr/local/evidencemodeler/v1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl /scratch/WLR53/w64r17Pcont_prediction_tryBam_ustilago/predict_misc/genemark.gff 2017-12-07 04:10:17,260: GeneMark predictions failed, proceeding with only Augustus 2017-12-07 04:10:17,261: Augustus prediction failed, check logfiles/augustus-parallel.log

there is no augustus-parallel.log, instead there is a braker.log:

NEXT STEP: check files and settings NEXT STEP: check options ERROR:/usr/local/augustus/3.2.1/config/species/ustilago_maydis already exists. Choose another species name, delete this directory or use the existing species with the option --useexisting. ... options check complete.

when I add --useexisting to the command above, it gave error information: there is no such a parameter as --useexisting

nextgenusfs commented 6 years ago

Could you please suggest what to do in such a situation? Is that the results from no bam looks even better?

Augustus is very sensitive to the training parameters used, so using parameters for a given species may work well or could not work very well at all. The quality of the RNAseq data used to generate the BAM files also plays a role here. I don't think I would use only the number of genes as a proxy to how well the annotation worked. If you are able to upgrade funannotate, I'm pushing a new release shortly with lots of updated features (see below) that will likely help you.

Another related issue is that when I use --rna_bam and --augustus_species ustilago_maydis together, it will always run into error.

Yes, in this context the --rna_bam is essentially only being used to train Augustus and GeneMark using Braker. The internal script logic defaults to Braker training which as you found out requires the species name to be unique. I'm in the process of releasing a large update to funannotate, I can incorporate this check and automatically add --useexisting to the braker command if species is found. I don't know if that is the "best" course of action or not. Basically it won't train Augustus if you pass --useexisting, which is likely not what you want if you have good RNAseq data.

The current tip repository of funannotate has several new tools to deal with RNAseq data, it has a funannotate train option which will run automated Trinity/Pasa/Braker training of Augustus and gene mark. These data are then integrated better into funannotate predict. And then it also has funannotate update which will use the RNAseq data to add UTRs and finalize gene models.

Jennie17 commented 6 years ago

Dear Jon,

Many thanks for your explanation.

The process of funannotate train --> funannotate predict --> funannotate update sounds great. I'd like to give it a try.

Can you paste the website of "tip repository of funannotate has several new tools to deal with RNAseq data"? Any installation required or just clone and extract?

nextgenusfs commented 6 years ago

Yes, you can just clone this repository. I'm working on docs now and then will do a release. But I think the current repository is stable.

git clone git://github.com/nextgenusfs/funannotate.git

Then just make sure it is in $PATH, i.e. export PATH=/path/to/funanntoate:$PATH. you will need to download databases (they are smaller and easier to handle in this version). You can do that like this:

funannotate setup -d /path/to/funannotate/database

You should be set to go, there are some additional tools that need to be installed, such as Trinity and PASA. Note You can use funannotate check to help determine what is missing from installation. And it will work best to export a funannotate database variable:

export FUNANNOTATE_DB=/path/to/funannotate/database
export TRINITYHOME=/path/to/trinity
export PASAHOME=/path/to/pasa

Let me know how it goes or if you find any bugs.

Jennie17 commented 6 years ago

Dear Jon,

Thanks for developing this terrific software and keep on resolving the issues.

It's the first time for me to do genome annotation and funannotate makes it so much easier!

May I ask you a couple of questions related to tbl2asn process?

From the output log: 2017-12-09 17:20:44,495: Cleaning models flagged by tbl2asn ... 2017-12-09 17:20:48,903: 21,956 gene models remaining ... 2017-12-09 17:24:27,307: [tbl2asn] This copy of tbl2asn is more than a year old. Please download the current version. ... 2017-12-09 17:24:35,850: Funannotate predict is finished, output files are in the /scratch/WLR53/w64r17Pcont_prediction_tryBam/clcW64R17Pcurated.bamV4/predict_results folder 2017-12-09 17:24:35,850: Note, you should fix any tbl2asn errors now before running functional annotation.

My Questions:

  1. What kind of genes models flagged by tbl2asn was filtered out? From the website, tbl2asn sounds just to prepare sequence for submission to GenBank.

  2. Our current version is tbl2asn/1.0. It still works, but not sure whether it will affect gene prediction results.

  3. The log says "Note, you should fix any tbl2asn errors". Which file shall I look for tbl2asn errors and how shall I fix it manually?

nextgenusfs commented 6 years ago

You should definitely update tbl2asn, I think current version is something like 25.3. The filtering in the funannotate version you are currently running is trying to remove models that don't pass tbl2asn filters. However, some of this is due to gene models being malformed, in the current funannotate version I re-wrote the NCBI tbl generation and filter to remove many of these formatting errors. It should alleviate a lot of these issues. Also, the new version will tell you if you need to manually fix any gene models or not.

Jennie17 commented 6 years ago

Hi Jon,

Thanks for the installation instruction.

Our HPC doesn't have PASA yet. Is the latest version from the site below works for your new funannotate? https://pasapipeline.github.io/#A_obt_pasa

We have both centrally installed funannotate in the HPC system and my private directory installed funannotate (so that database can be written).

Do I have to install the new version on both sites or just my personal directory will do?

nextgenusfs commented 6 years ago

Yes, the latest version of PASA will work, I think it is v2.2 https://github.com/PASApipeline/PASApipeline/releases. Depends on how your $PATH is setup I think, if your local is in front of the HPC, then should work. I don't normally work on an HPC so I don't know exactly what the issues will be.

Jennie17 commented 6 years ago

Hi Jon,

Sure, I will let you know how it goes or if any bugs detected (It may takes a while as there is an issue of setting up/running MySQL server in our HPC).

Just remember another thing needs to confirm with you.

I have a diploid genome assembly from PacBio, i.e., primary contigs and haplotigs.

Currently I just annotate the primary contigs using the aforementioned funannotate predict commands.

I am going to annotate the haplotigs separately using the same funannotate predict commands.

I am not sure whether I shall cat primary and haplotigs and just funannotate predict all in one go by adding "--ploidy 2" to my previous command?

Is "--ploidy 2" the right way to annotate PacBio diploid assembly?

Will the diploid annotation in one go give better results than the 2 runs of separate annotation?

What do you reckon?

Thanks!

Jennie17 commented 6 years ago

Hi Jon,

Just remembered one repeated error when I used funannotate.

Sometimes "predict_results" cannot be created automatically along with "logfiles" and "predict_misc".

I noted that someone else has also raised this issue previously.

Sometimes I just make predict_results manually before running funannotate and it works fine.

nextgenusfs commented 6 years ago

Thanks. yes that is fixed in current repository. Very close to a new release, trying to work out the kinks of a docker container first.

Jennie17 commented 6 years ago

Hi Jon,

Thanks for the update. Could you please kindly help with my following question?

I have a diploid genome assembly from PacBio, i.e., primary contigs and haplotigs.

Currently I just annotate the primary contigs using the aforementioned funannotate predict commands.

I am going to annotate the haplotigs separately using the same funannotate predict commands.

I am not sure whether I shall cat primary and haplotigs and just funannotate predict all in one go by adding "--ploidy 2" to my previous command?

Is "--ploidy 2" the right way to annotate PacBio diploid assembly?

Will the diploid annotation in one go give better results than the 2 runs of separate annotation?

Jennie17 commented 6 years ago

Regarding the new release of funannotate, does it require PASA installed as well?

As I discussed with our HPC administrator, I have to install MySQL in my own desktop.

Is the MySQL component required for PASA very large as our HPC administrator pointed out as below?

"Yes, I think you could do that. But beware that some of these things that want real databases need loads of disk space as well - which your workstation might not have. You could then install PASA in your project area and point it to your database server"

nextgenusfs commented 6 years ago

I think the --ploidy 2 option should work. We did this recently with a an diploid assembly (using falcon unzip assembler) and it seemed to work. There isn't a lot different about the pipeline with increasing the ploidy option, essentially it allows for more evidence mapping and alters how BUSCO results are parsed. But essentially it will predict genes on whatever contigs you pass to the script. I've not done this before myself so don't know exactly what the best options would be. Perhaps predicting just the primary first and then using those models as evidence for the haplotigs would be another option.

I think you may want to wait a few days if you can so you can use the new 1.0.0 version as there are lots of bug fixes, some optimization, and some additional tools. As soon as I can validate this stupid docker container, then I will release. If you can run docker containers, it will contain PASA/MySQL as well, so that could perhaps be an option.

nextgenusfs commented 6 years ago

PASA/Mysql are not explicitly required, although they are required for funannotate train and funannotate update. You can run those steps outside of funannotate though, i.e. on another computer that has PASA/mysql installed. The size of the resulting Mysql database depends on the genome size, I don't know exactly how big the databases get.

Jennie17 commented 6 years ago

Thanks for your immediate response.

The genome size is around 100-200 Mbp, will this size be possible to make Mysql database work on a desktop?

Looking forward to your new release :)

nextgenusfs commented 6 years ago

I think it should but don't know until you try. The docker container would be an easy place to see if it works (if I can get the stupid thing to build/function correctly).

Jennie17 commented 6 years ago

OK, let me know when it is released!

Merry Christmas and Happy New Year!

Best wishes!

Jennie17 commented 6 years ago

BTW, what's the command like for using primary models for haplotigs?

Perhaps predicting just the primary first and then using those models as evidence for the haplotigs would be another option.

nextgenusfs commented 6 years ago

You could use the primary prediction transcripts and the protein models in the predict_results directory as evidence for the haplotigs annotation. you can then also use the same augustus_species parameters and it should work fairly quickly. Merging the gene names won't be as easy and would require some scripting.

Jennie17 commented 6 years ago

"Merging the gene names won't be as easy and would require some scripting."

Do you mean merging the same gene predicted in both primary and haplotigs?

Any tools to recommand or a general idea/key steps as a clue for writing up my own script?

Will you incorporate this function in the near future in funannotate as well?

nextgenusfs commented 6 years ago

Just saying that I don't have anything right now that will take the locus_tag and automatically renumber them. Its trivial to add an option where gene numbering would start, i.e. at 1000 or something instead of FUN_000001, etc. I can add that now I think.

Jennie17 commented 6 years ago

Great if you can add this aspect in the new release.

nextgenusfs commented 6 years ago

Gene number is controlled with the --numbering option.