nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

questions about tblastn/exonerate #112

Closed: yuanning-li closed this issue 6 years ago

yuanning-li commented 6 years ago

I am trying to use the funannotate pipeline to annotate an annelid genome we just got. I have a question about the log file. Your wiki introduction to the pipeline mentions that it will align UniProtKB proteins to the genome using tblastn/exonerate after repeat masking.

However, I didn't find that step in the log file or output files. Is that normal?

Here is my log file for your convenience.

2017-11-01 15:06:37,701: /gpfs01/tools/funannotate-0.7.2/bin/funannotate-predict.py -i genome.final.fa -o fun_out --species Lamellibrachia lumysi --organism other --busco_db metazoa --max_intronlen 20000 --pasa_gff genome.fasta.transdecoder.genome.gff3 --rna_bam RNA_seq.bam --transcript_evidence transcripts.fasta --cpus 64

2017-11-01 15:06:38,429: OS: linux2, 64 cores, ~ 1026 GB RAM. Python: 2.7.12
2017-11-01 15:06:38,667: Running funannotate v0.7.2
2017-11-01 15:07:02,553: AUGUSTUS (3.0.1) detected, version seems to be compatible with BRAKER1 and BUSCO
2017-11-01 15:07:04,321: /tools/evidencemodeler-1.1.0/EvmUtils/gff3_gene_prediction_file_validator.pl fun_out/predict_misc/pasa_predictions.gff3
2017-11-01 15:07:37,716: Loading sequences and soft-masking genome
2017-11-01 15:07:37,717: Soft-masking: building RepeatModeler database
2017-11-01 15:08:02,490: Soft-masking: generating repeat library using RepeatModeler
2017-11-02 19:30:12,188: Soft-masking: running RepeatMasker with custom library
2017-11-03 06:42:50,352: rmOutToGFF3.pl genome.fasta.out
2017-11-03 06:46:41,153: Masked genome: 11,871 scaffolds; 687,711,696 bp; 3.70% repeats masked
2017-11-03 06:46:44,630: Aligning transcript evidence to genome with GMAP
2017-11-03 08:01:28,236: 359,453 transcripts aligned with GMAP
2017-11-03 08:01:28,237: Aligning transcript evidence to genome with BLAT
2017-11-03 08:01:28,237: blat -noHead -minIdentity=80 -maxIntron=20000 /gpfs01/home/yzl0084/Lamellibrachia/fun_annotate/predict/fun_out/predict_misc/genome.softmasked.fa fun_out/predict_misc/transcripts.combined.fa fun_out/predict_misc/blat.psl

Best, Li

nextgenusfs commented 6 years ago

Protein evidence mapping happens after the transcripts are aligned, and that step has not shown up in your log file yet. Is it still running? There is a second log file with the specifics of what happens during that process; it will be in logfiles/funannotate-p2g.log.
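A quick way to see whether protein mapping has started is to look at the last few lines of that second log. This is a minimal sketch, assuming the log sits under the output folder passed to -o (fun_out in the command above); the helper name tail_log is hypothetical and not part of funannotate.

```python
from collections import deque
from pathlib import Path


def tail_log(path, n=10):
    """Return the last n lines of a log file, or [] if it does not exist."""
    log = Path(path)
    if not log.is_file():
        # A missing file most likely means this step has not started yet.
        return []
    with log.open(encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the final n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    # Assumed location: <output folder>/logfiles/funannotate-p2g.log
    for line in tail_log("fun_out/logfiles/funannotate-p2g.log"):
        print(line)
```

An empty result here would be consistent with the pipeline still being in the transcript alignment stage.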

yuanning-li commented 6 years ago

Sorry, that makes sense. The pipeline is still running; I just got confused because you listed aligning UniProtKB using tblastn right after repeat masking:

"This command will first run RepeatModeler on your genome, soft-mask repeats using RepeatMasker, align UniProtKB proteins to genome using tblastn/exonerate"

Thanks a lot for your quick response. Do you know if anyone else is trying to use this pipeline for marine invertebrates?

Best, Li

nextgenusfs commented 6 years ago

I know the docs need some work. I have several new modules to incorporate around adding RNA-seq and updating gene models with PASA, and I'm also working on some better functional annotation. When I get these updates done, I'm planning to write a decent manual at Read the Docs. Sorry it isn't clearer at the moment.

I don't know exactly what people have been using it for. I primarily work on fungi and use it for that, but I've tried to expand it based on user interest to be a more universal tool for eukaryotes. Whether it works great or doesn't work well, let me know, and if it doesn't work well we can try to figure out how to make it work better for your organism. I wrote this out of frustration with other pipelines being slow and producing gene models that do not conform to NCBI annotation rules (mainly I'm referring to Maker here...).

yuanning-li commented 6 years ago

Yes, it would be great if the pipeline could be used for any eukaryote, especially animals. I will let you know once it is finished so that you can take a look and see whether it works properly.

Thanks a lot for putting this pipeline together!

Li

yuanning-li commented 6 years ago

Dear Jon:

Hope you have a great thanksgiving break!

After troubleshooting the funannotate pipeline for a couple of weeks, we finally got the predict step working on our cluster. It finished without any errors.

However, I got ~47,000 proteins in total from this step, which is a lot higher than for most animals (typically ~25,000). Any ideas how to further filter problematic genes before running the annotation step?

Cheers, Li

nextgenusfs commented 6 years ago

Is your assembly haploid/diploid/polyploid? Based on the size you had before (650 Mb) and the repeat content (the log file reports 3.70% masked repeats, but in v0.7.2 a bug makes this number off by a factor of 10, so it is really about 37%), that number doesn't seem too ridiculous if your assembly is more than haploid. Otherwise perhaps gene models are getting truncated. Can you compare gene models to another genome to see if they are indeed truncated? What do the BUSCO results look like?
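One rough way to check for truncated models is to compare protein length distributions between the predicted proteome and the proteome of a reference species. The sketch below uses only the standard library; the function names are hypothetical, and the 100 aa cutoff for "short" is an arbitrary choice for illustration, not a funannotate setting.

```python
from statistics import median


def protein_lengths(fasta_path):
    """Return the length of every sequence in a protein FASTA file."""
    lengths = []
    current = None  # residues counted for the record being read
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                # Do not count a trailing stop-codon asterisk as a residue.
                current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths, short_cutoff=100):
    """Return (count, median length, fraction shorter than short_cutoff)."""
    if not lengths:
        return 0, 0, 0.0
    short = sum(1 for n in lengths if n < short_cutoff)
    return len(lengths), median(lengths), short / len(lengths)
```

Calling summarize(protein_lengths(path)) on both proteomes and comparing the results gives a first impression: a much lower median, or a much larger short fraction, in the new annotation would be consistent with truncated or fragmented models, though it is not proof.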

yuanning-li commented 6 years ago

Hey Jon:

The genome is diploid but the assembly is haploid. The genome size and repeat content are very similar to the estimates from short paired-end reads.

There is actually no closely related genome to date; the most closely related species has 32,389 genes. The BUSCO results also looked fine:

C:91.1%[S:90.0%,D:1.1%],F:1.6%,M:7.3%,n:978

Any ideas how to compare gene models? I also compiled all the log files from funannotate for your convenience.

Best, Li


nextgenusfs commented 6 years ago

Hi Li, It seems like the logfiles didn't attach correctly (you may need to do it on GitHub for it to work). Are the BUSCO results based on the proteome that you got from funannotate? These numbers would suggest that most models are indeed single copy and not repeated.
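The summary line quoted earlier in the thread uses BUSCO's short notation (C complete, S single-copy, D duplicated, F fragmented, M missing, n the number of BUSCO groups searched). The sketch below pulls the numbers out so the duplicated share can be read off directly. The regular expression assumes exactly this one-line layout, and the function name is hypothetical.

```python
import re

BUSCO_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Parse a one-line BUSCO summary into a dict of numbers."""
    match = BUSCO_PATTERN.search(text)
    if match is None:
        raise ValueError("not a BUSCO summary line: %r" % text)
    values = {key: float(val) for key, val in match.groupdict().items()}
    values["total"] = int(values["total"])
    return values


summary = parse_busco_summary("C:91.1%[S:90.0%,D:1.1%],F:1.6%,M:7.3%,n:978")
# Share of the complete BUSCOs that were found more than once.
duplicated_share = summary["duplicated"] / summary["complete"]
print(summary)
print("duplicated share of complete BUSCOs: %.1f%%" % (100 * duplicated_share))
```

A small duplicated share means that few conserved genes were predicted more than once, so duplicated models alone are unlikely to explain the high gene count.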

So a few things you can check:

1) How many PASA gene models are there? And were those the result of running the pasa_asmbls_to_training_set.dbi script?

2) You can also look at the number of Augustus and/or GeneMark models. A simple grep should work on the file in predict_misc/gene_predictions.gff3, i.e. grep -c $'\tGeneMark\t' gene_predictions.gff3 will give you the number of GeneMark models. The other gene models should have Augustus and then pasa_pred.

If the number of gene models is similar between the different methods, it would be hard for me to imagine that they are all incorrect.
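The per-source counts can also be collected in one pass. Note that grep -c counts every matching line, so if the file holds exon or CDS rows as well as gene rows, the total will be larger than the number of models. The sketch below counts only rows of one feature type, grouped by the source column. It assumes a standard nine-column GFF3 file that contains rows of type gene; the function name is hypothetical.

```python
from collections import Counter


def count_features_by_source(gff3_path, feature_type="gene"):
    """Count GFF3 rows of one feature type, grouped by the source column."""
    counts = Counter()
    with open(gff3_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # not a well-formed nine-column GFF3 row
            source, ftype = fields[1], fields[2]
            if ftype == feature_type:
                counts[source] += 1
    return counts
```

For example, count_features_by_source("fun_out/predict_misc/gene_predictions.gff3") would return one count per source label found in the file, such as GeneMark, Augustus and pasa_pred if those are the labels used.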

Perhaps it could be how the training was done. I think you leveraged RNA-seq mediated training, correct? The current tip of the funannotate repository has a lot of upgrades, one of which is RNA-seq mediated methods. I haven't done an official release yet, as I'm trying to finish the documentation first.

I'm not sure what else to look at. Perhaps looking at the logfiles will give me some more ideas.

Jon

yuanning-li commented 6 years ago

Hey Jon:

Thanks for your help. I will take a look at all the gene models right now.

I have also uploaded the log files here to see if you can find anything: logfiles.zip

Best, Li

nextgenusfs commented 6 years ago

A new version, v1.0.0, has been released; please reopen if these problems persist.