sagnikbanerjee15 / Finder

A fully automated gene annotator from RNA-Seq expression data
MIT License

Why are the BUSCO scores for the annotated example Arabidopsis thaliana genome proteins so low? #39

Open bioinformaticspcj opened 2 years ago

bioinformaticspcj commented 2 years ago

Dear authors,

Thanks a lot for your valuable and user-friendly software. I tried the finder program on the example Arabidopsis thaliana genome, but the proteins extracted from the FINDER_BRAKER_PROT.gtf file reached a BUSCO score of only 54% against embryophyta_odb10. That is very low compared with the 99.3% BUSCO score of the genome itself with the same database, and it means that nearly half of the coding genes were not annotated. I do not know why. The command I used to run finder is as follows:

finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n 30 -gdir_star $PWD/star_index_without_transcriptome -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -gdir_olego olego_index -preserve 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error

Could you give me some advice on improving the performance of the annotation?

Thanks again for reading. I look forward to your response. Best, Bob
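For reference, the kind of evaluation described in this comment (proteins extracted from a GTF file, then scored with BUSCO against embryophyta_odb10) can be sketched roughly as below. This is a minimal sketch, not the exact procedure used in this thread: it assumes gffread and BUSCO v5 are installed and on PATH, and the function name `gtf_to_busco` and the output file names are hypothetical.

```shell
# Sketch: translate the CDS features of a GTF into proteins, then run BUSCO
# in protein mode. Assumes gffread and BUSCO v5 are on PATH (an assumption;
# the thread does not say which tools were used for the extraction).
gtf_to_busco() {
  # $1: annotation GTF, $2: genome FASTA, $3: label used for output names
  if [ "$#" -ne 3 ]; then
    echo "usage: gtf_to_busco <annotation.gtf> <genome.fa> <label>" >&2
    return 2
  fi
  # gffread -y writes the protein translation of each transcript's CDS
  gffread "$1" -g "$2" -y "$3.proteins.fa" || return 1
  # -m proteins selects protein mode; -l names the lineage dataset;
  # -c 30 mirrors the 30 threads given to finder with -n 30
  busco -i "$3.proteins.fa" -l embryophyta_odb10 -m proteins -o "$3_busco" -c 30
}

# Example call, using file names mentioned in this thread:
# gtf_to_busco FINDER_BRAKER_PROT.gtf Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa finder
```

Running the same function on braker.gtf gives a like-for-like comparison, since both protein sets are then extracted and scored the same way.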

sagnikbanerjee15 commented 2 years ago

Hello @bioinformaticspcj,

Thank you so much for your interest in finder. The reason why you see "poor" models is that the example is meant to ensure that the pipeline runs without any errors. It is not intended to produce usable annotations since the example is far from exhaustive. To obtain a better annotation, please supplement the example with more RNA-Seq samples. That will be your best fighting chance!!

Now that you have mentioned BUSCO, it makes me wonder if we should include that in our next release.

Please let us know if you have any other questions or concerns.

Thank you.

bioinformaticspcj commented 2 years ago

Dear Sagnik Banerjee,

Thanks for your timely response. I found that the proteins extracted from the braker.gtf file achieved a 97.6% BUSCO score. Does that mean finder relies more on transcriptome evidence than braker2 does?

Thanks. Best, Bob
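The three figures quoted so far (99.3% for the genome, 97.6% for the braker.gtf proteins, 54% for the FINDER_BRAKER_PROT.gtf proteins) can be compared directly from BUSCO's short summary files. The sketch below assumes the usual one-line summary of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the function names are hypothetical, and the path of the summary file depends on how BUSCO was run.

```shell
# Print the "C:" (complete BUSCOs) percentage from a BUSCO short summary file.
# Assumes the file contains the standard one-line summary string.
busco_complete_pct() {
  sed -n 's/.*C:\([0-9.]*\)%.*/\1/p' "$1" | head -n 1
}

# Print the difference between two percentages, in percentage points.
busco_gap() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a - b }'
}

# With the figures reported in this thread:
# busco_gap 97.6 54    # braker.gtf proteins vs. FINDER_BRAKER_PROT.gtf proteins
# busco_gap 99.3 54    # genome vs. FINDER_BRAKER_PROT.gtf proteins
```

For the numbers reported here, the two gaps come out at 43.6 and 45.3 percentage points.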

------------------ Original message ------------------ From: "sagnikbanerjee15/Finder"; Sent: Wednesday, December 1, 2021, 10:42 PM; Subject: Re: [sagnikbanerjee15/Finder] Why the busco scores for the annotated example arabidopsis thaliana genome proteins are very low ? (Issue #39)


sagnikbanerjee15 commented 2 years ago

Hello @bioinformaticspcj,

Yes, finder is designed to give more importance to RNA-Seq evidence. It will remove those braker predictions which do not have associated protein evidence. Since the peptide file provided in the example was just a small collection, it removed several of the genes predicted by braker. In the next version, we are considering keeping those annotations and reporting them as such.
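To illustrate the filtering idea described above, here is a minimal sketch that keeps only the GTF records whose transcript appears in a list of supported transcript IDs. It is not finder's actual implementation: the ID list, the function name `filter_gtf_by_evidence`, and the reliance on a `transcript_id "..."` attribute in column 9 are all assumptions made for the example.

```shell
# Sketch: keep only GTF records whose transcript_id is listed as supported.
# Records without a transcript_id "..." attribute are dropped by this sketch.
filter_gtf_by_evidence() {
  # $1: file with one supported transcript ID per line (hypothetical input)
  # $2: GTF file to filter
  awk -F '\t' '
    FILENAME == ARGV[1] { keep[$1] = 1; next }   # load the supported IDs
    /^#/ { next }                                # skip comment lines
    {
      if (match($9, /transcript_id "[^"]+"/)) {
        # strip the 15-character prefix and the closing quote
        id = substr($9, RSTART + 15, RLENGTH - 16)
        if (id in keep) print
      }
    }' "$1" "$2"
}

# Example call (file names are placeholders):
# filter_gtf_by_evidence supported_ids.txt braker.gtf > filtered.gtf
```

With a small protein file the list of supported IDs is short, so many predictions are dropped, which is consistent with the lower BUSCO score reported for FINDER_BRAKER_PROT.gtf.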

Please let me know if I can answer any other questions that you may have.

We are planning to release a Docker version soon that will eliminate the installation issues.

Thank you.