svm-zhang / AGOUTI

Annotated Genome Optimization Using Transcriptome Information
MIT License

Question regarding recommended contig lengths #8

Closed by darencard 7 years ago

darencard commented 7 years ago

Just wondered if there is any guidance on what to aim for as far as contig sizes for using AGOUTI. This applies more specifically to the ab initio gene predictions, which will obviously be worse with smaller contig sizes, but is also a consideration with using these models to run AGOUTI. Any insights are much appreciated.

svm-zhang commented 7 years ago

Hello Daren,

This is a great question, and definitely an open one.

Assemblers create many of these small contigs, from 100 bp to 1 kb and sometimes even shorter than 100 bp. In my opinion, the smallest contigs used with AGOUTI need to be at least longer than the read length, just for the sake of reliable read mapping. However, including small contigs requires a bit of caution, as many of these small floating pieces in assemblies are products of genomic regions of high heterozygosity, low complexity, etc. This means that you can have two copies of the same gene: one on a larger contig and the other on a smaller one (for example, when one heterozygous locus is split into two loci because the assembler cannot distinguish the two alleles). If this happens, AGOUTI will erroneously merge these two split gene models into one. So far, for all the simulated and real datasets I have been playing with, I have run AGOUTI on contigs at least 1 kb long. This choice follows what ALLPATHS-LG uses.
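A minimal sketch of the 1 kb length filter mentioned above, assuming the assembly is a plain FASTA file read as lines of text. The function names `parse_fasta` and `filter_contigs` are hypothetical helpers for illustration and are not part of AGOUTI.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first whitespace-delimited token as the contig name.
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(lines: Iterable[str], min_len: int = 1000) -> Dict[str, str]:
    """Keep only contigs whose length is at least min_len (assumed 1 kb default)."""
    return {name: seq for name, seq in parse_fasta(lines) if len(seq) >= min_len}
```

With a real assembly this could be called as `filter_contigs(open("assembly.fasta"))`, where the file name is a placeholder. The retained contigs would then be the ones passed on to gene prediction and read mapping.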

I have had one immature (and pretty naive) idea for a while for benchmarking this. Instead of making a hard cut at a certain length and ignoring all contigs shorter than the cutoff, I think it would be better to try multiple cutoffs and pick the one that gives the lowest number of alignments between the sequences shorter than the cutoff and the rest. That way, running gene prediction on these sequences is less likely to yield gene models that are duplicated somewhere else in the assembly (I think).
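The multiple-cutoff idea could be sketched roughly as follows. This is only an illustration under stated assumptions: contig lengths are given as a dictionary, and `alignments` is a list of contig-name pairs taken from some all-vs-all alignment of the assembly against itself (the aligner and its output parsing are left out). The names `count_cross_alignments` and `pick_cutoff` are hypothetical, and ties are broken in favor of the smaller cutoff, which is a choice made here and not something stated above.

```python
from typing import Dict, Iterable, Sequence, Tuple

Alignment = Tuple[str, str]  # (contig name, contig name) for one alignment


def count_cross_alignments(
    lengths: Dict[str, int], alignments: Iterable[Alignment], cutoff: int
) -> int:
    """Count alignments joining a contig shorter than cutoff to one at or above it."""
    count = 0
    for a, b in alignments:
        a_is_short = lengths[a] < cutoff
        b_is_short = lengths[b] < cutoff
        if a_is_short != b_is_short:
            count += 1
    return count


def pick_cutoff(
    lengths: Dict[str, int],
    alignments: Iterable[Alignment],
    cutoffs: Sequence[int],
) -> int:
    """Return the candidate cutoff with the fewest short-vs-long alignments."""
    pairs = list(alignments)  # allow several passes over the alignments
    scored = [(count_cross_alignments(lengths, pairs, c), c) for c in cutoffs]
    # min() on (count, cutoff) tuples prefers the lower count, then the lower cutoff.
    return min(scored)[1]
```

Note that a very small candidate cutoff trivially leaves few or no sequences below it, so the candidate list would need to be restricted to a sensible range (for example, above the read length).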

I hope this makes a bit of sense. Let me know.

Also thanks for using AGOUTI.

Simo

svm-zhang commented 7 years ago

Hello @darencard, feel free to reopen this if you have further questions.

darencard commented 7 years ago

Hi Simo,

Thanks for the prompt and thorough reply to my question. I'm sorry, I'm only just now seeing it. Good advice overall, and I will keep it in mind as I start experimenting with AGOUTI.

Best, Daren