steuernb / NLR-Annotator

NLR-Annotator upload
GNU General Public License v3.0

Question about differences among results from NLR-parser, NLR-annotator and Ren-seq #9

Closed b524198065 closed 4 years ago

b524198065 commented 4 years ago

Dear developers:

Thanks for developing this software. Recently I was trying to identify NLR genes in the tomato reference genome Heinz 1706 SL4.0 and found that NLR-parser returned 242 NLR candidates using protein sequences as input, while NLR-annotator reported 292 when processing nucleotide sequences. Andolfo et al. (2014) reported 326 NLR genes, with 30 loci lacking NB-ARC domains, in tomato Heinz 1706 using RenSeq. So how can we explain the differences among them?

Since we have the whole-genome annotation for tomato SL4.0, do we really need to run the NLR-annotator pipeline to identify candidate NLR loci? (Maybe a simple run of NLR-parser on the amino acid sequences is enough?) If we just extract the gene models within the candidate NLR loci generated by NLR-annotator, what about the potential loci without annotated gene models? Is it feasible and reliable to use de novo gene prediction software like AUGUSTUS to predict the ORFs?
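To make the question about loci without gene models concrete, here is a minimal sketch of how such loci could be listed. It assumes the candidate loci and the annotated gene models have already been parsed into intervals with 0-based, half-open coordinates (for example from BED or GFF files); the `Interval` type, the function name and the example coordinates are hypothetical and not part of NLR-annotator.

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical record for a locus or gene model."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def loci_without_gene_models(loci, genes):
    """Return the names of loci that overlap no annotated gene model."""
    # Index gene models by chromosome so each locus is only compared
    # against genes on the same sequence.
    genes_by_chrom = defaultdict(list)
    for gene in genes:
        genes_by_chrom[gene.chrom].append(gene)

    unannotated = []
    for locus in loci:
        candidates = genes_by_chrom.get(locus.chrom, [])
        # Two half-open intervals overlap if each starts before the other ends.
        has_gene = any(
            gene.start < locus.end and locus.start < gene.end
            for gene in candidates
        )
        if not has_gene:
            unannotated.append(locus.name)
    return unannotated


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example_loci = [
        Interval("chr1", 100, 500, "locus_1"),
        Interval("chr1", 1000, 1500, "locus_2"),
    ]
    example_genes = [Interval("chr1", 50, 300, "geneA")]
    print(loci_without_gene_models(example_loci, example_genes))
```

The loci returned this way would be the ones to inspect for pseudogenes or missed gene models.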

Given a relatively high-quality genome assembly (PacBio-based) and gene annotation, how would you recommend identifying NLR genes genome-wide across a number of genome assemblies?

steuernb commented 4 years ago

Hi, I think Andolfo et al. annotated manually, which in most cases is better than any automated pipeline. I have not checked where the differences come from. My assumption is that there are differences in the definition of what counts as an NLR and what does not. For automatic pipelines you will always have a tradeoff between sensitivity and specificity. If you look into it and spot any patterns, please let us know; this will be valuable feedback. In my experience, a gene annotation is never set in stone. There is always room for improvement. Depending on the biological question you want to answer, you can use the existing annotation as a general overview, or you need to spend lots of time curating the aspect you want to be certain of. Another aspect is whether you are interested in pseudogenes. Those might be functional alleles in accessions other than your reference genome. NLR-Annotator will give you those as well. In many cases, of course, this means you would not be able to predict any ORF on top of the pseudogene.

b524198065 commented 4 years ago

Hi steuernb,

Thanks for your reply. When I ran NLR-annotator on the Arabidopsis TAIR10 genome and NLR-parser on the Araport11 annotation, something strange occurred: NLR-annotator returned 171 NLR candidate loci while NLR-parser reported 216, which is consistent with the results in the papers you published. What I expected was that, when processing DNA sequences, NLR-annotator would give us more candidates, just as in tomato (292 by NLR-annotator vs 242 by NLR-parser), because genome annotation is usually not perfect and some pseudogenes might not be annotated but would still be reported by the software. In this case, I suppose that many of the Araport11 proteins that NLR-parser treats as NLRs are actually partially annotated, so in the DNA-level result of NLR-annotator these partially annotated genes are merged into one complete locus. This suggests that manual curation of gene models is indeed necessary if we want to ensure a high-quality NLR gene set.
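One way to check this hypothesis would be to look for loci that span more than one annotated gene model. The sketch below assumes loci and genes are available as `(chrom, start, end, name)` tuples with 0-based, half-open coordinates; the function name and the example values are hypothetical.

```python
def loci_with_multiple_genes(loci, genes):
    """Map each locus name to its overlapping gene names, keeping only
    loci that overlap two or more gene models.

    Both arguments are iterables of (chrom, start, end, name) tuples
    with 0-based, half-open coordinates.
    """
    genes = list(genes)  # allow several passes even if a generator is given
    flagged = {}
    for l_chrom, l_start, l_end, l_name in loci:
        overlapping = [
            g_name
            for g_chrom, g_start, g_end, g_name in genes
            if g_chrom == l_chrom and g_start < l_end and l_start < g_end
        ]
        if len(overlapping) > 1:
            flagged[l_name] = overlapping
    return flagged


if __name__ == "__main__":
    # Made-up coordinates: two fragmentary gene models inside one locus.
    example_loci = [("Chr1", 1000, 5000, "locus_1"), ("Chr1", 8000, 9000, "locus_2")]
    example_genes = [
        ("Chr1", 1100, 2500, "gene_a"),
        ("Chr1", 2600, 4800, "gene_b"),
        ("Chr1", 8100, 8900, "gene_c"),
    ]
    print(loci_with_multiple_genes(example_loci, example_genes))
```

A locus flagged here is only a candidate for a split annotation; it could just as well contain two genuinely separate genes.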

Do you have any experience with manually modifying gene annotations? Since the homologous protein sequences might not be that accurate (we don't have sufficient RNA-seq data), is it appropriate to merge or split gene models based only on the motif positions identified by NLR-annotator?

Best, Hongbo

steuernb commented 4 years ago

Hi again, the difference is explained by the specificity of the two programs. NLR-Parser will report sequences with only a single TIR, CC or LRR domain, whereas NLR-Annotator requires the trace of an NB-ARC domain to report something. We had to do this because otherwise we would be overwhelmed by false positives. Some of the genes reported by NLR-Parser are false positives. They have LRRs; that is why they were found. They are likely to be involved in resistance, and we prioritized sensitivity. But for screening genomes we needed to change that. I would not make a call on a gene model without RNA-seq data. Many biological questions can be answered without knowing the exact gene model. For the others, you need to invest in RNA-seq. You could enrich your transcripts for NLRs with RenSeq. That might give you the sequencing depth you need.

All the best, Burkhard
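The NB-ARC requirement described above can be illustrated with a small filter. This is only a sketch: it assumes each candidate has already been reduced to a set of domain labels, and the labels (`"NB-ARC"`, `"LRR"`, ...) and sequence IDs are hypothetical. It does not reproduce the actual output format or internal logic of either tool.

```python
def split_by_nbarc(candidates):
    """Split candidates by whether any NB-ARC evidence is present.

    candidates: dict mapping a sequence ID to a set of domain labels.
    Returns (with_nbarc, without_nbarc), two dicts of the same shape.
    """
    with_nbarc = {}
    without_nbarc = {}
    for seq_id, domains in candidates.items():
        # Stricter criterion: keep a candidate only if NB-ARC is among its domains.
        target = with_nbarc if "NB-ARC" in domains else without_nbarc
        target[seq_id] = domains
    return with_nbarc, without_nbarc


if __name__ == "__main__":
    # Made-up candidates, for illustration only.
    example = {
        "seq1": {"CC", "NB-ARC", "LRR"},
        "seq2": {"LRR"},          # LRR only: reported by the more sensitive criterion
        "seq3": {"TIR", "NB-ARC"},
    }
    kept, dropped = split_by_nbarc(example)
    print(sorted(kept), sorted(dropped))
```

Comparing the two groups on a real dataset would show how much of the count difference comes from candidates lacking NB-ARC evidence.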

b524198065 commented 4 years ago

Thanks again!