hillerlab / CESAR2.0

MIT License

Annotated genes have structural errors. #12

Closed lizihe21 closed 4 years ago

lizihe21 commented 4 years ago

Hi Mr. Hiller, I have finished running the pipeline. In the final result, I found a number of genes with an in-frame stop codon or without a start codon. Is that expected, so that I only need to filter them out, or is something wrong in my process? I used the cattle genome as the reference. The gb file was generated from the filtered NCBI GFF. The genome alignment was done with lastz, following the UCSC 'whole genome alignment how to' tutorial.

MichaelHiller commented 4 years ago

Excellent. Those are likely cases where CESAR cannot find a start or stop codon or where the genome has an in-frame stop codon. You could manually inspect the alignments for a few cases. Also, you may want to check whether in-frame stop codons in your assembly are really supported by sequencing reads or are rather assembly base errors. This can be a particular problem for PacBio-based assemblies (which are either not Illumina-polished, or where no or few reads mapped to this locus).
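To pick candidate genes for that manual inspection, the structural problems can be flagged per coding sequence. The following is only a minimal sketch, not part of CESAR: it assumes the standard genetic code, that each annotated CDS has already been extracted as one nucleotide string in reading frame, and that alignment gaps appear as `-`. The function name `check_cds` and the problem labels are hypothetical.

```python
# Stop codons of the standard genetic code (an assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq):
    """Return a list of structural problems found in one coding sequence."""
    # Normalise case and drop alignment gap characters.
    seq = seq.upper().replace("-", "")
    problems = []

    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")

    if not seq.startswith("ATG"):
        problems.append("missing_start_codon")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    # Any stop codon before the last codon is an in-frame stop.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("inframe_stop_codon")

    # The last complete codon is expected to be a stop codon.
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing_stop_codon")

    return problems
```

Genes for which `check_cds` returns a non-empty list are the ones worth checking against the alignment and the read support.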

lizihe21 commented 4 years ago

Thanks a lot for replying so soon. I will check the alignment and genome again.

lizihe21 commented 4 years ago

I noticed a case where most of an annotated gene has no information in the MAF, and this gene got an in-frame stop codon. It seems that the unaligned region still gets annotated, which usually causes a wrong gene structure.

MichaelHiller commented 4 years ago

In multi-exon mode, CESAR 2 tries to find all exons in the given region in the query. If some exons do not exist (because of a deletion or because there is an assembly gap), the underlying HMM will output the 'best' alignment it can find, even though this may be random sequence. In these cases, you have to filter the exons for alignment quality.
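One simple way to express such a filter is nucleotide identity over the aligned exon. This is a rough sketch under stated assumptions, not CESAR's own scoring: it assumes you have the reference and query exon as two gapped strings of equal length, and the names `percent_identity` and `filter_exons` as well as the 0.6 cutoff are hypothetical and would need tuning for your species pair.

```python
def percent_identity(ref_aln, qry_aln):
    """Fraction of aligned columns in which reference and query bases match."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    columns = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == "-" and q == "-":
            continue  # column with gaps on both sides carries no information
        columns += 1
        if r == q:
            matches += 1  # neither side is a gap here, so this is a true match
    return matches / columns if columns else 0.0


def filter_exons(exons, min_identity=0.6):
    """Keep names of exons whose identity reaches the (arbitrary) cutoff.

    exons: iterable of (name, ref_aln, qry_aln) tuples.
    """
    return [
        name
        for name, ref_aln, qry_aln in exons
        if percent_identity(ref_aln, qry_aln) >= min_identity
    ]
```

Exons aligned to random sequence tend to score near the identity expected by chance, so they fall below a reasonable cutoff, while gaps count against the exon because they enter the denominator.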

Alternatively, run CESAR in single-exon mode only for exons that do align. (This will then miss exons that are too diverged at the nucleotide level but that CESAR would find using its codon alignment.)
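Deciding which exons "do align" could be done by measuring how much of each exon is covered by alignment blocks. The sketch below assumes the reference-side block coordinates have already been parsed from the MAF into half-open `(start, end)` intervals on the same chromosome as the exons; the function names and the 0.5 cutoff are hypothetical.

```python
def covered_fraction(start, end, blocks):
    """Fraction of the half-open interval [start, end) covered by blocks."""
    if end <= start:
        raise ValueError("interval must have positive length")

    # Clip overlapping blocks to the interval and sort them by start.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in blocks if s < end and e > start
    )

    covered = 0
    current_end = start
    for s, e in clipped:
        if e <= current_end:
            continue  # fully contained in what was already counted
        covered += e - max(s, current_end)  # count only the new part
        current_end = e
    return covered / (end - start)


def exons_with_alignment(exons, blocks, min_coverage=0.5):
    """Names of exons covered at least min_coverage by alignment blocks.

    exons: iterable of (name, start, end) tuples in reference coordinates.
    """
    return [
        name
        for name, start, end in exons
        if covered_fraction(start, end, blocks) >= min_coverage
    ]
```

The same coverage value computed over a whole gene would also flag genes that are largely missing from the MAF.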