hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License

Sharing annotation file #78

Closed: chenyangkang closed this issue 1 year ago

chenyangkang commented 1 year ago

Hi! It's me again :)

I'm combining more species based on the 501-bird codon alignment, but I'm having difficulty using the codon alignments that you shared in the Science paper, because the transcript/gene naming seems to follow GenBank(?), like 'rna-XM_025141352.1.fa'. Some names have a gene symbol prefixed, some do not, and some are in Ensembl format. It would be very helpful if you could share the chicken BED12 file you used for annotation so that we can look up the symbols of all the genes.

Thanks in advance!

Yangkang

MichaelHiller commented 1 year ago

Hi Yangkang,

Correct, we merged Ensembl and RefSeq because the union of both had higher completeness. The input annotation for chicken is already available: https://github.com/hillerlab/TOGA/tree/master/TOGAInput/chicken_galGal6
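For anyone matching the shared alignment files against that annotation, here is a minimal Python sketch. It assumes the annotation is a standard tab-separated BED12 file whose fourth column holds the transcript name, and that the alignment file names are simply that name plus `.fa`; neither assumption is confirmed in this thread. The function names are hypothetical and the coordinates in the example record are made up for illustration.

```python
from pathlib import PurePath


def transcript_id_from_filename(filename: str) -> str:
    """Drop any directory part and a trailing '.fa' from an alignment file name."""
    name = PurePath(filename).name
    return name[:-3] if name.endswith(".fa") else name


def bed12_names(lines):
    """Return the name column (4th field) of every BED12 record, in file order."""
    names = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) != 12:
            raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
        names.append(fields[3])
    return names


# Illustrative record only: the coordinates are placeholders, not real data.
EXAMPLE_BED12 = "chr1\t100\t500\trna-XM_025141352.1\t0\t+\t100\t500\t0\t2\t50,60\t0,340\n"

annotated = set(bed12_names([EXAMPLE_BED12]))
query = transcript_id_from_filename("rna-XM_025141352.1.fa")
found = query in annotated  # True if the alignment file has a matching BED12 record
```

To run this on a real file, pass an open file handle to `bed12_names` instead of the example list.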

chenyangkang commented 1 year ago

@MichaelHiller Thanks, Dr. Hiller! That was helpful. What are the genes whose names start with "reg_"? Did you name them yourselves, or can I find this information somewhere?

Thanks!

Yangkang

MichaelHiller commented 1 year ago

Looks like we didn't have a gene symbol for those. The question is why. I'll ask.

MichaelHiller commented 1 year ago

I got an answer from Ekaterina: "These 9589 transcripts didn’t get a gene name because they were not annotated in the previous NCBI chicken annotation." This concerns the input annotation that we produced back in 2020/21. Therefore, they just got an ID.

Looks like NCBI has now named many transcripts, so it could be worth updating the chicken annotation (compared to human / mouse, chicken has more room for improvement).

MichaelHiller commented 1 year ago

Ekaterina provided an updated file that uses the current gene symbols to assign a proper gene symbol to more transcripts. Please note that this is not the filtered transcript set (meaning it still contains transcripts with short introns, NMD targets and other improper transcripts), but it may be helpful for you because the transcript IDs are the same. Please see https://github.com/hillerlab/TOGA/tree/master/TOGAInput/chicken_galGal6/UnfilteredTranscriptsWithUpdatedGeneSymbols

As an alternative, you could use the latest NCBI RefSeq annotation (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/315/GCF_000002315.6_GRCg6a/) and produce a new input annotation.
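If you go the RefSeq route, a transcript-to-symbol table can be pulled from the GFF3 annotation with a small script. The sketch below is not how the TOGA input was built. It assumes the usual RefSeq GFF3 layout, where each `mRNA` feature carries an `ID` attribute (such as `rna-XM_...`) and a `gene` attribute with the symbol. The function names and the toy records are hypothetical, and converting the annotation to BED12 is not covered here.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Parse the GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def transcript_to_symbol(gff_lines, feature_types=("mRNA",)):
    """Map transcript ID -> gene symbol for the selected feature types."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in feature_types:
            continue
        attrs = parse_attributes(fields[8])
        # Assumption: 'ID' is the transcript ID and 'gene' is the gene symbol.
        if "ID" in attrs and "gene" in attrs:
            mapping[attrs["ID"]] = attrs["gene"]
    return mapping


# Toy records with placeholder IDs and coordinates, not real annotation data.
TOY_GFF = [
    "##gff-version 3\n",
    "chr1\tGnomon\tmRNA\t100\t500\t.\t+\t.\t"
    "ID=rna-XM_000000001.1;Parent=gene-TOYGENE1;gene=TOYGENE1\n",
    "chr1\tGnomon\texon\t100\t150\t.\t+\t.\t"
    "ID=exon-XM_000000001.1-1;Parent=rna-XM_000000001.1;gene=TOYGENE1\n",
]

symbols = transcript_to_symbol(TOY_GFF)
```

Other transcript-like feature types can be included through `feature_types` if the annotation uses them.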

chenyangkang commented 1 year ago

@MichaelHiller Thanks! This is greatly helpful! Appreciated!