Interpret the codon alignment results

chenyangkang commented 1 year ago

Hi!

Great software. I wish to know the meaning of the header in the codon.fasta output:

For example:

>ENSGALT00010053003.17 | CODON | REFERENCE
>ENSGALT00010053003.17 | CODON | QUERY

Question1: What does the ".17" mean here?

Question2: If the .17 means the starting position of the alignment, what is the best way to generate multi-species codon alignment for each gene, as mention in the Science paper (how to concatenate them, and pass them to MACSE)? And should I only include transcripts annotated as "one2one" in the "orthology_classification.tsv" for downstream comparative genomic positive selection analysis?

Thanks in advance! Yangkang

MichaelHiller commented 1 year ago

Hi Yangkang,

TOGA names genes (more specific projections) as transcriptID.geneID.chainID. ChainID tells us which chain was used that aligns the reference and the query locus. Looks like you don't have an isoform file that assigns transcripts to genes, therefore geneID is empty. So 17 is the chain with ID 17. REFERENCE and QUERY tells you which species the seq is from.

2) Pls use https://github.com/hillerlab/TOGA/blob/master/supply/extract_codon_alignment.py @kirilenkobm Can we pls document that on github?

Thx Michael

chenyangkang commented 1 year ago

@MichaelHiller Thanks a lot Dr. Hiller!

For the question 1, I think I did pass a isoform file downloaded from the APPRIS website for chicken, and further processed:

ENSGALG00010001237      ENSGALT00010002860
ENSGALG00010001212      ENSGALT00010002803
ENSGALG00010001182      ENSGALT00010002734

Besides, this is the project_args.json file which I believe logged the parameter inputs:

{"chain_input": "CHICKEN.pieavo1.allfilled.chain", "bed_input": "/beegfs/store4/chicken_ref_seq/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.109.bed12.new", 
"tDB": "CHICKEN.2bit", "qDB": "pieavo1.2bit", "project_dir": null, 
"project_name": "test_test", "min_score": 15000, "isoforms": "/beegfs/store4/chicken_ref_seq/appris_data.principal.txt.new", 
"keep_temp": true, "limit_to_ref_chrom": null, "nextflow_dir": null, 
"nextflow_config_dir": null, "do_not_del_nf_logs": false, "cesar_bigmem_config": null, 
"para": false, "para_bigmem": false, "chain_jobs_num": 2, "no_chain_filter": false, "orth_score_threshold": 0.5, "cesar_jobs_num": 2, "cesar_binary": null, "using_optimized_cesar": false, 
"output_opt_cesar_regions": false, "mask_stops": true, "cesar_buckets": "0", 
"cesar_exec_seq": false, "cesar_chain_limit": 100, "cesar_mem_limit": 16, "time_marks": null, 
"u12": null, "stop_at_chain_class": false, "uhq_flank": 50, "o2o_only": false, "no_fpi": false, 
"fragmented_genome": false, "ld_model": false, "annotate_paralogs": false, "mask_all_first_10p": false}

I sanity checked some of the transcript names and confirmed that the one in the isoform file can be found in the bed12 file. I think I actually co-filtered these two files to make sure they match.

Yangkang

MichaelHiller commented 1 year ago

Hmm, then I don't know why the gene symbol does not show up. @kirilenkobm Do you know why?

chenyangkang commented 1 year ago

I think there is no gene symbol in the bed12 file..?

Z   86002865    86025052    ENSGALT00010002860  0   -   86003109    86024602    0   18  422,104,169,94,226,147,129,87,99,69,99,93,69,144,117,86,100,292,    0,1066,2784,3399,5151,6095,10946,12391,13334,13995,15561,16784,17943,20038,20739,21312,21673,21895,
Z   85811534    85919193    ENSGALT00010002803  0   +   85811643    85919176    0   10  155,166,198,65,129,176,130,102,87,94,   0,5298,9475,14118,15342,41448,61592,85325,101520,107565,
Z   85688003    85790362    ENSGALT00010002734  0   -   85688903    85787282    0   10  1346,118,96,36,171,126,44,702,123,81,   0,1771,6186,22423,33374,35214,36506,98594,101077,102278,

So I cannot think of sources where TOGA can extract symbols from. Am I wrong?

Yangkang

MichaelHiller commented 1 year ago

Good catch. This is the problem.

If you run TOGA with the isoform file, it will consider the different transcripts of a gene, but it will still print what is given in the bed12 file that lists the transcript coordinates.

Pls append the gene symbol to the transcript name in the bed file. For our human annotation, we have

g ENST00000420798 toga.transcripts.bed toga.isoforms.tsv 
toga.transcripts.bed:chrX   54920562    54931429    ENST00000420798.TRO 0   +   54924990    54931020    0   11  316,105,64,80,92,80,43,63,115,2429,226, 0,3888,4107,4426,5029,5847,6020,6480,7104,8040,10641,
toga.isoforms.tsv:ENSG00000067445   ENST00000420798.TRO

hillerlab / TOGA

Interpret the codon alignment results #76