Closed chenyangkang closed 1 year ago
Hi Yangkang,
TOGA names genes (more specific projections) as transcriptID.geneID.chainID. ChainID tells us which chain was used that aligns the reference and the query locus. Looks like you don't have an isoform file that assigns transcripts to genes, therefore geneID is empty. So 17 is the chain with ID 17. REFERENCE and QUERY tells you which species the seq is from.
2) Pls use https://github.com/hillerlab/TOGA/blob/master/supply/extract_codon_alignment.py @kirilenkobm Can we pls document that on github?
Thx Michael
@MichaelHiller Thanks a lot Dr. Hiller!
For the question 1, I think I did pass a isoform file downloaded from the APPRIS website for chicken, and further processed:
ENSGALG00010001237 ENSGALT00010002860
ENSGALG00010001212 ENSGALT00010002803
ENSGALG00010001182 ENSGALT00010002734
Besides, this is the project_args.json
file which I believe logged the parameter inputs:
{"chain_input": "CHICKEN.pieavo1.allfilled.chain", "bed_input": "/beegfs/store4/chicken_ref_seq/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.109.bed12.new",
"tDB": "CHICKEN.2bit", "qDB": "pieavo1.2bit", "project_dir": null,
"project_name": "test_test", "min_score": 15000, "isoforms": "/beegfs/store4/chicken_ref_seq/appris_data.principal.txt.new",
"keep_temp": true, "limit_to_ref_chrom": null, "nextflow_dir": null,
"nextflow_config_dir": null, "do_not_del_nf_logs": false, "cesar_bigmem_config": null,
"para": false, "para_bigmem": false, "chain_jobs_num": 2, "no_chain_filter": false, "orth_score_threshold": 0.5, "cesar_jobs_num": 2, "cesar_binary": null, "using_optimized_cesar": false,
"output_opt_cesar_regions": false, "mask_stops": true, "cesar_buckets": "0",
"cesar_exec_seq": false, "cesar_chain_limit": 100, "cesar_mem_limit": 16, "time_marks": null,
"u12": null, "stop_at_chain_class": false, "uhq_flank": 50, "o2o_only": false, "no_fpi": false,
"fragmented_genome": false, "ld_model": false, "annotate_paralogs": false, "mask_all_first_10p": false}
I sanity checked some of the transcript names and confirmed that the one in the isoform file can be found in the bed12 file. I think I actually co-filtered these two files to make sure they match.
Yangkang
Hmm, then I don't know why the gene symbol does not show up. @kirilenkobm Do you know why?
I think there is no gene symbol in the bed12 file..?
Z 86002865 86025052 ENSGALT00010002860 0 - 86003109 86024602 0 18 422,104,169,94,226,147,129,87,99,69,99,93,69,144,117,86,100,292, 0,1066,2784,3399,5151,6095,10946,12391,13334,13995,15561,16784,17943,20038,20739,21312,21673,21895,
Z 85811534 85919193 ENSGALT00010002803 0 + 85811643 85919176 0 10 155,166,198,65,129,176,130,102,87,94, 0,5298,9475,14118,15342,41448,61592,85325,101520,107565,
Z 85688003 85790362 ENSGALT00010002734 0 - 85688903 85787282 0 10 1346,118,96,36,171,126,44,702,123,81, 0,1771,6186,22423,33374,35214,36506,98594,101077,102278,
So I cannot think of sources where TOGA can extract symbols from. Am I wrong?
Yangkang
Good catch. This is the problem.
If you run TOGA with the isoform file, it will consider the different transcripts of a gene, but it will still print what is given in the bed12 file that lists the transcript coordinates.
Pls append the gene symbol to the transcript name in the bed file. For our human annotation, we have
g ENST00000420798 toga.transcripts.bed toga.isoforms.tsv
toga.transcripts.bed:chrX 54920562 54931429 ENST00000420798.TRO 0 + 54924990 54931020 0 11 316,105,64,80,92,80,43,63,115,2429,226, 0,3888,4107,4426,5029,5847,6020,6480,7104,8040,10641,
toga.isoforms.tsv:ENSG00000067445 ENST00000420798.TRO
Hi!
Great software. I wish to know the meaning of the header in the
codon.fasta
output:For example:
Question1: What does the ".17" mean here?
Background info: When I call
cat codon.fasta|grep '>'|cut -d'.' -f1|wc -l
, I got 64484. When I callcat codon.fasta|grep '>'|cut -d'.' -f1|sort -u|wc -l
, I got 18538.Question2: If the .17 means the starting position of the alignment, what is the best way to generate multi-species codon alignment for each gene, as mention in the Science paper (how to concatenate them, and pass them to MACSE)? And should I only include transcripts annotated as "one2one" in the "orthology_classification.tsv" for downstream comparative genomic positive selection analysis?
Thanks in advance! Yangkang