salzman-lab / SICILIAN

GNU General Public License v2.0
19 stars 11 forks source link

Bug: create_annotator.py #14

Closed zyh4482 closed 2 years ago

zyh4482 commented 2 years ago

When using gencode vM29 mouse annotation gtf to create the three files, error message comes out.

  File "/home/zyh/software/SICILIAN/scripts/create_annotator.py", line 20, in get_exon_number
    return int(row["attribute"].split("exon_number")[-1].split("\"")[1].split(";")[0])
ValueError: invalid literal for int() with base 10: 'ENSMUSE00001343744.2'

After debugging, I found that the line int(row["attribute"].split("exon_number")[-1].split("\"")[1].split(";")[0]) does not correctly split the string. It returns the gene id instead of exon number, which produced this error.

str(gtf_df.loc[0,['attribute']]).split("exon_number")[-1].split("\"")[1] #report the gene id ENSMUSG00000102693.2 instead of exon number

zyh4482 commented 2 years ago

I forgot the example attribute string:

gene_id "ENSMUSG00000102693.2"; transcript_id "ENSMUST00000193812.2"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; exon_number 1; exon_id "ENSMUSE00001343744.2"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

zyh4482 commented 2 years ago

The correct matching pattern should be: int(strings.split("exon_number")[-1].split(";")[0]).

To compare with gencode v40 hg38 gtf, I paste the attribute column here:

gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

Did I do it correctly? The gtf headers are consistent between different species. Is it an issue about different release version or gencode source is not supported? Did you have the same issue before?

Thanks.

zyh4482 commented 2 years ago

Output example for gtf_df after correction:

seqname source feature start end score strand frame attribute transcript_id exon_number chr1 HAVANA exon 3143476 3144545 . + . "gene_id ""ENSMUSG00000102693.2""; transcript_id ""ENSMUST00000193812.2""; gene_type ""TEC""; gene_name ""4933401J01Rik""; transcript_type ""TEC""; transcript_name ""4933401J01Rik-201""; exon_number 1; exon_id ""ENSMUSE00001343744.2""; level 2; transcript_support_level ""NA""; mgi_id ""MGI:1918292""; tag ""basic""; tag ""Ensembl_canonical""; havana_gene ""OTTMUSG00000049935.1""; havana_transcript ""OTTMUST00000127109.1"";" ENSMUST00000193812.2 1

The terminal shows the following information. Three pickle files are created. Can you please check if the log indicates that the code successfully executed?

dumped exon boundary annotator to /home/zyh/ref/sicilian/gencode_vM29_exon_bounds.pkl dumped splice junctions annotator to /home/zyh/ref/sicilian/gencode_vM29_splices.pkl chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY dumped annotator to /home/zyh/ref/sicilian/gencode_vM29.pkl

Additional: I'll paste attribute column of different gtf version here.

gencode v30 gtf: gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

gencode v40 gtf: gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

roozbehdn commented 2 years ago

Hi zyh4482, thanks for digging into this bug. Yes, your log files are correct and you should have created 3 pickle files successfully by now. The issue was that the assumed format in the create_annotator script was based on having quotation marks for the number after exon number, i.e., the script assumed that the field for exon number is in fact exon_number "1". Your gtf file does not have "" for exon number and that was why you were getting the error from the original script. We have now fixed the script so that it can be run on either format for exon number.

zyh4482 commented 2 years ago

Great. Thanks for your reply.