twlab / TEProf2Paper

TEProf2 Pipeline used to find promoters and predict protein sequences from RNA-sequencing data

rmskhg38_annotate_gtf_update_test_tpm.py ValueError: invalid literal for int() with base 10: '1,transcript_id' #11

Closed songlyzz closed 10 months ago

songlyzz commented 11 months ago

Hi sir,

Thank you for developing this powerful tool. I have a question. I currently have BAM files that were mapped with STAR, and I ran StringTie on them to get the GTF files. The format looks like this:

head PD1-to-CTLA4_Aligned.sortedByCoord.out.gtf
chr1 StringTie transcript 14978 15907 1000 - . gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "1.641618"; FPKM "0.303210"; TPM "0.745735";
chr1 StringTie exon 14978 15038 1000 - . gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "1.278689";
chr1 StringTie exon 15796 15907 1000 - . gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "2"; cov "1.839286";
chr1 StringTie transcript 16959 17243 1000 - . gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "3.268518"; FPKM "0.603702"; TPM "1.484784";
chr1 StringTie exon 16959 17055 1000 - . gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "3.525773";
chr1 StringTie exon 17233 17243 1000 - . gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "2"; cov "1.000000";
chr1 StringTie transcript 17316 17659 1000 - . gene_id "STRG.3"; transcript_id "STRG.3.1"; cov "6.261682"; FPKM "1.156545"; TPM "2.844484";
chr1 StringTie exon 17316 17368 1000 - . gene_id "STRG.3"; transcript_id "STRG.3.1"; exon_number "1"; cov "9.886792";

But when I run rmskhg38_annotate_gtf_update_test_tpm.py, an error occurs:

Traceback (most recent call last):
  File "/TEProf2Paper/bin/rmskhg38_annotate_gtf_update_test_tpm.py", line 661, in <module>
    firstintronanno = annotateintron(chromosome, exon1end, exonstarts[1], strand)
  File "/TEProf2Paper/bin/rmskhg38_annotate_gtf_update_test_tpm.py", line 166, in annotateintron
    intronnums.append(int(i[3].split("; ")[8].split(" ")[1]))
ValueError: invalid literal for int() with base 10: '1,transcript_id'

A few files run without problems, though, so I do not know what is wrong. I hope you can help me!

Regards, Song

nakul2234 commented 11 months ago

Hello,

Thank you for your comments on the pipeline! I have not seen that error before, but it looks like a line that does not follow the normal GTF format is preventing the code from processing the file correctly.

Can you first confirm you are using Python 2?

Second, can you find the line that has '1,transcript_id' in it from the file? The code relies on the standard GTF format to extract things like the exon number, so if the line is different from the others it may cause this error.
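One way to locate such a line is sketched below. This is only an illustration, not part of the TEProf2 pipeline: the helper name `find_lines_with` and the second sample record are made up for the example, and the code is written for Python 3 even though the pipeline itself is run with Python 2.

```python
def find_lines_with(lines, needle):
    """Return (line_number, line) pairs for every line containing `needle`.

    `lines` can be any iterable of strings, for example an open file handle.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line:
            hits.append((number, line.rstrip("\n")))
    return hits


# The first record is copied from the StringTie output in this issue.
# The second record is HYPOTHETICAL and only stands in for a malformed line.
sample = [
    'chr1 StringTie exon 14978 15038 1000 - . gene_id "STRG.1"; '
    'transcript_id "STRG.1.1"; exon_number "1"; cov "1.278689";\n',
    'hypothetical malformed record: exon_number 1,transcript_id fused\n',
]

for number, line in find_lines_with(sample, "1,transcript_id"):
    print(number, line)
```

To scan a real file, pass an open handle instead of the list, for example `with open(path) as handle: find_lines_with(handle, "1,transcript_id")`, where `path` points to whichever input or annotation file is being checked. On the command line, `grep -n '1,transcript_id' <file>` gives the same information.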

-Nakul

songlyzz commented 10 months ago

Hi Nakul, I checked my pipeline and found that the error was caused by the annotation file. I switched to your Gencode v25 file, and this time it works well. Sincere thanks for your help. I hope you can provide the latest version of Gencode, because the Python script does not work well with it. Regards, Song

nakul2234 commented 10 months ago

Hello,

I am glad it was able to be resolved! The latest versions of Gencode may use a different format and thus may not work with our existing framework.
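The traceback above shows the exon number being read by position (`split("; ")[8].split(" ")[1]`), which is sensitive to the order and number of attributes. As a rough sketch of a more tolerant alternative, the attribute column can be parsed by key instead. This is not the TEProf2 implementation: the names `ATTRIBUTE_PATTERN`, `parse_gtf_attributes`, and `exon_number` are assumptions made for the example, and the reordered and unquoted records are hypothetical inputs, not lines taken from a Gencode release.

```python
import re

# Matches `key "value";` as well as `key value;` (unquoted value).
ATTRIBUTE_PATTERN = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^\s;"]+))\s*;')


def parse_gtf_attributes(attribute_field):
    """Map each attribute key in a GTF attribute column to its value."""
    return {
        key: quoted or bare
        for key, quoted, bare in ATTRIBUTE_PATTERN.findall(attribute_field)
    }


def exon_number(attribute_field):
    """Return exon_number as an int, or None if the attribute is absent."""
    value = parse_gtf_attributes(attribute_field).get("exon_number")
    return int(value) if value is not None else None


# Attribute column copied from the StringTie output in this issue.
stringtie = 'gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "1.278689";'
# HYPOTHETICAL records: same keys in another order, and an unquoted number.
reordered = 'transcript_id "STRG.1.1"; cov "1.278689"; exon_number "2"; gene_id "STRG.1";'
unquoted = 'gene_id "G1"; exon_number 3; level 2;'

for field in (stringtie, reordered, unquoted):
    print(exon_number(field))
```

Looking attributes up by name means a file with extra or reordered attributes still yields the right exon number, and a record that lacks the attribute returns `None` instead of raising a `ValueError` deep inside the annotation step.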

-Nakul