twlab / TEProf2Paper

TEProf2 Pipeline used to find promoters and predict protein sequences from RNA-sequencing data
Other
18 stars 6 forks source link

New Gencode Version #18

Closed mpizzagalli777 closed 5 months ago

mpizzagalli777 commented 5 months ago

Hi, I have a question that was mostly answered elsewhere (#11). However, I was wondering if it would be possible to update the input files in order to get around this issue. In other words, if I followed the steps you outlined in the README, and create the new reference files with my version of gencode would the pipeline work? Or is it a requirement to use the v25 version of gencode?
Thanks so much!

mpizzagalli777 commented 5 months ago

Just to clarify, I am actually able to run through all of the following scripts, until the final commands max_speed.py step. All other steps worked correctly.

python $teproflib/rmskhg38_annotate_gtf_update_test_tpm.py \
    $wkdir/${sample}_alignment/${sample}_Aligned.sortedByCoord.out.gtf \
    $teproflib/arguments.txt

python $teproflib/annotationpmprocess.py \
    $wkdir/${sample}_TEProf2/${sample}_Aligned.sortedByCoord.out.gtf_annotated_filtered_test_all

Rscript $teproflib/aggregateProcessedAnnotation.R $wkdir/${sample}_TEProf2/${sample}_Aligned.sortedByCoord.out.gtf_annotated_filtered_test_all_c \
    -k yes -a $teproflib/arguments.txt

mkdir $wkdir/filterreadstats

python $teproflib/commandsmax_speed.py $wkdir/${sample}_TEProf2/filter_combined_candidates.tsv $wkdir/${sample}_alignment/

The output of head filter_combined_candidates.tsv -n 2 is as follows:

STRG.9484.3 None None None None None None None None None None None None Yes protein_coding PTN chr7 137343439 137343865 exon 1 chr7;137254971;137254973 137227341 137343865 137254859,137254974,137343439,137343865,137254975,137343438,137253638,137254858,137253464,137253637,137251230,137251391,137228076,137251229,137251392,137253463,137227341,137228075 ENST00000348225.6 chr7 137345817 137227352 -,chr7,137228075,137227352,137251391,137251230,137253637,137253464,137254974,137254859,137343699,137343439,137345130,137345066,137345817,137345647 None chr7 137345640 137346278 MER41B 4398 - - 119.900139 6.754386 22.363449 888.963501 165.807068 250.236212 SRR7820938_Aligned.sortedByCoord.out 1 0 0.134876335040892 0.0893693555431538 LTR ERV1 MER41B_137345640_None_None_None_PTN_exon_1_137227341_SRR7820938_Aligned.sortedByCoord.out 171 137345131 137345646 MER41B_137345640_None_None_None_PTN_exon_1_137227341 STRG.59.2 protein_coding ATAD3B chr1 1477351 1478643 intron 2 chr1;1471885;1471887 1471769 1496202 1487863,1487914,1495485,1496202,1471769,1472089,1490672,1495484,1490257,1490424,1479109,1480866,1487915,1489203,1486110,1486235,1490563,1490671,1485016,1485171,1478644,1478745,1478746,1479048,1486236,1486543,1482304,1482544,1477274,1477350,1486669,1487862,1485172,1485781,1482615,1485015,1482138,1482303,1485839,1486109,1482545,1482614,1479049,1479108,1490425,1490562,1472090,1477273,1486544,1486668,1489204,1489274,1489275,1490256,1485782,1485838,1480867,1480936,1477351,1478643,1480937,1482137 ENST00000308647.7 Yes protein_coding ATAD3B chr1 1479049 1479108 exon 4 chr1;1471885;1471887 1471769 1496202 1487863,1487914,1495485,1496202,1471769,1472089,1490672,1495484,1490257,1490424,1479109,1480866,1487915,1489203,1486110,1486235,1490563,1490671,1485016,1485171,1478644,1478745,1478746,1479048,1486236,1486543,1482304,1482544,1477274,1477350,1486669,1487862,1485172,1485781,1482615,1485015,1482138,1482303,1485839,1486109,1482545,1482614,1479049,1479108,1490425,1490562,1472090,1477273,1486544,1486668,1489204,1489274,1489275,1490256,1485782,1485838,1480867,1480936,1477351,1478643,1480937,1482137ENST00000308647.7 chr1 1477849 1496200 +,chr1,1477849,1478745,1479049,1479108,1480867,1480936,1482138,1482303,1482545,1482614,1485016,1485171,1485782,1485838,1486110,1486235,1486544,1486668,1487863,1487914,1489204,1489274,1490257,1490424,1490563,1490671,1495485,1496200 chr1,1478746,1479048,chr1,HAVANA,intron,1478746,1479048,.,+,.,gene_id ENSG00000160072.19; transcript_id ENST00000472194.6; gene_type protein_coding; gene_status KNOWN; gene_name ATAD3B; transcript_type retained_intron; transcript_status KNOWN; transcript_name ATAD3B-002; intron_number 1; exon_id ENSE00001943609.1; level 2; transcript_support_level 1; havana_gene OTTHUMG00000000577.5; havana_transcript OTTHUMT00000001370.3;,transcript_id ENST00000472194.6,14 chr1 1477677 1478003 AluSp 2149 - + 94.325249 251.117706 17.59329 94.325249 17.59329 18.252984 SRR7820938_Aligned.sortedByCoord.out 1 0 1 0.963858292978288 SINE Alu AluSp_1477677_ATAD3B_intron_2_ATAD3B_exon_4_1471769_SRR7820938_Aligned.sortedByCoord.out 897 1478746 1479048 AluSp_1477677_ATAD3B_intron_2_ATAD3B_exon_4_1471769

When I run commandmax_speed.py, the error is as follows: File "..../commandsmax_speed.py", line 35, in elementvecint = [int(x) for x I elementsvec.split(',')] ValueError: invalid literal for int() with base 10: 'None'

Do you see what the issue might be?

So sorry for the long follow up! Just wanted to provide some more context

mpizzagalli777 commented 5 months ago

In case someone has a similar question, this does not seem to be an issue with the Genecode version. Rather, it seems that using the -k flag on the aggregateProcessedAnnotation.R kept the TE transcripts only (as intended) but this caused issues for the commands later in the work flow due to the "None" values that are inserted.