Reproducing results for 5'CAGE data

Hi there

First of all, thank you for publishing the code and pipeline.

I was wondering if you could share more details on how you processed your data. I downloaded the data you generated for the COV413A cell line and processed it according to your pipeline. Some additional preprocessing steps were necessary, including splitting the interleaved FASTQ files into separate mate files and running STAR and StringTie. These are the candidate transcripts you recover (Supplementary Table 9, filtered on COV413A):

| Transcript ID | Class | Family | Subfam | Chr TE | Start TE | End TE | Location TE | Gene | Splice Target | Strand | Cell Line | CAGE TPM |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| TCONS_00027238 | DNA | hAT-Charlie | MER1B | chr12 | 130340312 | 130340636 | intron_1 | PIWIL1 | exon_2 | + | COV413A | 0,396505544 |
| TCONS_00034780 | LINE | L1 | L1PA2 | chr14 | 71842964 | 71848996 | Intergenic | RGS6 | exon_2 | + | COV413A | 3,105960093 |
| TCONS_00055478 | LINE | L1 | L1PA2 | chr18 | 34552378 | 34558395 | Intergenic | DTNA | exon_2 | + | COV413A | 0,660842573 |
| TCONS_00086600 | LINE | L1 | L1PA2 | chr3 | 58842154 | 58848179 | Intergenic | FAM3D | exon_2 | - | COV413A | 0,396505544 |
| TCONS_00098838 | LINE | L1 | L1PA2 | chr5 | 102671229 | 102677260 | Intergenic | SLCO6A1 | exon_2 | - | COV413A | 0,72692683 |
| TCONS_00103663 | LINE | L1 | L1PB1 | chr6 | 7347074 | 7349650 | intron_8 | CAGE1 | exon_9 | - | COV413A | 0,396505544 |
| TCONS_00107032 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_2 | + | COV413A | 0,72692683 |
| TCONS_00107035 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_5 | + | COV413A | 0,72692683 |
| TCONS_00107037 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | SCIN | exon_2 | + | COV413A | 0,72692683 |
| TCONS_00116734 | LINE | L1 | L1PA2 | chr8 | 66949103 | 66955119 | intron_3 | TCF24 | exon_4 | - | COV413A | 0,660842573 |
| TCONS_00119408 | LINE | L1 | L1PA2 | chr9 | 94089082 | 94095103 | intron_4 | PTPDC1 | exon_5 | + | COV413A | 0,330421286 |
| TCONS_00070187 | LTR | ERV1 | LTR7 | chr2 | 38086114 | 38086512 | Intergenic | CYP1B1 | exon_2 | - | COV413A | 15,92630601 |
| TCONS_00074167 | LTR | ERV1 | LTR2B | chr20 | 15985767 | 15986246 | intron_13 | MACROD2 | exon_14 | + | COV413A | 0,396505544 |
| TCONS_00089490 | LTR | ERV1 | LTR2B | chr4 | 37546188 | 37546669 | intron_1 | C4orf19 | exon_2 | + | COV413A | 0,859095345 |
| TCONS_00105271 | LTR | ERVL | LTR18A | chr6 | 79313214 | 79313548 | Intergenic | HMGN3 | exon_1 | - | COV413A | 2,841623064 |
| TCONS_00016149 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_1 | - | COV413A | 0,991263859 |
| TCONS_00016150 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_2 | - | COV413A | 0,991263859 |
| TCONS_00030551 | SINE | Alu | AluJo | chr12 | 121847358 | 121847535 | intron_9 | HPD | exon_10 | - | COV413A | 0,330421286 |
| TCONS_00041268 | SINE | Alu | AluY | chr15 | 51603584 | 51603891 | intron_1 | DMXL2 | exon_2 | - | COV413A | 1,652106432 |

I recover these - sorry for the truncated output.

I used the hg38 reference genome and GTF, your reference data download, and your pre-defined arguments.txt. I hope that together we can get to the bottom of why I don't recover any of the same TE chimeras as you.
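For context on the preprocessing, splitting an interleaved FASTQ into separate mate files could look roughly like the sketch below. It is only an illustration, not necessarily the exact procedure used here: it assumes standard four-line FASTQ records with mates strictly alternating (R1, R2, R1, R2, ...), and the function name and file paths are hypothetical.

```python
import gzip
from itertools import islice


def _open(path, mode):
    """Open plain or gzip-compressed text files based on the file extension."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def deinterleave_fastq(interleaved, out_r1, out_r2):
    """Split an interleaved FASTQ into R1 and R2 files.

    Assumes 4-line records with mates alternating R1, R2 (an assumption about
    the input, not something checked against read names).
    Returns the number of read pairs written.
    """
    pairs = 0
    with _open(interleaved, "rt") as src, \
            _open(out_r1, "wt") as r1, \
            _open(out_r2, "wt") as r2:
        while True:
            rec1 = list(islice(src, 4))  # next R1 record (4 lines)
            if not rec1:
                break  # clean end of file
            rec2 = list(islice(src, 4))  # matching R2 record
            if len(rec1) != 4 or len(rec2) != 4:
                raise ValueError("truncated or unpaired FASTQ record")
            r1.writelines(rec1)
            r2.writelines(rec2)
            pairs += 1
    return pairs
```

A call such as `deinterleave_fastq("sample.fastq.gz", "sample_R1.fastq.gz", "sample_R2.fastq.gz")` (hypothetical file names) would then give the two mate files that a paired-end STAR run expects. If the mates are not strictly alternating in the input, the read pairing and everything downstream of it would be affected.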

Best regards,
Nanna
