twlab / TEProf2Paper

TEProf2 Pipeline used to find promoters and predict protein sequences from RNA-sequencing data

Reproducing results for 5'CAGE data #6

Open nannabarnkob opened 1 year ago

nannabarnkob commented 1 year ago

Hi there

First of all, thank you for your work on publishing the code and pipeline.

I was wondering if you could share more details on how you processed your data. I have downloaded the data you generated for the COV413A cell line and processed it according to your pipeline. Of course, some additional preprocessing steps were necessary, including splitting the interleaved FASTQ into separate read files and running STAR and StringTie. These are the candidate transcripts you recover (Supplementary Table 9, filtered on COV413A):

| Transcript ID | Class | Family | Subfam | Chr TE | Start TE | End TE | Location TE | Gene | Splice Target | Strand | Cell Line | CAGE TPM |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| TCONS_00027238 | DNA | hAT-Charlie | MER1B | chr12 | 130340312 | 130340636 | intron_1 | PIWIL1 | exon_2 | + | COV413A | 0,396505544 |
| TCONS_00034780 | LINE | L1 | L1PA2 | chr14 | 71842964 | 71848996 | Intergenic | RGS6 | exon_2 | + | COV413A | 3,105960093 |
| TCONS_00055478 | LINE | L1 | L1PA2 | chr18 | 34552378 | 34558395 | Intergenic | DTNA | exon_2 | + | COV413A | 0,660842573 |
| TCONS_00086600 | LINE | L1 | L1PA2 | chr3 | 58842154 | 58848179 | Intergenic | FAM3D | exon_2 | - | COV413A | 0,396505544 |
| TCONS_00098838 | LINE | L1 | L1PA2 | chr5 | 102671229 | 102677260 | Intergenic | SLCO6A1 | exon_2 | - | COV413A | 0,72692683 |
| TCONS_00103663 | LINE | L1 | L1PB1 | chr6 | 7347074 | 7349650 | intron_8 | CAGE1 | exon_9 | - | COV413A | 0,396505544 |
| TCONS_00107032 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_2 | + | COV413A | 0,72692683 |
| TCONS_00107035 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_5 | + | COV413A | 0,72692683 |
| TCONS_00107037 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | SCIN | exon_2 | + | COV413A | 0,72692683 |
| TCONS_00116734 | LINE | L1 | L1PA2 | chr8 | 66949103 | 66955119 | intron_3 | TCF24 | exon_4 | - | COV413A | 0,660842573 |
| TCONS_00119408 | LINE | L1 | L1PA2 | chr9 | 94089082 | 94095103 | intron_4 | PTPDC1 | exon_5 | + | COV413A | 0,330421286 |
| TCONS_00070187 | LTR | ERV1 | LTR7 | chr2 | 38086114 | 38086512 | Intergenic | CYP1B1 | exon_2 | - | COV413A | 15,92630601 |
| TCONS_00074167 | LTR | ERV1 | LTR2B | chr20 | 15985767 | 15986246 | intron_13 | MACROD2 | exon_14 | + | COV413A | 0,396505544 |
| TCONS_00089490 | LTR | ERV1 | LTR2B | chr4 | 37546188 | 37546669 | intron_1 | C4orf19 | exon_2 | + | COV413A | 0,859095345 |
| TCONS_00105271 | LTR | ERVL | LTR18A | chr6 | 79313214 | 79313548 | Intergenic | HMGN3 | exon_1 | - | COV413A | 2,841623064 |
| TCONS_00016149 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_1 | - | COV413A | 0,991263859 |
| TCONS_00016150 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_2 | - | COV413A | 0,991263859 |
| TCONS_00030551 | SINE | Alu | AluJo | chr12 | 121847358 | 121847535 | intron_9 | HPD | exon_10 | - | COV413A | 0,330421286 |
| TCONS_00041268 | SINE | Alu | AluY | chr15 | 51603584 | 51603891 | intron_1 | DMXL2 | exon_2 | - | COV413A | 1,652106432 |

I recover these (sorry for the truncated output): [image]

I have used the hg38 reference genome and GTF, your reference data download, and your pre-defined arguments.txt. I hope that together we can get to the bottom of why I don't recover any of the same TE chimeras as you.
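For reference, the FASTQ de-interleaving step mentioned above can be sketched roughly as follows. This is only a minimal illustration and not necessarily the exact procedure used here: it assumes strictly alternating mates (R1, R2, R1, R2, ...) and four-line records, and the function name `deinterleave_fastq` is made up for this example.

```python
import io
from itertools import islice


def deinterleave_fastq(interleaved, out_r1, out_r2):
    """Split an interleaved FASTQ stream into separate R1 and R2 streams.

    Assumes strictly alternating mates and four-line records
    (no wrapped sequence lines). Returns the number of read pairs written.
    """
    pairs = 0
    while True:
        # Read one pair at a time: 4 lines for R1 followed by 4 lines for R2.
        record = list(islice(interleaved, 8))
        if not record:
            break
        if len(record) != 8:
            raise ValueError("truncated input: incomplete read pair")
        if not (record[0].startswith("@") and record[4].startswith("@")):
            raise ValueError("malformed FASTQ header")
        out_r1.writelines(record[:4])
        out_r2.writelines(record[4:])
        pairs += 1
    return pairs


if __name__ == "__main__":
    # Tiny in-memory demo with a single made-up read pair.
    demo = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n")
    r1, r2 = io.StringIO(), io.StringIO()
    print(deinterleave_fastq(demo, r1, r2), "pair(s) written")
```

For real data, the three arguments would be file handles opened in text mode (for example via `gzip.open(path, "rt")` and `gzip.open(path, "wt")` for compressed files).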

Best regards, Nanna

nakul2234 commented 1 year ago

Hello,

This pipeline is to be used with short-read (ideally paired-end) RNA sequencing data to help find potential TE promoters. The data that you downloaded was nanoCAGE data, which can help validate promoter locations. Thus, you should not use this pipeline on the nanoCAGE data itself. The nanoCAGE data will help define promoters accurately, but it will normally not be able to assemble the full-length transcript.
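To give a rough picture of what "validating a promoter location" can mean, one simple check is whether a CAGE peak overlaps an annotated TE on the same chromosome and strand. The sketch below is only an illustration under assumptions and is not the procedure from the paper: the `Interval` type and `peak_supports_te` function are hypothetical, coordinates are treated as inclusive, the strand from the table is assumed to apply to the TE-derived promoter, and the two peak positions are invented. Only the LTR7 coordinates come from the table in the first comment.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive (assumption)
    end: int    # inclusive (assumption)
    strand: str


def peak_supports_te(peak: Interval, te: Interval) -> bool:
    """Return True if the peak overlaps the TE on the same chromosome and strand."""
    return (
        peak.chrom == te.chrom
        and peak.strand == te.strand
        and peak.start <= te.end
        and te.start <= peak.end
    )


# TE coordinates taken from the LTR7 / CYP1B1 row of the table.
ltr7 = Interval("chr2", 38086114, 38086512, "-")

# Hypothetical CAGE peaks, invented purely for illustration.
inside = Interval("chr2", 38086300, 38086320, "-")
outside = Interval("chr2", 38090000, 38090020, "-")

print(peak_supports_te(inside, ltr7))   # True
print(peak_supports_te(outside, ltr7))  # False
```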

Details on how to process the nanoCAGE data are in the Supplementary Methods section of our paper: https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-023-01349-3/MediaObjects/41588_2023_1349_MOESM1_ESM.pdf In addition, the following paper introduced the method and describes it in more detail: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1670-6

In addition, we used the cell lines to validate the TE-gene chimeras seen in the tumor samples. The cell lines could contain new TE-gene chimeras that were not part of our reference.

Best, Nakul