mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

chromatinRNA-seq compatibility #147

Closed max-biguet closed 9 months ago

max-biguet commented 10 months ago

Hi, I've been trying to run TEcount on some chromRNA-seq data but am getting a lot of lowly expressed genes which is affecting differential analysis. It is also possible that my samples are of poor quality but I want to rule out alternatives.

I have a question regarding the compatibility of your tool with chromRNA-seq. chromRNA-seq captures many different species of RNA, including some that include intronic sequences. So I wonder if, like other counting tools, it is possible that TEcount only uses "exon" annoation lines from the GTF files to assign reads to genes? If so, is there a way I could force it to use "gene" annotation lines?

Thanks for your help and insight in this matter.

All the best,

Max

olivertam commented 10 months ago

Hi Max,

Thank you for your interest in the software.

We have not tested TEcount on chromatin RNAseq datasets, but we are not completely surprised that there are strange results coming from this. If you truly believe that those low-expressed genes are "artifacts", you could always set an expression threshold during your differential analysis to filter them out (prior to performing the statistical analysis).

It is theoretically possible to force TEcount to use the gene annotation lines by changing gene to exon in the gene GTF file. It will affect the ability to assess intronic TE, as unambiguously mapped reads that overlap TE and gene introns will be assigned to the gene (whether they are truly gene-derived or not). You're welcome to try it out, though we cannot predict the impact on your quantification.

Thanks.

max-biguet commented 10 months ago

thanks for your quick reply.

I can see what you mean with intronic TEs if I were to force it to use gene annotations. To be honest with you, I am really only interested in TEs, is there a way to use maybe an empty gene GTF so that only TE reads are counted and this would then circumvent the issue (i.e. force TEcount to only count TEs)?

EDIT. I am now realizing that I misunderstood, the TE GTF only contains "exon" sequences right? so this would only potentially fix differential gene expression rather than TE expression

olivertam commented 10 months ago

Hi Max,

We don't recommend having an empty gene GTF, as there are cases where gene exonic sequences overlap TE annotation, and without the gene information, they could be erroneously classified as TE. I would lean towards having an expression "threshold" across multiple samples to eliminate "low-expressors/non-expressors", and then perform differential expression with the rest. However, we do understand that it's difficult to make that threshold, and might be inherent to chromatin RNAseq.

And yes, the TE GTF contains only exonic sequences. However, there is some interplay between gene and TE annotations (specifically on uniquely/unambiguously mapped reads), so changing the gene GTF does have impact on both gene and TE quantification (and thus downstream analysis).

Thanks.

max-biguet commented 10 months ago

just FYI, I ran:

awk 'BEGIN {OFS="\t"} $3=="gene" {$3="exon"} 1' input.gtf > modified_input.gtf

and TEcount then throws an error:

Error occured in line 6 of file genes_as_exons.gtf. Error in building gene index

line 6 is the first instance of a change from "gene" to "exon", seems like this causes an issue immediately. Maybe I should only keep these ligns so that there arent multiple "exon" entries per line...I will see if I can figure something out. But thanks for your help.

olivertam commented 10 months ago

Hi,

Actually, I suspect that the error is that there is no transcript_id on the line. I think it might be better to use the transcript lines rather than the gene lines. As long as the gene_id is the same, they would all be aggregated correctly.

$ awk 'BEGIN {OFS="\t"} $3=="transcript" {$3="exon"} 1' input.gtf > modified_input.gtf

Thanks.

olivertam commented 10 months ago

Hi,

If you want to provide me with the GTF file, I can try to see if why your modifications did not work.

Thanks.

max-biguet commented 10 months ago

hi,

I've fixed the issue, it was formatting, I still dont understand exactly the issue with the earlier command but at least I found a solution. For future reference this is how I've modified the GTF, it can probably be simplified by someone more competent than me but this is the only thing that worked for me upon inspection of the GTF.

awk 'BEGIN {OFS="\t"}; $3=="gene" {print; getline; print; getline; printf "%s\t%s\t%s\t%d\t%d\t%s\t%s\t%s\t", $1, $2, "exon", $4, $5, $6, $7, $8; for (i=9; i<=NF; i++) printf "%s ", $i; printf "\n"; next}; $3=="transcript" {print; getline; printf "%s\t%s\t%s\t%d\t%d\t%s\t%s\t%s\t", $1, $2, "exon", $4, $5, $6, $7, $8; for (i=9; i<=NF; i++) printf "%s ", $i; printf "\n"; next}; {next}' gencode.v38.annotation.gtf > genes_as_exons.gtf

which copies the "transcript" genomic coordinates into the following "exon" line and removes all subsequent exons for that transcript - I figure they aren't important considering the transcript coordinates bookend the exon coordinates anyways. The TEcount tool ran with this.

The Differential TE and Gene expression list seems to be a subset of running the TEcount tool as normal. I was expecting the opposite since I thought I would have more intronic reads in my data but maybe not. It was worth a shot anyways and may reflect more biologically relevant transcripts for me. There are still some TE families regulated and I suspect these must mostly not be intronic as like you said the software would probably count those for the corresponding gene, but interesting nonetheless.

Thanks for your help

olivertam commented 10 months ago

Hi,

I'm glad to hear that it worked. I wonder if it could work as follows:

awk 'BEGIN {FS="\t"; OFS="\t"}; $3=="gene"; $3=="transcript" {$3="exon";print}' gencode.v38.annotation.gtf > genes_as_exons.gtf

That way, I am only printing the lines with gene unchanged, and only lines with transcript where I switched transcript to exon. It would then also skip other lines that have exon.

I also wonder if by including introns, you're adding a lot more variability in the captured expression of some genes, such that the "difference" between your two conditions are no longer as distinct. That could explain why some of your previous hits might have dropped out.

Thanks.

max-biguet commented 10 months ago

Hi, yes sorry for the earlier hassle with awk, i dont know why but there was an issue with the tab separator. Happy to try your command at some point and report back.

I think you may be right concerning the variability, but to be honest the PCA looks pretty poor so maybe some experimental issues. I am planning to look at coverage tracks over TE families to confirm the TEcount output. Will let you know how it goes :)

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days