How to exclude genes from out.cntTable OR extract counts for specific TE family

mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.

http://hammelllab.labsites.cshl.edu/software/#TEtranscripts

GNU General Public License v3.0

229 stars 30 forks source link

How to exclude genes from out.cntTable OR extract counts for specific TE family #149

Closed farhan-lab closed 1 year ago

farhan-lab commented 1 year ago

Hi,

Thanks alot to developers of such a great tool for TE analysis. I have run TEcount on multiple samples and got the output files out.cntTable. As the output file also has count data for both genes and TEs, and my desired output is to focus certain repeats families. I wonder how may I exclude genes count from this list and keep repeats only? Also, whether I can extract a sub-set of this dataset for each specific TE family such as L1 or ERV ? I think if I can do grep command to match the transcript ID column from gtf file with first column of out.cntTable ?

Thanks, Best regards, Farhan

olivertam commented 1 year ago

Dear Farhan,

Thank you for your interest in the software. If you are hoping to perform differential analysis, we highly recommend keeping the genes to ensure proper calculation of sample variance. If you're just interested in TE, you could always grep for the colon (:). For L1, you could do the following:

$ grep -e ":L1" out.cntTable > L1.txt

For ERV, it depends on which ones you want, but you could potentially capture most of them this way:

$ grep -e ":ERV" out.cntTable > ERV.txt

Hopefully this is helpful. Thanks.

farhan-lab commented 1 year ago

Many Thanks for your recommendation to keep both genes and TE together for differential analysis.

As I am doing correlation matrix heatmap (Pearson's correlation coefficients) to see data clusters for my control and samples. Keeping both genes and TE in out.cntTable nicely separated two clusters for control and samples in my analysis, but I would like to check this correlation separately for genes and TEs. I was thinking if there is any way to exclude genes from repeats using the gtf file?

Thanks for the grep commands. It really worked for me to separate the specific TE.

Best regards, Farhan

olivertam commented 1 year ago

Hi Farhan,

You should be able to extract just the genes or TE using grep:

$ head -n 1 out.cntTable > header.txt
$ sed '1d' out.cntTable | grep ":" | cat header.txt - > TE.txt
$ sed '1d' out.cntTable | grep -v ":" | cat header.txt - > genes.txt

Please let me know if that does not address your issue.

Thanks.

farhan-lab commented 1 year ago

Hi Oliver,

I test your commands on my out.cntTable file which has total 231135 lines (repeats + genes count).

The extracted TE.txt file output only 1311 lines, which seem that the command could not full extract all repeats/TEs. While the extracted genes.txt output remaining 231135 lines and because we expect around 63,494 genes according to T2T human genome.

Thanks, Farhan

olivertam commented 1 year ago

Hi Farhan,

Might I ask which gene GTF are you using? You're right that 231k lines is much more than expected. If you want to provide a copy of the count table, I can also check if there's an error. The 1311 lines in the TE file is as expected, as there are about that number of TE subfamilies in the human genome (and TEcount aggregates by subfamily).

Thanks.

farhan-lab commented 1 year ago

Hi Oliver,

I used the same GTF files as youprovided in another issue I downloaded the GTF files for genes from UCSC browser and TEs from your provided link. I merged both GTF for RNA-seq mapping using STAR, and run TEcount for each of my samples Please see the attached an example of my sample out.cntTable file.

Thanks, Best, Farhan example_sample_1.out.cntTable.txt

olivertam commented 1 year ago

Hi Farhan,

I've noticed something odd with your count table. For every gene symbol, there seems to at least four or five "variants" to it (each with a -2NN, where N is a number from 0-9). This is definitely inflating the gene list. I'm wondering if there is an issue with your gene GTF. You're welcome to provide an excerpt if you like.

You can download the latest T2T (hs1 in UCSC) Refseq GTF here, and see if this improves the run.

Note: you don't have to include the TE GTF for STAR alignment, though you are welcome to. It does improve the ability to detect some "spliced" TE transcripts, but they are mainly trying to handle intra-TE insertions.

Another alternative is to remove the -2NN (e.g. withsed) and then group the names together (e.g. with groupBy from bedtools). It doesn't completely fix all the entries (7SK and rRNA are still weird), but it cleans up most of the errors

$ head -n 1 out.cntTable > header.txt
$ sed '1d;s/-[0-9]\+\"/\"/;' out.cntTable | sort -k1,1 | groupBy -g 1 -c 2 -o sum | cat header.txt - > grouped.cntTable

I've attached a fixed version of your count table.

Thanks

farhan-lab commented 1 year ago

Thanks so much !!! I will try to run again with Refseq GTF and see if it improves. I will also check your alternative solution to remove the -2NN and group names. I will update here.

farhan-lab commented 1 year ago

Hi Oliver,

Sorry for updating late. I was checking both solutions, it seems the Refseq GTF file has also multiple variants for each gene symbol as I count the total lines in hs1.ncbiRefSeq.gtf are 4667175.

I also tried you second fix to group the names together, and it works very well on a two columns "out.cntTable" file. I was wondering whether there is a way to do same on merged count table with multiple counts columns (around 42 columns corresponding to multiple samples)? I tried to modify the "-c" option in groupBy but it gives an error: ***** ERROR: Requested column 2, but database file - only has fields 1 - 0.

I have attached an example of my merged file.

Thanks for your help. Best, Farhan Example_merged_count.txt

olivertam commented 1 year ago

Hi Farhan,

Yes, that is true. However, TEtranscripts only looks at the gene_id of the GTF, and there are about ~57k unique gene_id. I'll be happy to check your other GTF to see if it's using the transcript_name for its gene_id value.

With your example file:

$ head -n 1 Example_merged_count.txt > header.txt
$ sed '1d;s/-[0-9]\+//' Example_merged_count.txt | sort -k1,1 | groupBy -g 1 -c 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43 -o sum | cat header.txt - > grouped_merged_count.txt

I have attached the output here.

Thanks.

farhan-lab commented 1 year ago

Hello Oliver,

Sorry for late update on the issue.

Thanks so much for the Refseq GTF file. It already resolved my problem. The sed commands were also very useful to extract genes and repeats from the count files in my analysis. You may close this issue.

Many thanks, Best regards, Farhan