Creation of custom GTFs for TEs and how are solo Vs. full length TEs dealt with?

milcs40 commented 2 years ago

Hello,

First of all, thank you so much for this set of tools your lab shares. Really valuable.

I'm a "seasoned" molecular biologist, now giving my first few steps on bioinformatics. I've been recently involved in studying TEs' expression in different physiological and pathological contexts, and I have been using either your software suite, or your custom GTFs for my analysis.

I have a couple of questions regarding your "custom made" GTF files for TEs:

Could you share, roughly, how these are made? What kind of processing you do, from downloading the GTF from UCSC Repeatmasker, to the final custom file?
How do you deal with TEs that are annotated in a "fragmented" manner? This is, it's my impression that a full length hERV (e.g. and HERVH) will have an annotation for the LTR part, as well as the -int part. As such, when counting the number of reads mapping to this particular TE, we are counting twice, for both the LTR region and the internal region. Do you somehow concatenate, or deal with these elements in your GTF files?

Best wishes, Miguel

olivertam commented 2 years ago

Hi Miguel,

Thank you for your interest in the software. I'll be happy to answer your questions. My apologies for my long response below.

1) I download the RepeatMasker table (Variation and Repeats -> rmsk) from UCSC using the Table Browser, choosing the genome build that I want, and getting all fields from selected table. I then use the following perl script, makeTEgtf.pl to process the downloaded table to make the GTF file.

I've included the help/usage info:

 Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column] 
                     -o [strand column] -n [source] -t [TE name column] 
                     (-f [TE family column] -C [TE class column] -1)
                     [INFILE]

 Output is printed to STDOUT

 Required parameters:
  -c [chrom column]     -    Column containing chromosome name
  -s [start column]     -    Column containing feature start position
  -e [stop/end column]  -    Column containing feature stop/end position
  -o [strand column]    -    Column containing strand information (+ or -)
  -t [TE name column]   -    Column containing TE name
  [INFILE]              -    File name to be processed into GTF

 Optional parameters:
  -n [source]           -    Source of the TE information 
                             (e.g. mm9_rmsk for RepeatMasker track from
                              mm9 mouse genome)
                             Defaults to "user-provided" if not specified
  -f [TE family column] -    Column containing TE family name. 
                             Defaults to TE name if not specified
  -C [TE class column]  -    Column containing TE class name. 
                             Defaults to TE family name if not specified
  -S [score column]     -    Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
  -1                    -    Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                                      input files

In brief, the script takes each line from the table, and given the indicated columns for the various inputs, converts them into a TE GTF. The script currently removes low complexity and simple repeats, and a variety of "structural" RNA (e.g. snoRNA, tRNA) that are in the RepeatMasker output. We have debated whether we would remove Alpha/Beta Satellite Repeats, but we have kept them for now. The "key" thing that we do for the GTF is to add the TE family and TE class as the family_id and class_id fields, which we need in order for TEtranscripts. I would be happy to go through the Perl script in more detail, in case you want to rewrite it in a language that works better for you.

2) You raise a great question that we're still grappling with. You are absolutely correct that LTR elemets are typically "fragmented" into the LTR and internal regions. This is unfortunately due to the "mix and match" between the LTR and internal sequences for various elements, and thus was easier computationally to predict them separately. There are also issues with older TE (e.g. some L1) where another TE (e.g. Alu) has inserted within, and thus treated as two "instances".

To quickly answer your concern about "double-counting", a read that overlaps both the LTR and internal region will be initially assigned to both, but will be redistributed (via Expectation Maximization/EM algorithm) such that the read is "split" between the two annotations according to their relative abundance (based on other overlapping reads). Thus, the read would be only counted once with respect to the LTR and internal, though each portion might get a different "fraction" of that read.

This is not a trivial issue in our mind. For the internal scenario, we could potentially assess the predicted portion of the TE and check that the "split" TE pieces are adjacent to each other (based on the consensus sequence). This is probably doable, but would be somewhat manual and possibly subjective (and I'm avoiding scenarios of internal duplication). For LTR, there needs to be a robust curation of the various "fragments" to determine if they "belong" to the same element, or if they were found adjacent to each other by "chance". There are databases, such as HERVd which has tried to concatenate LTR elements into distinct "elements". In most cases, they are looking for adjacencies between the fragments (e.g. LTR next to internal, which is then next to LTR), and assign them as either soloLTR, internal sequence only, partially flanked LTR element, or fully flanked LTR element. We have taken a look at their database, and it could be converted to a format that is compatible with TEtranscripts. However, since they only work on LTR elements, you would need to add other TE back in (e.g. LINE, SINE).

Long story short: we completely agree that using RepeatMasker's annotation is imperfect (especially for LTR as you have outlined). However, we don't feel that we have sufficient expertise to manually curate the RepeatMasker output to definitively "concatenate" the LTR and internal regions such that they represent biologically meaningful element (especially when we have very little orthogonal information that could confirm whether our concatenated element is a transcriptional unit). Add the variation between different organisms, we decided to stick with a simpler model that could be easily applied to different genomes, though with the caveats that you have raised.

Please don't hesitate to contact me if you have other questions, concerns or comments.

Thanks.

milcs40 commented 2 years ago

Dear Oliver,

Thank you so much for your helpful answer. Really clear and enlightening. I don't know much about perl, but the script is written in a way that is really easy to follow.

Regarding the "fragmented" TEs, your answer makes perfect sense. I guess that, in the end of the day, only a manual curation would be able to precisely identify full(er) vs. truncated TEs. And even then, it would be difficult to judge TEs that have had other TEs inserting into them. As such, I do think that the "simpler model" is a good compromise.

When I raised this issue, I was thinking about certain tools like the "One code to find them all" http://doua.prabi.fr/software/one-code-to-find-them-all. I confess I didn't properly investigate the tool and I think it's likely geared towards LTRs... But I was wondering whether you ever considered using something similar to build your TEs' GTFs.

In the end of the day, I do think that there might be different biological relevance for full length vs truncated TEs, and that's the main reason I brought this question to you.

Thank you again for your valuable input. Best wishes, Miguel

olivertam commented 2 years ago

Hi Miguel,

Thank you for pointing us to One code to find them all. We will probably take a look at their approach (and their output) and see if it works sufficiently well for us. I definitely agree that there are probably different biological relevance between full length vs truncated (especially those embedded within gene introns). However, I suspect that the only way to really address their differences is to determine if these full-length vs fragments are actually transcribed under their own promoter, or merely read-through from nearby transcriptional units. That, I'm afraid, will require long read sequencing (which I feel the field is trying to do).

Thanks again for your interest in the software, and the great discussion.

milcs40 commented 2 years ago

Thank you for your answer and insight.

I do have two extra questions, regarding how TEtranscripts work (one following our discussion, and one somehow unrelated):

Do you think that it makes sense to distinguish between read-through Vs. own promoter transcription for TE units? Does TEtranscripts implement (or can you suggest) a way to keep only those TE units that can be "isolated" from the nearby transcriptional units you mention on your previous post?
Somehow unrelated: when performing differential expression analysis, do you use a single count table (this is, do you concatenate genes and TEs), or perform DEA for genes and TEs, separately? If you fuse, how do the read count distribution of the TEs (particularly when counting for classes) affects the general read count distribution per gene?

Thank you again... not only for the software, but also for your answers and reactiveness! Cheers, Miguel

olivertam commented 2 years ago

Hi Miguel,

Apologies for the long response.

1) This has been a major issue of debate within the TE field. The prevailing idea is that the TE "transcription" we're seeing in RNA-seq are not really driven by the transposable element, but are just read-through from other transcriptional unit. Thus, any alteration in their levels should be attributed to the transcriptional unit, rather than the TE itself. In contrast, if it is shown that the TE is driving their own transcription, then any changes in expression is actually associated with the altered regulation of the TE. As you can imagine, this can influence the interpretation of the differential analysis. One of the major difficulty in trying to "isolate" TE from nearby transcriptional unit is that we really have a very poor understanding of the "basal" transcription of the genome. Gene models are largely built upon mature mRNA sequences (from RT experiments), and thus might not account for downstream sequences that were also transcribed, but not processed the same way. Furthermore, the annotation of "transcriptional units" is ever changing, with long non-coding RNA becoming a significant part of the transcriptional output. Based on GENCODE annotations, we found that some long non-coding RNA are composed/derived from TE (sometimes with >70% of their exonic sequences overlapping TE), and thus trying to separate them might be very difficult. As such, we have not implemented anything that could "isolate" TE, especially when "gene" annotations are constantly updating.

2) For differential analysis, we are using a single count table (both genes and TE), and perform DEA together using DESeq2. With that approach, the read count distribution between genes and TE in a sample is less important than the their distribution between samples, so if the read count distribution/dispersion of TE and genes are "similar" from sample to sample, DESeq2 appears to be able to handle the normalization. Furthermore, since we're aggregating on the sub-family (e.g. L1HS) level, the number of TE "entries" is much smaller (~900 for human) than the number of gene entries (~20000), so the dispersion calculation would probably be more affected by gene count distribution.

Hope this answers your questions. Let me know if you have others.

Thanks.

milcs40 commented 2 years ago

Dear Oliver,

It's been a while since we have last been in touch. I realize I didn't thank you for your last answers... Thank you so much! Your answers, knowledge and help are invaluable.

I am coming back to you regarding the makeTEgtf.pl script. As I previously mentioned, I am really a novice in programming and I really have no knowledge on Perl. Nevertheless, I've tried to go through your script and rewrite it in R, as I thought this would be a good exercise/challenge for me and would give me more flexibility in creating my own TE annotations. As such, I created an R function, makeGTF, to take either an UCSC table, or a table directly from repeatmasker, and produce a GTF file. Mind you, this is really "naive" code! :)

I think I followed most of the steps on your script. Nevertheless, when I merge the gencode GTF file and my custom made GTF file and use the merged GTF for featureCounts, I have a very low percentage of successfully assigned alignments (~5%), with the majority of the reads being flagged as "Unassigned_Ambiguity". When I do the same, merging the gencode GTF with the files available in Molly's website, I get ~80% of assigned alignments.

I know this is a bit of a long shot, but I was wondering if you have any idea on why this could be happening?

olivertam commented 2 years ago

Hi,

I was looking at the last line of your code, and it appears to be loading a GTF as the input. Is that intentional, or is it just the naming (so the input is either a UCSC table or RepeatMasker output)? Also, how are you outputting the file? I'd be happy to take a quick look at the output and see if there's an obvious issue.

Thanks

milcs40 commented 2 years ago

Dear Oliver,

Thank you so much for your reply. Indeed, the last line is just an example of how I run the function above (the makeGTF function). You are correct in your interpretation: it's just the naming. The function will take either an UCSC table or the RepeatMasker output. I just wrongly named the UCSC table with a .gtf extension.

Regarding the output, after running the function, I have three new tables: a standard GTF file, a GTF file for analyzing repeat families (basically, changing 'gene_id', in the attributes columns, to 'gene_name', as this is what I'm counting in the gencode GTF) and a GTF for analyzing individual repeat instances (changing the ´transcript_id´ to ´genename´, for the same reason above). In addition, I add a ´TE´ prefix, to allow better visualization and analysis of differentially expressed TEs during my DEA.

In any case, looking at the outputs, I can't figure out why this is happening. Here's the first 10 lines of your GTF (subset for chr1 and sorted by position): mHammellGTF

And here's the first 10 lines of my GTF (sorted similarly): customGTF

I'm also adding the head of the tables I create, in case this can be useful figuring this out. mHammell.hg38.rmsk.mod_head.txt custom.hg38.rmsk.mod_head.txt

Thank you again for all the help and all you provide to the community! Best wishes

olivertam commented 2 years ago

Hi,

Thank you for your response. I think the simplest explanation is that you still have to keep the gene_id and transcript_id fields in the attributes, and just add gene_name as an additional field. My suspicion is that even for algorithms like featureCounts, they are still expecting the standard GTF format (which has gene_id and transcript_id), but just uses the gene_name for quantification/output purposes. Let me know if re-adding the gene_id and/or transcript_id back to the GTF files works. If not, I'll be happy to dig deeper.

Thanks.

milcs40 commented 2 years ago

Hello Oliver,

I don't think that's it. When I tested both my GTF, and the one I've downloaded from your site, I changed the attribute in your file, similarly to what I do in my custom file: mHammellGTF_class$V9 <- gsub("gene_id \"", "gene_name \"TE_", as.character(mHammellGTF_class$V9))

After fusing the gencode GTF with either of the two files above (with gene_name instead of gene_id), I get, as I mentioned, ~5% Vs. ~80% of successfully assigned alignments (for my custom GTF Vs. yours).

Thank you so much!

olivertam commented 2 years ago

Hi,

Thank you for your quick response.

Would you mind sharing your modified GTF? I can't see any obvious differences in the first few lines, but I wonder if there's something downstream that was altered.

One thing that I want to check is whether you're using the FASTA file from GENCODE for alignment, or hg38 from UCSC. Although the canonical chromosomes are named the same, the scaffolds and alternate haplotypes in GENCODE follow the Ensembl nomenclature, and thus needs conversion to hg38 (or vice versa). We do have a GTF that uses the GENCODE chromosome nomenclature. I don't think that this is the cause of the issue, but it is something that I think you might want to be aware of if you're looking at non-canonical chromosomes (e.g. chromosomes not starting with chr in GENCODE)

Thanks.

milcs40 commented 2 years ago

Dear Oliver,

In fact, I have been looking at the annotations of the chromosomes. And indeed, I did come across that issue. Nevertheless, even if that is relevant, I think it wouldn't justify having only 5% of successfully assigned alignments, don't you think?

In any case, I'll leave you here the links for the files I am merging:

The GENCODE file I'm using, is the comprehensive gene annotation file: gencode.v39.annotation.gtf
Here is the link to my unmodified GTF file (this is, with the attribute names not changed yet): hg38_rmsk_unmodified.gtf

olivertam commented 2 years ago

Hi,

Unfortunately, I cannot access your GTF file, as it requires access for Google drive.

Thanks.

milcs40 commented 2 years ago

I think you should be able to access it now: hg38_rmsk_unmodified.gtf

Cheers

olivertam commented 2 years ago

Hi,

Sorry for the slow response. I think I might have found the issue. For some genomic coordinates (especially in later chromosomes like chr10), the start position is in scientific notation.

chr10   hg38_rmsk       exon    1.14e+08        114000096       210     -      .gene_id "L3"; transcript_id "L3_dup30251"; family_id "CR1"; class_id "LINE"
chr10   hg38_rmsk       exon    1.16e+08        116000000       282     +      .gene_id "MER33"; transcript_id "MER33_dup5964"; family_id "hAT-Charlie"; class_id "DNA"
chr10   hg38_rmsk       exon    1.16e+08        116000063       341     +      .gene_id "MER33"; transcript_id "MER33_dup5965"; family_id "hAT-Charlie"; class_id "DNA"
chr10   hg38_rmsk       exon    1.3e+08 130000391       1151    +       .      gene_id "L1MD1"; transcript_id "L1MD1_dup4593"; family_id "L1"; class_id "LINE"
chr10   hg38_rmsk       exon    1.32e+08        132000048       2149    +      .gene_id "MER4E1"; transcript_id "MER4E1_dup659"; family_id "ERV1"; class_id "LTR"
chr10   hg38_rmsk       exon    1.32e+08        132000351       2214    +      .gene_id "AluSx4"; transcript_id "AluSx4_dup6674"; family_id "Alu"; class_id "SINE"

As a result, that causes the programs to ignore those annotations, since it can't automatically translate that notation to numeric. Fixing these should help.

Thanks.

milcs40 commented 2 years ago

Hi Oliver,

How did you spot that? Amazing! I am not sure that's the culprit, though. I looked at your gtf files, and for that entry, it seems that the start position is also on the same notation.

Tomorrow, I'm going to redo featureCounts, using the gencode file merged with:

The custom gtf from Molly's lab: hg38_rmsk_TE_20200804.gtf
A gtf file created from the full table of the repeatmasker downloaded from UCSC Table Browser, using your makeTEget.pl script
A gtf file using the same repeatmasker UCSC table, using my R script.

I'll report back on the findings. Sorry to be such an hassle. I guess that, ultimately, this might be helpful/useful for someone more familiar with R language!?

Last question: What's the difference between the different custom gtf files available for download in your lab's resources? Namely, what are the differences between:

hg38_rmsk_TE.gtf
GRCh38_GENCODE_rmsk_TE.gtf
GRCh38_Ensembl_rmsk_TE.gtf

Thank you so much.

olivertam commented 2 years ago

Hi,

How did you open the GTF file? I'm opening it on the command line (less) and see that it is not in scientific notation. I wonder if opening it in R converted it to scientific notation, and thus might have introduced the formatting differences.

The difference between the custom GTF for hg38/GRCh38 is the chromosome nomenclature. hg38 uses chrX for all the chromosomes, including the scaffolds and patches (which have their own nomenclature). Ensembl uses X (without the chr), and the contig names for the scaffolds and patches. GENCODE uses the hg38 format for canonical chromosomes (e.g. chr1), but follows Ensembl for the scaffolds and patches. Thus, depending on which genome build you're using for alignment, you would want to match it with the corresponding GTF file to ensure that you're quantifying all annotations.

Thanks.

milcs40 commented 2 years ago

Dear Oliver,

I think I have figured it out... mostly, thanks to you!

I was opening the TE tables (either GTFs or annotations) with fread. This was fine. The issue had to do with adding +1 to the start positions: it was transforming the data type from integer, to numeric. And when exporting this as a table, the numeric was being saved with scientific notation, which featureCounts wasn't recognizing!

Now, when merging the gencode.v39 GTF with my custom made GTF, and counting for TE classes, I get the results I was expecting:

Load annotation file gencode.v39.Repeatmasker_hg38_rmsk_custom_TEClass ...
   Features : 6350795                                                     
   Meta-features : 61541                                                  
   Chromosomes/contigs : 454                                              

Process BAM file D165T29Aligned.sortedByCoord.out.bam...                  
   Strand specific : reversely stranded                                   
   Paired-end reads are included.                                         
   The reads are assigned on the single-end mode.                         
   Total alignments : 218638341                                           
   Successfully assigned alignments : 160077775 (73.2%)                   
   Running time : 0.13 minutes

Moreover, using your makeTEgtf.pl script, gives me exactly the same results as with my R script. Quite happy! Thank you so much. Wouldn't have been able to do this without your help!

Last question: I used only the main annotation file from Gencode, without non canonical scaffolds and patches (as I used the UCSC hg38 genome, for alignment). I've tried looking for a tool to convert from the Gencode/Ensembl annotation, to the UCSC hg38 format, but didn't find anything very helpful. Do you know of a way to do this?

Thank you so much for your patience and help, that has been invaluable.

olivertam commented 2 years ago

Hi,

Happy to hear that the issue is resolved (you're right that the +1 calculation in R converted the datatype, I should have remembered that). I find the chromAlias.txt from UCSC to be quite useful as the source of nomenclature. For GENCODE, they typically use the alias (column 2) for the scaffolds and patches, so if you want to convert to the hg38 UCSC nomenclature, you can just do something like this (note, there might be errors in this as I haven't actually run the code in R, but hopefully the concept makes sense):

conversion <- fread("hg38.chromAlias.txt")
colnames(conversion) <- c("UCSC","Ensembl","NCBI","Refseq")

 ## Separate GTF into canonical and non-canonical chromosome entries
canonical <- GTF[grep("chr",GTF[,1]),]
noncanonical <- GTF[grep("chr",GTF[,1],invert=T),]

## changing column 1 of the non-canonical GTF entries by matching it to column 2 of the conversion table
## (assuming it's GENCODE/Ensembl), and then selecting column 1 from the conversion table (which is the UCSC id)
noncanonical[,1] <- conversion[match(conversion[,2],noncanonical[,1]),1]

## Recombine the two parts
GTF <- rbind(canonical, noncanonical)

Let me know if you have any questions.

Thanks.

milcs40 commented 2 years ago

Hello,

Thank you so much for all your help and inspiration.

Warmest wishes

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

mhammell-laboratory / TEtranscripts

Creation of custom GTFs for TEs and how are solo Vs. full length TEs dealt with? #103