Create S. pombe transcriptome fasta and gff with full length UTRs

3mma-mack commented 3 years ago

For my project I am wanting to look at how specific genes are regulated, so will likely need gff and fasta files that include full UTRs rather than just a buffer. Pombase has separate fasta files for CDS, 5UTR and 3UTR regions, so I am going to take these and combine them to make new fasta and gff files. Doing this will allow me to look at uOFRs involved in regulation, particularly in genes such as FIL1. However, as the lengths of sequences surrounding the CDS will be inconsistent there may be some issues with the buffer parameter in the config file

3mma-mack commented 3 years ago

First step was to combine the three fasta files for the 5UTRs, cds sequence and 3UTRs into one file which for each gene had the sequences in the order 5UTR-CDS-3UTR. The original plan is shown in the pseudocode below:

# Take the 5UTR, CDS and 3UTR sequences  
# Stick them together 
# Produce a new fasta file that has sequences in the format 5UTR-CDS-3UTR

# inputs 
# S_pombe 5UTR fasta, CDS fasta, 3UTR fasta. 
# these should have sequences for individual genes/exons, labelled with IDs
# Downloaded from PomBase

# Process 
# check that sequences are labelled with IDs
# for loop to add 5UTR to CDS
    # for sequence i in CDS file
    # store the ID of sequence i
    # find the sequence with the matching ID in 5UTRs
         # some may not have 5UTRs, if there isn't then i ++
    # merge the sequences into the format 5UTR-CDS
    # store product in position i of new object 5UTR-CDS
    # return 5UTR-CDS
    # i ++
# repeat above for adding 3UTR to 5UTR-CDS
    # for sequence i in 5UTR-CDS file
    # store the ID of sequence i
    # find the sequence with the matching ID in 3UTRs
          # some may not have 5UTRs, if there isn't then i ++
    # merge the sequences into the format 5UTR-CDS-3UTR
    # store product in position i of new object 5UTR-CDS-3UTR
    # return 5UTR-CDS-3UTR
    # i ++

This was adapted and the fasta file was created and checked with the following code:

library(Biostrings)
library(rtracklayer)
library(stringr)
library(GenomicRanges)
library(parallel)
library(rhdf5)

three_UTR <- readDNAStringSet('3UTR.fa.gz')
CDS <- readDNAStringSet('cds.fa.gz')
five_UTR <- readDNAStringSet('5UTR.fa.gz')
cc <- 1
  fiveutr_CDS_threeutr <- list()
  # loop to add, if present, 5UTRs upstream of CDS sequences and 3UTRs downstream.
  # add names column 
  for (i in unique(names(CDS))){
    fiveutr_CDS_threeutr[[cc]] <- DNAStringSet(paste0(unlist(five_UTR[names(five_UTR)==i]),
        unlist(CDS[names(CDS)==i]), 
        unlist(three_UTR[names(three_UTR)==i])
        ))
    cc <- cc + 1 
  }
  fiveutr_CDS_threeutr <- unlist(DNAStringSetList(fiveutr_CDS_threeutr))
  names(fiveutr_CDS_threeutr) <- unique(names(CDS))

  store <- fiveutr_CDS_threeutr
  # check that the length of a sequence fiveUTR_CDS_seqlist is equal
  # to the combined lengths of seperate CDS and five_utr length for each gene 
  for (j in unique(names(fiveutr_CDS_threeutr))){
      if (width(fiveutr_CDS_threeutr[names(fiveutr_CDS_threeutr)==j])!= 
                                      (sum(width(five_UTR[names(five_UTR)==j])) +
                                          sum(width(CDS[names(CDS)==j])) + 
                                          sum(width(three_UTR[names(three_UTR)==j])))){
          print('Length of full sequence is not the sum of the length of the 3UTR, CDS and 5UTR')
          break
      } 
  }
# names used in gff and fasta have different format, adding '.1' to fasta to make consistent 
names(fiveutr_CDS_threeutr) <- paste(names(fiveutr_CDS_threeutr), '.1', sep = '')
# export fasta files with 3UTR, CDS and 5UTR sequences.  

writeXStringSet(fiveutr_CDS_threeutr,filepath = file.path('.','S_pombe_full_UTR.fasta'),format = "fasta")

3mma-mack commented 3 years ago

The next step was to update the ranges in the gff, using the gff in schizosaccharomyces/annotation as a starting point:

# Make GFF ranges match UTR
# gff, 5UTR, CDS, 3UTR.
# make names consistent 
# for each gene, if type == UTR5, update the range to be 1 to length of 5UTR for that gene
# if type == CDS, update range to be length of 5UTR +1 to length of 5UTR + CDS 
# if type == UTR3, update range to be  length 5UTR+CDS+1 to length of 5UTR+CDS+3UTR
# return updated gff 

genome <- readDNAStringSet('S_pombe_full_UTR.fasta',format = "fasta")
annot <- readGFFAsGRanges('Alex_files/Schizosaccharomyces_pombe_CDS_w_250utrs.gff3')

# make the names compatible to allow manipulation

names(five_UTR) <- paste(names(five_UTR), '.1', sep = '')
names(CDS) <- paste(names(CDS), '.1', sep = '')
names(three_UTR) <- paste(names(three_UTR), '.1', sep = '')

# seperate out the UTRs and CDS to allow processing 

gff_fiveutr <- annot[annot$type == 'UTR5']
gff_cds <- annot[annot$type == 'CDS']
gff_threeutr <- annot[annot$type == 'UTR3']

# Edit the ranges of the 5UTR gff to be 1-width of listed UTR for each gene, if present

  for(i in gff_fiveutr$Name){
    if(i %in% names(five_UTR)){
      ranges(gff_fiveutr[gff_fiveutr$Name == i]) <- IRanges(start = 1, 
                                                          width = width(five_UTR[names(five_UTR)==i]))
      }else{
         start(ranges(gff_fiveutr[gff_fiveutr$Name == i])) <- 0
         width(ranges(gff_fiveutr[gff_fiveutr$Name == i])) <- 1
         }
    }

# repeat for CDS, checking if a 5' UTR is present 

  for(i in gff_cds$Name){
    if(i %in% names(CDS)){
      if(i %in% names(five_UTR)){
        ranges(gff_cds[gff_cds$Name == i]) <- IRanges(start = end(ranges(gff_fiveutr[gff_fiveutr$Name == i])) +1,
                                                     width = width(CDS[names(CDS)==i]))  
         }else{
           ranges(gff_cds[gff_cds$Name == i]) <- IRanges(start = 1, width = width(CDS[names(CDS)==i]))  
            }
       }
    }

  for(i in gff_threeutr$Name){
    if(i %in% names(three_UTR)){
      ranges(gff_threeutr[gff_threeutr$Name == i]) <- IRanges(start = end(ranges(gff_cds[gff_cds$Name == i])) +1,
                                                            width = width(three_UTR[names(three_UTR)==i]))
       }else{
          ranges(gff_threeutr[gff_threeutr$Name == i]) <- IRanges(start = end(ranges(gff_cds[gff_cds$Name == i])), 
                                                            width = 1)
          }
    }

This created three GRanges objects which could be combined in the following ways, which are likely not the best way to combine them but what I could figure out for combining GRanges objects:

Pasted_gff <- GRanges(c(gff_fiveutr, gff_cds, gff_threeutr)) which lists all the 5UTRs, then all the CDSs, then all the 3UTRs

or by using:

c <- 1
total_gff <- GRanges()

for(i in 1:length(gff_cds)){
  total_gff[c]<- gff_fiveutr[i]
  total_gff[(c+1)] <- gff_cds[i]
  total_gff[(c+2)] <- gff_threeutr[i]
  c <- c+3
}

which produced a GRanges object with a repeating pattern of 5UTR of transcript x, CDS of transcript x, and 3UTR of transcript x.

The end GRanges object was exported with export.gff3(Pasted_gff, con=file.path('.','S_pombe_pasted_full_UTR.gff3'))

3mma-mack commented 3 years ago

I tried to check these files on eddie with check_fasta_gff however all runs, including gffs combined with either method metioned above, produced the following error:

(riboviz) [s1734289@node3a03(eddie) riboviz]$ python -m riboviz.tools.check_fasta_gff -f D-Sp_2018/original_genome_gff/S_pombe_full_UTR.fasta -g D-Sp_2018/original_genome_gff/S_pombe_full_UTR_removed_genes_pasted.gff3
Created by: RiboViz
Date: 2021-03-23 18:22:36.945192
Command-line tool: /exports/eddie/scratch/s1734289/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
File: /exports/eddie/scratch/s1734289/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
Version: commit 3e3cd2f1a9f4843170304c596149bbeb8c4e9549 date 2021-01-06 09:46:48+00:00

Checking fasta file D-Sp_2018/original_genome_gff/S_pombe_full_UTR.fasta
with gff file D-Sp_2018/original_genome_gff/S_pombe_full_UTR_removed_genes_pasted.gff3
Traceback (most recent call last):
  File "/exports/csce/eddie/biology/groups/wallace_rna/anaconda/envs/riboviz/lib/python3.7/runpy.py", line 193, i _run_module_as_main
    "__main__", mod_spec)
  File "/exports/csce/eddie/biology/groups/wallace_rna/anaconda/envs/riboviz/lib/python3.7/runpy.py", line 85, in_run_code
    exec(code, run_globals)
  File "/exports/eddie/scratch/s1734289/riboviz/riboviz/riboviz/tools/check_fasta_gff.py", line 57, in <module>
    invoke_check_fasta_gff()
  File "/exports/eddie/scratch/s1734289/riboviz/riboviz/riboviz/tools/check_fasta_gff.py", line 53, in invoke_chek_fasta_gff
    check_fasta_gff.check_fasta_gff(fasta, gff)
  File "/exports/eddie/scratch/s1734289/riboviz/riboviz/riboviz/check_fasta_gff.py", line 35, in check_fasta_gff
    sort_attribute_values=True)
  File "/exports/csce/eddie/biology/groups/wallace_rna/anaconda/envs/riboviz/lib/python3.7/site-packages/gffutilscreate.py", line 1292, in create_db
    c.create()
  File "/exports/csce/eddie/biology/groups/wallace_rna/anaconda/envs/riboviz/lib/python3.7/site-packages/gffutilscreate.py", line 507, in create
    self._populate_from_lines(self.iterator)
  File "/exports/csce/eddie/biology/groups/wallace_rna/anaconda/envs/riboviz/lib/python3.7/site-packages/gffutilscreate.py", line 589, in _populate_from_lines
    self._insert(f, c)
  File "/exports/csce/eddie/biology/groups/wallace_rna/anaconda/envs/riboviz/lib/python3.7/site-packages/gffutilscreate.py", line 530, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.InterfaceError: Error binding parameter 11 - probably unsupported type.

This suggests to me that one or both of the files have been made incorrectly, so troubleshooting is necessary

3mma-mack commented 3 years ago

Following a suggestion from @ewallace I have started creating an alternative version of the code to using tidy data frames instead of a GRanges object

3mma-mack commented 3 years ago

Having looked into the files and running individual genes through check_fasta_gff, it seemed that the problem was due to me setting the range for 5UTRs where no 5UTR was present to 0. To fix this I remade the files so that if the 5UTR was less than 50 or not present, a buffer would be added so all genes would have at least 50 nt upstream of the start codon.

This ran successfully through check_fasta_gff, and though the features_issues file showed a lot of incomplete features, looking at selected genes listed in the features_issues files in SnapGene showed they were as expected, with 1 CDS flanked by a 5'UTR and a 3'UTR. The coding sequence of genes checked matched that listed in Pombase. features_issues_09_04.txt

The files were also used for a run in riboviz. Though the run does fail during GenerateStatsFigs, this happens after the H5 and a few graphs have been produced. The graphs look good, with 3nt periodicity and most reads present being 28-32 nt lenght

3nt_periodicity.pdf read_lengths.pdf startcodon_ribogrid.pdf startcodon_ribogridbar.pdf

The output log for generateStatsFigs is also included. command_log_GenerateStatsFigs.txt

3mma-mack commented 3 years ago

In case anyone needs to check or change how the annotation files with the full UTR or 50 nt flank were made, the code used to create them can be found here.

3mma-mack commented 3 years ago

dos2unix has been used to prevent issues using files generated by windows in a Linux environment. I am going to try running these files through dos2unix as this may remove problems seen in features_issues_09_04.txt, and the files may then be ready to merge into main

riboviz / example-datasets

Create S. pombe transcriptome fasta and gff with full length UTRs #70