kkrismer / transite

RNA-binding protein motif analysis
https://transite.mit.edu
MIT License
10 stars 1 forks source link

Matrix-based motifs for other species #1

Closed Tato14 closed 4 years ago

Tato14 commented 5 years ago

Hi! I was wondering if you plan to include PWM for other species in order to run the analysis. I am very interested in mouse :-P Thanks!

kkrismer commented 5 years ago

Hi!

If you have PWMs for mouse RBPs, you can use them with both the website (https://transite.mit.edu) and the R / Bioconductor package. The website supports mouse RefSeq identifiers and MGI symbols.

If you don't have mouse-specific PWMs, you can use the human PWMs in the Transite database on mouse transcripts. Mouse and human RBP motifs seem to be fairly similar.

I'm not planning to add new motifs to the permanent Transite database in the new future, unless you have a set of motifs that you would like me to add.

Cheers, Konstantin

ShajahanAnver commented 4 years ago

Regarding analysing data from other species: RefSeq identifiers : Would I be able to use fission yeast if I have PWMs?

Thanks Shan

kkrismer commented 4 years ago

See #2

shrishdwivedi commented 1 year ago

Hi! Thanks for your work on this package. My apologies for re-opening an old issue, but I was using Transite and ran into this same issue (wanting to use mouse motifs for mouse transcripts in R).

My thinking was that I would build my own motifs object in R using mouse motifs, and that I would create each motif using the provided R functions create_matrix_motif or create_k-mer_motif. However, I noticed that any motif created using the create_k-mer_motif function does not have an associated PWM stored (which makes sense, as the PWM is not an argument to the function).

My question is that if I want to use run_kmer_tsma using my own motif object of custom mouse motifs, will there be an issue if the motifs are made using create_k-mer_motif? All I need are the k-mer enrichment results and the motif enrichment results. I am looking through the package and literature to answer this myself, but wanted to post in case you had any guidance as well.

Thank you very much.

kkrismer commented 1 year ago

Hi! That's correct! Just make sure all the k-mers you provide to create_kmer_motif are hexamers. Your custom motifs created by create_kmer_motif are compatible with both run_kmer_tsma and run_kmer_spma. It's true that the matrix property of motifs created by create_kmer_motif will not be set, but the position weight matrix is only used by run_matrix_tsma and run_matrix_spma.

By the way, the website also supports custom motifs, but only one at a time (both position weight matrices and lists of hexamers).

shrishdwivedi commented 1 year ago

That sounds great! Thank you for your reply. I am looking to utilize the R package as it will allow me to integrate Transite better into my workflow.

If I may, it would be great to confirm if the following process looks correct. If was to build my own motifs object using custom mouse motifs (using create_kmer_motif for each), it appears I need to follow these steps:

1. Decide which motifs to extract from either CISBP-RNA or RBPDB (or both)

3. For each motif, generate a list of k-mers

4. For each motif, use the generated list of k-mers to create a custom motif (create_kmer_motif)

5. Build 'mouse motifs' object using the custom motifs

Thank you again for the clarification and help. Please excuse any trivial questions on my part - I am still reading the documentation!

kkrismer commented 1 year ago

I haven't been able to determine quite yet how the motif selection was made from the paper methods, but if you have any recommendations, happy to follow. We removed motifs from CISBP-RNA because they also appeared in the RBPDB.

If your motif source is RBPDB or CISBP-RNA, you can also use their position frequency/weight matrices and import those directly (which will also set hexamers and heptamers), instead of generating hexamers yourself from the IUPAC sequence. I'd only set hexamers yourself if you have a specific set of hexamers you wanna use that's different from the hexamers that can be derived from the PWM.

If you wanna follow the same procedure we used from RBPDB and CISBP-RNA exports to Transite motifs, this is the code we used to initialize the Transite motifs:

library(tools)
library(dplyr)
library(stringr)
library(Biostrings)
library(transite)

init_motifs <- function() {
  message("processing CISBP-RNA motifs")

  cisbp_rna_dir <-
    "data-raw/CISBP-RNA_Homo_sapiens_2015_01_29_2-45_pm/"
  cisbp_rna_metainfo_file <-
    paste0(cisbp_rna_dir, "RBP_Information_all_motifs.txt")
  cisbp_rna_logo_dir <- paste0(cisbp_rna_dir, "logos_all_motifs/")
  cisbp_rna_motif_dir <- paste0(cisbp_rna_dir, "pwms_all_motifs/")
  cisbp_rna_motifs <-
    init_cisbp_rna_motifs(
      cisbp_rna_metainfo_file,
      cisbp_rna_logo_dir,
      cisbp_rna_motif_dir
    )

  message("processing RBPDB motifs")

  rbpdb_dir <- "data-raw/RBPDB_v1.3.1_human_2012-11-21_CSV/"
  rbpdb_metainfo_proteins_file <-
    paste0(rbpdb_dir, "RBPDB_v1.3.1_proteins_human_2012-11-21.csv")
  rbpdb_metainfo_experiments_file <-
    paste0(
      rbpdb_dir,
      "RBPDB_v1.3.1_experiments_human_2012-11-21.csv"
    )
  rbpdb_metainfo_mapping_file <-
    paste0(rbpdb_dir, "RBPDB_v1.3.1_protExp_human_2012-11-21.csv")
  rbpdb_motif_dir <- paste0(rbpdb_dir, "PFMDir/")
  rbpdb_motifs <-
    init_rbpdb_motifs(
      rbpdb_metainfo_proteins_file,
      rbpdb_metainfo_experiments_file,
      rbpdb_metainfo_mapping_file,
      rbpdb_motif_dir
    )

  motifs <- c(cisbp_rna_motifs, rbpdb_motifs)

  # remove all logos without RBP labels or length < 6
  idx <- sapply(motifs, function(motif) {
    return(!is.null(get_rbps(motif)) &&
      stringr::str_trim(get_rbps(motif)) != "" && get_width(motif) > 5)
  })
  motifs <- motifs[idx]

  # remove NULL entries
  motifs <- motifs[!sapply(motifs, is.null)]

  save(motifs, file = "data/motifs.rda", compress = "xz")
  invisible(motifs)
}

#' parsing CISBP-RNA motifs
init_cisbp_rna_motifs <-
  function(cisbp_rna_metainfo_file,
             cisbp_rna_logo_dir,
             cisbp_rna_motif_dir) {
    codes_map <- init_iupac_lookup_table()

    cisbp_rna_metainfo <-
      read.table(
        cisbp_rna_metainfo_file,
        sep = "\t",
        header = TRUE,
        stringsAsFactors = FALSE
      )
    cisbp_rna_motif_files <-
      list.files(
        path = cisbp_rna_motif_dir,
        full.names = TRUE,
        pattern = "txt"
      )

    valid_files <-
      apply(as.matrix(cisbp_rna_motif_files), 1, function(x) {
        file.info(x)$size > 0
      })
    cisbp_rna_motif_files <- cisbp_rna_motif_files[valid_files]
    counter <- 0
    cisbp_rna_motifs <-
      apply(as.matrix(cisbp_rna_motif_files), 1, function(file) {
        id <- as.character(basename(tools::file_path_sans_ext(file)))
        rbp_info <- filter(cisbp_rna_metainfo, Motif_ID == id)
        rbps <- stringr::str_trim(rbp_info$RBP_Name)
        species <-
          stringr::str_trim(gsub("_", " ", rbp_info$RBP_Species[1], fixed = TRUE))
        type <- stringr::str_trim(rbp_info$Motif_Type[1])
        src <- stringr::str_trim(rbp_info$MSource_Identifier[1])
        if (length(rbp_info$MSource_Version) > 0 &&
          !is.null(rbp_info$MSource_Version[1]) &&
          rbp_info$MSource_Version[1] != "" &&
          rbp_info$MSource_Version[1] != "NULL") {
          src <- paste(src, stringr::str_trim(rbp_info$MSource_Version[1]))
        }

        matrix <- read.table(file, header = TRUE, sep = "\t")
        matrix <- matrix[, !(names(matrix) == "Pos")]

        iupac <- generate_iupac_by_matrix(matrix, code = codes_map)

        # check whether zero probabilities are present
        if (!(min(matrix) > 0)) {
          # convert probabilities (PPM) to frequencies (PFM) - assuming smallest probability is count 1
          matrix <- matrix / min(matrix[matrix > 0])

          # Laplace smoothing / additive smoothing to avoid zeros
          matrix <- matrix + 0.25

          # convert absolute frequencies (PFM) to relative frequencies (PPM)
          matrix <- matrix / rowSums(matrix)
        }

        # converting relative frequencies (PPM) to weights (PWM)
        # assuming equiprobable nucleotides -> a priori probabilities: 0.25
        matrix <- log2(matrix / 0.25)

        length <- nrow(matrix)

        hexamers <- generate_kmers_from_iupac(iupac, 6)

        heptamers <- generate_kmers_from_iupac(iupac, 7)

        counter <<- counter + 1
        message(paste0("processed ", counter, " PSSMs"))
        return(new("RBPMotif", id = id, rbps = rbps, matrix = matrix,
                   hexamers = hexamers, heptamers = heptamers,
                   length = as.integer(length), iupac = iupac,
                   type = type, species = species, src = src))
      })

    # remove redundant motifs (motifs also present in RBPDB)
    redundant_motif_ids <-
      c(
        "M269_0.6",
        "M296_0.6",
        "M349_0.6",
        "M353_0.6",
        "M273_0.6",
        "M331_0.6",
        "M352_0.6",
        "M318_0.6",
        "M319_0.6",
        "M256_0.6",
        "M245_0.6",
        "M325_0.6",
        "M346_0.6",
        "M323_0.6",
        "M274_0.6",
        "M345_0.6",
        "M291_0.6",
        "M354_0.6",
        "M290_0.6",
        "M328_0.6",
        "M272_0.6",
        "M347_0.6",
        "M262_0.6",
        "M271_0.6",
        "M298_0.6",
        "M332_0.6",
        "M350_0.6",
        "M351_0.6",
        "M348_0.6",
        "M317_0.6",
        "M329_0.6",
        "M330_0.6"
      )

    non_redundant_idx <-
      sapply(cisbp_rna_motifs, function(motif) {
        return(!(get_id(motif) %in% redundant_motif_ids))
      })

    cisbp_rna_motifs <- cisbp_rna_motifs[non_redundant_idx]

    return(cisbp_rna_motifs)
  }

#' parsing RBPDB motifs
init_rbpdb_motifs <-
  function(rbpdb_metainfo_proteins_file,
             rbpdb_metainfo_experiments_file,
             rbpdb_metainfo_mapping_file,
             rbpdb_motif_dir,
             rbpdb_version = "v1.3.1") {
    codes_map <- init_iupac_lookup_table()

    rbpdb_metainfo_proteins <-
      read.table(
        rbpdb_metainfo_proteins_file,
        sep = ",",
        header = FALSE,
        stringsAsFactors = FALSE
      )
    rbpdb_metainfo_experiments <-
      read.table(
        rbpdb_metainfo_experiments_file,
        sep = ",",
        header = FALSE,
        stringsAsFactors = FALSE
      )
    rbpdb_metainfo_mapping <-
      read.table(
        rbpdb_metainfo_mapping_file,
        sep = ",",
        header = FALSE,
        stringsAsFactors = FALSE
      )
    rbpdb_motif_files <-
      list.files(
        path = rbpdb_motif_dir,
        full.names = TRUE,
        pattern = "pfm"
      )

    colnames(rbpdb_metainfo_proteins) <-
      c(
        "uniqueId",
        "annotId",
        "createDate",
        "updateDate",
        "geneName",
        "geneDesc",
        "species",
        "taxID",
        "domains",
        "flag",
        "flagNote",
        "aliases",
        "PDBIDs"
      )
    colnames(rbpdb_metainfo_experiments) <-
      c(
        "uniqueId",
        "expId",
        "exptype",
        "notes",
        "sequence_motif",
        "SELEX_file",
        "aligned_SELEX_file",
        "aligned_motif_file",
        "logo_file",
        "PWM_file",
        "PFM_file",
        "invivo_notes",
        "invivo_file",
        "secondary_structure",
        "flag"
      )
    colnames(rbpdb_metainfo_mapping) <-
      c("uniqueId", "proteinId", "expId", "homolog")

    valid_files <-
      apply(as.matrix(rbpdb_motif_files), 1, function(x) {
        file.info(x)$size > 0
      })
    rbpdb_motif_files <- rbpdb_motif_files[valid_files]

    counter <- 0
    rbpdb_motifs <-
      apply(as.matrix(rbpdb_motif_files), 1, function(file) {
        id <- as.character(basename(tools::file_path_sans_ext(file)))
        protein_id <-
          unlist(strsplit(id, "_", fixed = TRUE))[1]
        exp_id <- unlist(strsplit(id, "_", fixed = TRUE))[2]
        mapping <-
          filter(rbpdb_metainfo_mapping, proteinId == protein_id)
        if (nrow(mapping) != 1) {
          warning(paste0("protein id mapping not unique for id: ", protein_id))
        }
        unique_id <- stringr::str_trim(mapping$uniqueId[1])
        proteins_metainfo <-
          filter(rbpdb_metainfo_proteins, uniqueId == unique_id)
        if (nrow(mapping) != 1) {
          warning(paste0("unique id mapping not unique for id: ", unique_id))
        }

        experiments_metainfo <-
          filter(rbpdb_metainfo_experiments, expId == exp_id)
        if (nrow(mapping) != 1) {
          warning(paste0("exp id mapping not unique for id: ", exp_id))
        }

        rbps <- stringr::str_trim(proteins_metainfo$geneName[1])
        species <- stringr::str_trim(proteins_metainfo$species[1])
        type <- stringr::str_trim(experiments_metainfo$exptype[1])
        src <- paste("RBPDB", rbpdb_version)

        matrix <- read.table(file, header = FALSE)
        matrix <- t(matrix)

        # convert PPM to PFM if matrix is PPM
        if (!all(matrix %% 1 == 0)) {
          # PPM
          # convert probabilities (PPM) to frequencies (PFM) - assuming smallest probability is count 1
          matrix <- matrix / min(matrix[matrix > 0])
        }

        # check whether zero probabilities are present
        if (!(min(matrix) > 0)) {
          # Laplace smoothing / additive smoothing to avoid zeros
          matrix <- matrix + 0.25
        }

        # convert absolute frequencies to relative frequencies
        matrix <- matrix / rowSums(matrix)

        matrix <- data.frame(matrix)
        colnames(matrix) <- c("A", "C", "G", "U")
        rownames(matrix) <- 1:nrow(matrix)

        iupac <- generate_iupac_by_matrix(matrix, code = codes_map)

        # convert relative frequencies (PFM) to weights (PWM)
        # assuming equiprobable nucleotides -> a priori probabilities: 0.25
        matrix <- log2(matrix / 0.25)

        length <- nrow(matrix)

        hexamers <- generate_kmers_from_iupac(iupac, 6)

        heptamers <- generate_kmers_from_iupac(iupac, 7)

        counter <<- counter + 1
        message(paste0("processed ", counter, " PSSMs"))
        return(new("RBPMotif", id = id, rbps = rbps, matrix = matrix,
                   hexamers = hexamers, heptamers = heptamers,
                   length = as.integer(length), iupac = iupac,
                   type = type, species = species, src = src))
      })

    return(rbpdb_motifs)
  }

init_motifs()
shrishdwivedi commented 1 year ago

Hi! My apologies for the delayed response, I had missed this message. This is excellent - I had switched to using PPMs / PWMs, and my motif database is CISBP-RNA. This script will be great to validate that the object build was correct. Thanks again for your clarification and guidance!