Matrix-based motifs for other species

Tato14 commented 5 years ago

Hi! I was wondering if you plan to include PWM for other species in order to run the analysis. I am very interested in mouse :-P Thanks!

kkrismer commented 5 years ago

Hi!

If you have PWMs for mouse RBPs, you can use them with both the website (https://transite.mit.edu) and the R / Bioconductor package. The website supports mouse RefSeq identifiers and MGI symbols.

If you don't have mouse-specific PWMs, you can use the human PWMs in the Transite database on mouse transcripts. Mouse and human RBP motifs seem to be fairly similar.

I'm not planning to add new motifs to the permanent Transite database in the new future, unless you have a set of motifs that you would like me to add.

Cheers, Konstantin

ShajahanAnver commented 4 years ago

Regarding analysing data from other species: RefSeq identifiers : Would I be able to use fission yeast if I have PWMs?

Thanks Shan

kkrismer commented 4 years ago

See #2

shrishdwivedi commented 1 year ago

Hi! Thanks for your work on this package. My apologies for re-opening an old issue, but I was using Transite and ran into this same issue (wanting to use mouse motifs for mouse transcripts in R).

My thinking was that I would build my own motifs object in R using mouse motifs, and that I would create each motif using the provided R functions create_matrix_motif or create_k-mer_motif. However, I noticed that any motif created using the create_k-mer_motif function does not have an associated PWM stored (which makes sense, as the PWM is not an argument to the function).

My question is that if I want to use run_kmer_tsma using my own motif object of custom mouse motifs, will there be an issue if the motifs are made using create_k-mer_motif? All I need are the k-mer enrichment results and the motif enrichment results. I am looking through the package and literature to answer this myself, but wanted to post in case you had any guidance as well.

Thank you very much.

kkrismer commented 1 year ago

Hi! That's correct! Just make sure all the k-mers you provide to create_kmer_motif are hexamers. Your custom motifs created by create_kmer_motif are compatible with both run_kmer_tsma and run_kmer_spma. It's true that the matrix property of motifs created by create_kmer_motif will not be set, but the position weight matrix is only used by run_matrix_tsma and run_matrix_spma.

By the way, the website also supports custom motifs, but only one at a time (both position weight matrices and lists of hexamers).

shrishdwivedi commented 1 year ago

That sounds great! Thank you for your reply. I am looking to utilize the R package as it will allow me to integrate Transite better into my workflow.

If I may, it would be great to confirm if the following process looks correct. If was to build my own motifs object using custom mouse motifs (using create_kmer_motif for each), it appears I need to follow these steps:

1. Decide which motifs to extract from either CISBP-RNA or RBPDB (or both)

For this, I noticed you appeared to utilize both CISBP-RNA and RBPDB to generate 174 sequence motifs. I haven't been able to determine quite yet how the motif selection was made from the paper methods, but if you have any recommendations, happy to follow.

3. For each motif, generate a list of k-mers

Run generate_kmers_from_iupac for each motif (using IUPAC extracted from the database)
I tested generate_kmers_from_iupac using IUPAC codes for some of the existing motifs in the Transite database, and the corresponding hexamers that were generated were the same as those already stored in the motifs object. From the paper, it appears that the k-mers were generated using the PWM's instead (threshold 0.215), and I just wanted to clarify if I use the hexamers created from generate_kmers_from_iupac, the result will be the same. Otherwise, I believe I will need to extract the PFMs/PPMs from the databases and convert to PWMs first to then extract the k-mers using the same method.

4. For each motif, use the generated list of k-mers to create a custom motif (create_kmer_motif)

5. Build 'mouse motifs' object using the custom motifs

Thank you again for the clarification and help. Please excuse any trivial questions on my part - I am still reading the documentation!

kkrismer commented 1 year ago

I haven't been able to determine quite yet how the motif selection was made from the paper methods, but if you have any recommendations, happy to follow. We removed motifs from CISBP-RNA because they also appeared in the RBPDB.

If your motif source is RBPDB or CISBP-RNA, you can also use their position frequency/weight matrices and import those directly (which will also set hexamers and heptamers), instead of generating hexamers yourself from the IUPAC sequence. I'd only set hexamers yourself if you have a specific set of hexamers you wanna use that's different from the hexamers that can be derived from the PWM.

If you wanna follow the same procedure we used from RBPDB and CISBP-RNA exports to Transite motifs, this is the code we used to initialize the Transite motifs:

library(tools)
library(dplyr)
library(stringr)
library(Biostrings)
library(transite)

init_motifs <- function() {
  message("processing CISBP-RNA motifs")

  cisbp_rna_dir <-
    "data-raw/CISBP-RNA_Homo_sapiens_2015_01_29_2-45_pm/"
  cisbp_rna_metainfo_file <-
    paste0(cisbp_rna_dir, "RBP_Information_all_motifs.txt")
  cisbp_rna_logo_dir <- paste0(cisbp_rna_dir, "logos_all_motifs/")
  cisbp_rna_motif_dir <- paste0(cisbp_rna_dir, "pwms_all_motifs/")
  cisbp_rna_motifs <-
    init_cisbp_rna_motifs(
      cisbp_rna_metainfo_file,
      cisbp_rna_logo_dir,
      cisbp_rna_motif_dir
    )

  message("processing RBPDB motifs")

  rbpdb_dir <- "data-raw/RBPDB_v1.3.1_human_2012-11-21_CSV/"
  rbpdb_metainfo_proteins_file <-
    paste0(rbpdb_dir, "RBPDB_v1.3.1_proteins_human_2012-11-21.csv")
  rbpdb_metainfo_experiments_file <-
    paste0(
      rbpdb_dir,
      "RBPDB_v1.3.1_experiments_human_2012-11-21.csv"
    )
  rbpdb_metainfo_mapping_file <-
    paste0(rbpdb_dir, "RBPDB_v1.3.1_protExp_human_2012-11-21.csv")
  rbpdb_motif_dir <- paste0(rbpdb_dir, "PFMDir/")
  rbpdb_motifs <-
    init_rbpdb_motifs(
      rbpdb_metainfo_proteins_file,
      rbpdb_metainfo_experiments_file,
      rbpdb_metainfo_mapping_file,
      rbpdb_motif_dir
    )

  motifs <- c(cisbp_rna_motifs, rbpdb_motifs)

  # remove all logos without RBP labels or length < 6
  idx <- sapply(motifs, function(motif) {
    return(!is.null(get_rbps(motif)) &&
      stringr::str_trim(get_rbps(motif)) != "" && get_width(motif) > 5)
  })
  motifs <- motifs[idx]

  # remove NULL entries
  motifs <- motifs[!sapply(motifs, is.null)]

  save(motifs, file = "data/motifs.rda", compress = "xz")
  invisible(motifs)
}

#' parsing CISBP-RNA motifs
init_cisbp_rna_motifs <-
  function(cisbp_rna_metainfo_file,
             cisbp_rna_logo_dir,
             cisbp_rna_motif_dir) {
    codes_map <- init_iupac_lookup_table()

    cisbp_rna_metainfo <-
      read.table(
        cisbp_rna_metainfo_file,
        sep = "\t",
        header = TRUE,
        stringsAsFactors = FALSE
      )
    cisbp_rna_motif_files <-
      list.files(
        path = cisbp_rna_motif_dir,
        full.names = TRUE,
        pattern = "txt"
      )

    valid_files <-
      apply(as.matrix(cisbp_rna_motif_files), 1, function(x) {
        file.info(x)$size > 0
      })
    cisbp_rna_motif_files <- cisbp_rna_motif_files[valid_files]
    counter <- 0
    cisbp_rna_motifs <-
      apply(as.matrix(cisbp_rna_motif_files), 1, function(file) {
        id <- as.character(basename(tools::file_path_sans_ext(file)))
        rbp_info <- filter(cisbp_rna_metainfo, Motif_ID == id)
        rbps <- stringr::str_trim(rbp_info$RBP_Name)
        species <-
          stringr::str_trim(gsub("_", " ", rbp_info$RBP_Species[1], fixed = TRUE))
        type <- stringr::str_trim(rbp_info$Motif_Type[1])
        src <- stringr::str_trim(rbp_info$MSource_Identifier[1])
        if (length(rbp_info$MSource_Version) > 0 &&
          !is.null(rbp_info$MSource_Version[1]) &&
          rbp_info$MSource_Version[1] != "" &&
          rbp_info$MSource_Version[1] != "NULL") {
          src <- paste(src, stringr::str_trim(rbp_info$MSource_Version[1]))
        }

        matrix <- read.table(file, header = TRUE, sep = "\t")
        matrix <- matrix[, !(names(matrix) == "Pos")]

        iupac <- generate_iupac_by_matrix(matrix, code = codes_map)

        # check whether zero probabilities are present
        if (!(min(matrix) > 0)) {
          # convert probabilities (PPM) to frequencies (PFM) - assuming smallest probability is count 1
          matrix <- matrix / min(matrix[matrix > 0])

          # Laplace smoothing / additive smoothing to avoid zeros
          matrix <- matrix + 0.25

          # convert absolute frequencies (PFM) to relative frequencies (PPM)
          matrix <- matrix / rowSums(matrix)
        }

        # converting relative frequencies (PPM) to weights (PWM)
        # assuming equiprobable nucleotides -> a priori probabilities: 0.25
        matrix <- log2(matrix / 0.25)

        length <- nrow(matrix)

        hexamers <- generate_kmers_from_iupac(iupac, 6)

        heptamers <- generate_kmers_from_iupac(iupac, 7)

        counter <<- counter + 1
        message(paste0("processed ", counter, " PSSMs"))
        return(new("RBPMotif", id = id, rbps = rbps, matrix = matrix,
                   hexamers = hexamers, heptamers = heptamers,
                   length = as.integer(length), iupac = iupac,
                   type = type, species = species, src = src))
      })

    # remove redundant motifs (motifs also present in RBPDB)
    redundant_motif_ids <-
      c(
        "M269_0.6",
        "M296_0.6",
        "M349_0.6",
        "M353_0.6",
        "M273_0.6",
        "M331_0.6",
        "M352_0.6",
        "M318_0.6",
        "M319_0.6",
        "M256_0.6",
        "M245_0.6",
        "M325_0.6",
        "M346_0.6",
        "M323_0.6",
        "M274_0.6",
        "M345_0.6",
        "M291_0.6",
        "M354_0.6",
        "M290_0.6",
        "M328_0.6",
        "M272_0.6",
        "M347_0.6",
        "M262_0.6",
        "M271_0.6",
        "M298_0.6",
        "M332_0.6",
        "M350_0.6",
        "M351_0.6",
        "M348_0.6",
        "M317_0.6",
        "M329_0.6",
        "M330_0.6"
      )

    non_redundant_idx <-
      sapply(cisbp_rna_motifs, function(motif) {
        return(!(get_id(motif) %in% redundant_motif_ids))
      })

    cisbp_rna_motifs <- cisbp_rna_motifs[non_redundant_idx]

    return(cisbp_rna_motifs)
  }

#' parsing RBPDB motifs
init_rbpdb_motifs <-
  function(rbpdb_metainfo_proteins_file,
             rbpdb_metainfo_experiments_file,
             rbpdb_metainfo_mapping_file,
             rbpdb_motif_dir,
             rbpdb_version = "v1.3.1") {
    codes_map <- init_iupac_lookup_table()

    rbpdb_metainfo_proteins <-
      read.table(
        rbpdb_metainfo_proteins_file,
        sep = ",",
        header = FALSE,
        stringsAsFactors = FALSE
      )
    rbpdb_metainfo_experiments <-
      read.table(
        rbpdb_metainfo_experiments_file,
        sep = ",",
        header = FALSE,
        stringsAsFactors = FALSE
      )
    rbpdb_metainfo_mapping <-
      read.table(
        rbpdb_metainfo_mapping_file,
        sep = ",",
        header = FALSE,
        stringsAsFactors = FALSE
      )
    rbpdb_motif_files <-
      list.files(
        path = rbpdb_motif_dir,
        full.names = TRUE,
        pattern = "pfm"
      )

    colnames(rbpdb_metainfo_proteins) <-
      c(
        "uniqueId",
        "annotId",
        "createDate",
        "updateDate",
        "geneName",
        "geneDesc",
        "species",
        "taxID",
        "domains",
        "flag",
        "flagNote",
        "aliases",
        "PDBIDs"
      )
    colnames(rbpdb_metainfo_experiments) <-
      c(
        "uniqueId",
        "expId",
        "exptype",
        "notes",
        "sequence_motif",
        "SELEX_file",
        "aligned_SELEX_file",
        "aligned_motif_file",
        "logo_file",
        "PWM_file",
        "PFM_file",
        "invivo_notes",
        "invivo_file",
        "secondary_structure",
        "flag"
      )
    colnames(rbpdb_metainfo_mapping) <-
      c("uniqueId", "proteinId", "expId", "homolog")

    valid_files <-
      apply(as.matrix(rbpdb_motif_files), 1, function(x) {
        file.info(x)$size > 0
      })
    rbpdb_motif_files <- rbpdb_motif_files[valid_files]

    counter <- 0
    rbpdb_motifs <-
      apply(as.matrix(rbpdb_motif_files), 1, function(file) {
        id <- as.character(basename(tools::file_path_sans_ext(file)))
        protein_id <-
          unlist(strsplit(id, "_", fixed = TRUE))[1]
        exp_id <- unlist(strsplit(id, "_", fixed = TRUE))[2]
        mapping <-
          filter(rbpdb_metainfo_mapping, proteinId == protein_id)
        if (nrow(mapping) != 1) {
          warning(paste0("protein id mapping not unique for id: ", protein_id))
        }
        unique_id <- stringr::str_trim(mapping$uniqueId[1])
        proteins_metainfo <-
          filter(rbpdb_metainfo_proteins, uniqueId == unique_id)
        if (nrow(mapping) != 1) {
          warning(paste0("unique id mapping not unique for id: ", unique_id))
        }

        experiments_metainfo <-
          filter(rbpdb_metainfo_experiments, expId == exp_id)
        if (nrow(mapping) != 1) {
          warning(paste0("exp id mapping not unique for id: ", exp_id))
        }

        rbps <- stringr::str_trim(proteins_metainfo$geneName[1])
        species <- stringr::str_trim(proteins_metainfo$species[1])
        type <- stringr::str_trim(experiments_metainfo$exptype[1])
        src <- paste("RBPDB", rbpdb_version)

        matrix <- read.table(file, header = FALSE)
        matrix <- t(matrix)

        # convert PPM to PFM if matrix is PPM
        if (!all(matrix %% 1 == 0)) {
          # PPM
          # convert probabilities (PPM) to frequencies (PFM) - assuming smallest probability is count 1
          matrix <- matrix / min(matrix[matrix > 0])
        }

        # check whether zero probabilities are present
        if (!(min(matrix) > 0)) {
          # Laplace smoothing / additive smoothing to avoid zeros
          matrix <- matrix + 0.25
        }

        # convert absolute frequencies to relative frequencies
        matrix <- matrix / rowSums(matrix)

        matrix <- data.frame(matrix)
        colnames(matrix) <- c("A", "C", "G", "U")
        rownames(matrix) <- 1:nrow(matrix)

        iupac <- generate_iupac_by_matrix(matrix, code = codes_map)

        # convert relative frequencies (PFM) to weights (PWM)
        # assuming equiprobable nucleotides -> a priori probabilities: 0.25
        matrix <- log2(matrix / 0.25)

        length <- nrow(matrix)

        hexamers <- generate_kmers_from_iupac(iupac, 6)

        heptamers <- generate_kmers_from_iupac(iupac, 7)

        counter <<- counter + 1
        message(paste0("processed ", counter, " PSSMs"))
        return(new("RBPMotif", id = id, rbps = rbps, matrix = matrix,
                   hexamers = hexamers, heptamers = heptamers,
                   length = as.integer(length), iupac = iupac,
                   type = type, species = species, src = src))
      })

    return(rbpdb_motifs)
  }

init_motifs()

shrishdwivedi commented 1 year ago

Hi! My apologies for the delayed response, I had missed this message. This is excellent - I had switched to using PPMs / PWMs, and my motif database is CISBP-RNA. This script will be great to validate that the object build was correct. Thanks again for your clarification and guidance!

kkrismer / transite

Matrix-based motifs for other species #1