Closed Tato14 closed 4 years ago
Hi!
If you have PWMs for mouse RBPs, you can use them with both the website (https://transite.mit.edu) and the R / Bioconductor package. The website supports mouse RefSeq identifiers and MGI symbols.
If you don't have mouse-specific PWMs, you can use the human PWMs in the Transite database on mouse transcripts. Mouse and human RBP motifs seem to be fairly similar.
I'm not planning to add new motifs to the permanent Transite database in the new future, unless you have a set of motifs that you would like me to add.
Cheers, Konstantin
Regarding analysing data from other species: RefSeq identifiers : Would I be able to use fission yeast if I have PWMs?
Thanks Shan
See #2
Hi! Thanks for your work on this package. My apologies for re-opening an old issue, but I was using Transite and ran into this same issue (wanting to use mouse motifs for mouse transcripts in R).
My thinking was that I would build my own motifs object in R using mouse motifs, and that I would create each motif using the provided R functions create_matrix_motif or create_k-mer_motif. However, I noticed that any motif created using the create_k-mer_motif function does not have an associated PWM stored (which makes sense, as the PWM is not an argument to the function).
My question is that if I want to use run_kmer_tsma using my own motif object of custom mouse motifs, will there be an issue if the motifs are made using create_k-mer_motif? All I need are the k-mer enrichment results and the motif enrichment results. I am looking through the package and literature to answer this myself, but wanted to post in case you had any guidance as well.
Thank you very much.
Hi!
That's correct! Just make sure all the k-mers you provide to create_kmer_motif
are hexamers.
Your custom motifs created by create_kmer_motif
are compatible with both run_kmer_tsma
and run_kmer_spma
. It's true that the matrix
property of motifs created by create_kmer_motif
will not be set, but the position weight matrix is only used by run_matrix_tsma
and run_matrix_spma
.
By the way, the website also supports custom motifs, but only one at a time (both position weight matrices and lists of hexamers).
That sounds great! Thank you for your reply. I am looking to utilize the R package as it will allow me to integrate Transite better into my workflow.
If I may, it would be great to confirm if the following process looks correct. If was to build my own motifs object using custom mouse motifs (using create_kmer_motif for each), it appears I need to follow these steps:
1. Decide which motifs to extract from either CISBP-RNA or RBPDB (or both)
3. For each motif, generate a list of k-mers
4. For each motif, use the generated list of k-mers to create a custom motif (create_kmer_motif)
5. Build 'mouse motifs' object using the custom motifs
Thank you again for the clarification and help. Please excuse any trivial questions on my part - I am still reading the documentation!
I haven't been able to determine quite yet how the motif selection was made from the paper methods, but if you have any recommendations, happy to follow. We removed motifs from CISBP-RNA because they also appeared in the RBPDB.
If your motif source is RBPDB or CISBP-RNA, you can also use their position frequency/weight matrices and import those directly (which will also set hexamers and heptamers), instead of generating hexamers yourself from the IUPAC sequence. I'd only set hexamers yourself if you have a specific set of hexamers you wanna use that's different from the hexamers that can be derived from the PWM.
If you wanna follow the same procedure we used from RBPDB and CISBP-RNA exports to Transite motifs, this is the code we used to initialize the Transite motifs:
library(tools)
library(dplyr)
library(stringr)
library(Biostrings)
library(transite)
init_motifs <- function() {
message("processing CISBP-RNA motifs")
cisbp_rna_dir <-
"data-raw/CISBP-RNA_Homo_sapiens_2015_01_29_2-45_pm/"
cisbp_rna_metainfo_file <-
paste0(cisbp_rna_dir, "RBP_Information_all_motifs.txt")
cisbp_rna_logo_dir <- paste0(cisbp_rna_dir, "logos_all_motifs/")
cisbp_rna_motif_dir <- paste0(cisbp_rna_dir, "pwms_all_motifs/")
cisbp_rna_motifs <-
init_cisbp_rna_motifs(
cisbp_rna_metainfo_file,
cisbp_rna_logo_dir,
cisbp_rna_motif_dir
)
message("processing RBPDB motifs")
rbpdb_dir <- "data-raw/RBPDB_v1.3.1_human_2012-11-21_CSV/"
rbpdb_metainfo_proteins_file <-
paste0(rbpdb_dir, "RBPDB_v1.3.1_proteins_human_2012-11-21.csv")
rbpdb_metainfo_experiments_file <-
paste0(
rbpdb_dir,
"RBPDB_v1.3.1_experiments_human_2012-11-21.csv"
)
rbpdb_metainfo_mapping_file <-
paste0(rbpdb_dir, "RBPDB_v1.3.1_protExp_human_2012-11-21.csv")
rbpdb_motif_dir <- paste0(rbpdb_dir, "PFMDir/")
rbpdb_motifs <-
init_rbpdb_motifs(
rbpdb_metainfo_proteins_file,
rbpdb_metainfo_experiments_file,
rbpdb_metainfo_mapping_file,
rbpdb_motif_dir
)
motifs <- c(cisbp_rna_motifs, rbpdb_motifs)
# remove all logos without RBP labels or length < 6
idx <- sapply(motifs, function(motif) {
return(!is.null(get_rbps(motif)) &&
stringr::str_trim(get_rbps(motif)) != "" && get_width(motif) > 5)
})
motifs <- motifs[idx]
# remove NULL entries
motifs <- motifs[!sapply(motifs, is.null)]
save(motifs, file = "data/motifs.rda", compress = "xz")
invisible(motifs)
}
#' parsing CISBP-RNA motifs
init_cisbp_rna_motifs <-
function(cisbp_rna_metainfo_file,
cisbp_rna_logo_dir,
cisbp_rna_motif_dir) {
codes_map <- init_iupac_lookup_table()
cisbp_rna_metainfo <-
read.table(
cisbp_rna_metainfo_file,
sep = "\t",
header = TRUE,
stringsAsFactors = FALSE
)
cisbp_rna_motif_files <-
list.files(
path = cisbp_rna_motif_dir,
full.names = TRUE,
pattern = "txt"
)
valid_files <-
apply(as.matrix(cisbp_rna_motif_files), 1, function(x) {
file.info(x)$size > 0
})
cisbp_rna_motif_files <- cisbp_rna_motif_files[valid_files]
counter <- 0
cisbp_rna_motifs <-
apply(as.matrix(cisbp_rna_motif_files), 1, function(file) {
id <- as.character(basename(tools::file_path_sans_ext(file)))
rbp_info <- filter(cisbp_rna_metainfo, Motif_ID == id)
rbps <- stringr::str_trim(rbp_info$RBP_Name)
species <-
stringr::str_trim(gsub("_", " ", rbp_info$RBP_Species[1], fixed = TRUE))
type <- stringr::str_trim(rbp_info$Motif_Type[1])
src <- stringr::str_trim(rbp_info$MSource_Identifier[1])
if (length(rbp_info$MSource_Version) > 0 &&
!is.null(rbp_info$MSource_Version[1]) &&
rbp_info$MSource_Version[1] != "" &&
rbp_info$MSource_Version[1] != "NULL") {
src <- paste(src, stringr::str_trim(rbp_info$MSource_Version[1]))
}
matrix <- read.table(file, header = TRUE, sep = "\t")
matrix <- matrix[, !(names(matrix) == "Pos")]
iupac <- generate_iupac_by_matrix(matrix, code = codes_map)
# check whether zero probabilities are present
if (!(min(matrix) > 0)) {
# convert probabilities (PPM) to frequencies (PFM) - assuming smallest probability is count 1
matrix <- matrix / min(matrix[matrix > 0])
# Laplace smoothing / additive smoothing to avoid zeros
matrix <- matrix + 0.25
# convert absolute frequencies (PFM) to relative frequencies (PPM)
matrix <- matrix / rowSums(matrix)
}
# converting relative frequencies (PPM) to weights (PWM)
# assuming equiprobable nucleotides -> a priori probabilities: 0.25
matrix <- log2(matrix / 0.25)
length <- nrow(matrix)
hexamers <- generate_kmers_from_iupac(iupac, 6)
heptamers <- generate_kmers_from_iupac(iupac, 7)
counter <<- counter + 1
message(paste0("processed ", counter, " PSSMs"))
return(new("RBPMotif", id = id, rbps = rbps, matrix = matrix,
hexamers = hexamers, heptamers = heptamers,
length = as.integer(length), iupac = iupac,
type = type, species = species, src = src))
})
# remove redundant motifs (motifs also present in RBPDB)
redundant_motif_ids <-
c(
"M269_0.6",
"M296_0.6",
"M349_0.6",
"M353_0.6",
"M273_0.6",
"M331_0.6",
"M352_0.6",
"M318_0.6",
"M319_0.6",
"M256_0.6",
"M245_0.6",
"M325_0.6",
"M346_0.6",
"M323_0.6",
"M274_0.6",
"M345_0.6",
"M291_0.6",
"M354_0.6",
"M290_0.6",
"M328_0.6",
"M272_0.6",
"M347_0.6",
"M262_0.6",
"M271_0.6",
"M298_0.6",
"M332_0.6",
"M350_0.6",
"M351_0.6",
"M348_0.6",
"M317_0.6",
"M329_0.6",
"M330_0.6"
)
non_redundant_idx <-
sapply(cisbp_rna_motifs, function(motif) {
return(!(get_id(motif) %in% redundant_motif_ids))
})
cisbp_rna_motifs <- cisbp_rna_motifs[non_redundant_idx]
return(cisbp_rna_motifs)
}
#' parsing RBPDB motifs
init_rbpdb_motifs <-
function(rbpdb_metainfo_proteins_file,
rbpdb_metainfo_experiments_file,
rbpdb_metainfo_mapping_file,
rbpdb_motif_dir,
rbpdb_version = "v1.3.1") {
codes_map <- init_iupac_lookup_table()
rbpdb_metainfo_proteins <-
read.table(
rbpdb_metainfo_proteins_file,
sep = ",",
header = FALSE,
stringsAsFactors = FALSE
)
rbpdb_metainfo_experiments <-
read.table(
rbpdb_metainfo_experiments_file,
sep = ",",
header = FALSE,
stringsAsFactors = FALSE
)
rbpdb_metainfo_mapping <-
read.table(
rbpdb_metainfo_mapping_file,
sep = ",",
header = FALSE,
stringsAsFactors = FALSE
)
rbpdb_motif_files <-
list.files(
path = rbpdb_motif_dir,
full.names = TRUE,
pattern = "pfm"
)
colnames(rbpdb_metainfo_proteins) <-
c(
"uniqueId",
"annotId",
"createDate",
"updateDate",
"geneName",
"geneDesc",
"species",
"taxID",
"domains",
"flag",
"flagNote",
"aliases",
"PDBIDs"
)
colnames(rbpdb_metainfo_experiments) <-
c(
"uniqueId",
"expId",
"exptype",
"notes",
"sequence_motif",
"SELEX_file",
"aligned_SELEX_file",
"aligned_motif_file",
"logo_file",
"PWM_file",
"PFM_file",
"invivo_notes",
"invivo_file",
"secondary_structure",
"flag"
)
colnames(rbpdb_metainfo_mapping) <-
c("uniqueId", "proteinId", "expId", "homolog")
valid_files <-
apply(as.matrix(rbpdb_motif_files), 1, function(x) {
file.info(x)$size > 0
})
rbpdb_motif_files <- rbpdb_motif_files[valid_files]
counter <- 0
rbpdb_motifs <-
apply(as.matrix(rbpdb_motif_files), 1, function(file) {
id <- as.character(basename(tools::file_path_sans_ext(file)))
protein_id <-
unlist(strsplit(id, "_", fixed = TRUE))[1]
exp_id <- unlist(strsplit(id, "_", fixed = TRUE))[2]
mapping <-
filter(rbpdb_metainfo_mapping, proteinId == protein_id)
if (nrow(mapping) != 1) {
warning(paste0("protein id mapping not unique for id: ", protein_id))
}
unique_id <- stringr::str_trim(mapping$uniqueId[1])
proteins_metainfo <-
filter(rbpdb_metainfo_proteins, uniqueId == unique_id)
if (nrow(mapping) != 1) {
warning(paste0("unique id mapping not unique for id: ", unique_id))
}
experiments_metainfo <-
filter(rbpdb_metainfo_experiments, expId == exp_id)
if (nrow(mapping) != 1) {
warning(paste0("exp id mapping not unique for id: ", exp_id))
}
rbps <- stringr::str_trim(proteins_metainfo$geneName[1])
species <- stringr::str_trim(proteins_metainfo$species[1])
type <- stringr::str_trim(experiments_metainfo$exptype[1])
src <- paste("RBPDB", rbpdb_version)
matrix <- read.table(file, header = FALSE)
matrix <- t(matrix)
# convert PPM to PFM if matrix is PPM
if (!all(matrix %% 1 == 0)) {
# PPM
# convert probabilities (PPM) to frequencies (PFM) - assuming smallest probability is count 1
matrix <- matrix / min(matrix[matrix > 0])
}
# check whether zero probabilities are present
if (!(min(matrix) > 0)) {
# Laplace smoothing / additive smoothing to avoid zeros
matrix <- matrix + 0.25
}
# convert absolute frequencies to relative frequencies
matrix <- matrix / rowSums(matrix)
matrix <- data.frame(matrix)
colnames(matrix) <- c("A", "C", "G", "U")
rownames(matrix) <- 1:nrow(matrix)
iupac <- generate_iupac_by_matrix(matrix, code = codes_map)
# convert relative frequencies (PFM) to weights (PWM)
# assuming equiprobable nucleotides -> a priori probabilities: 0.25
matrix <- log2(matrix / 0.25)
length <- nrow(matrix)
hexamers <- generate_kmers_from_iupac(iupac, 6)
heptamers <- generate_kmers_from_iupac(iupac, 7)
counter <<- counter + 1
message(paste0("processed ", counter, " PSSMs"))
return(new("RBPMotif", id = id, rbps = rbps, matrix = matrix,
hexamers = hexamers, heptamers = heptamers,
length = as.integer(length), iupac = iupac,
type = type, species = species, src = src))
})
return(rbpdb_motifs)
}
init_motifs()
Hi! My apologies for the delayed response, I had missed this message. This is excellent - I had switched to using PPMs / PWMs, and my motif database is CISBP-RNA. This script will be great to validate that the object build was correct. Thanks again for your clarification and guidance!
Hi! I was wondering if you plan to include PWM for other species in order to run the analysis. I am very interested in mouse :-P Thanks!