Closed · pchiang5 closed this 10 months ago
Thanks, pchiang5. That is a great question. @nikostrasan Any idea?
Hello pchiang5, that is an interesting question, indeed. Dictys uses Homer to scan for motifs using a reference motif input file (e.g. for human data, we used "HOCOMOCOv11_full_HUMAN_mono_homer_format_0.0001.motif"). One can use any custom ".motif" file as input, as long as it is in a Homer format (http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html).
We haven't tried to use the SCENIC+ (rcistarget) ".cb" motifs, yet. However, one suggestion would be to merge the individual ".cb" files of interest (from https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/singletons/ ) into one file, and then convert this to Homer format (http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html).
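For illustration, a rough sketch of what such a conversion could look like. This is a hypothetical example, not Homer's official computation: the A/C/G/T column order, the 0.001 pseudocount, the naive consensus rule, and the 0.7 threshold fraction are all assumptions chosen for the sketch.

```python
import math

def cb_to_homer(name, count_rows, pseudo=0.001):
    """Convert one cluster-buster-style count matrix to a HOMER-format entry.
    count_rows: one list of [A, C, G, T] counts per motif position."""
    probs = []
    for row in count_rows:
        total = sum(row)
        probs.append([max(c / total, pseudo) for c in row])
    # naive consensus: most probable base per position (no IUPAC degeneracy)
    consensus = "".join("ACGT"[p.index(max(p))] for p in probs)
    # maximum achievable log-odds score vs a uniform 0.25 background,
    # scaled down to act as a permissive detection threshold (arbitrary choice)
    max_score = sum(math.log(max(p) / 0.25) for p in probs)
    threshold = 0.7 * max_score
    lines = [">%s\t%s\t%.3f" % (consensus, name, threshold)]
    for p in probs:
        lines.append("\t".join("%.3f" % x for x in p))
    return "\n".join(lines)

entry = cb_to_homer("RBPJ_test", [[5, 0, 5, 0], [0, 0, 0, 10], [0, 0, 10, 0]])
print(entry)
```

The real threshold should of course follow Homer's own definition of the log odds detection threshold (see the link above); this only shows the file layout.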
@lingfeiwang & @nikostrasan
I managed to make motifs.motif following your suggestions. Thanks.
However, the resultant motifs.motif contains 313894 motifs (~11000 unique ones for ~1400 unique gene names, which means each motif is represented by multiple gene names). It took forever to run the following step:
OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys chromatin homer --nth 30 tmp_static/Subset0/footprints.bed data/motifs.motif data/genome tmp_static/Subset0/expression.tsv.gz tmp_static/Subset0/motifs.bed tmp_static/Subset0/wellington.tsv.gz tmp_static/Subset0/homer.tsv.gz
Is there any way to speed it up (Like run with the 11000 unique motifs and assign these multiple gene names to these motifs later)?
Hello @pchiang5 , thanks for your question. Dictys uses homer findMotifsGenome.pl to scan for motifs from a reference motif file, which performs a series of processes for each motif (e.g. background corrections, normalizations, etc.) and takes a reasonable time (http://homer.ucsd.edu/homer/ngs/peakMotifs.html). During our benchmarking, we used alternative, curated motif databases (including Homer motifs, Jaspar, Hocomocov11, etc.), which were specific to the organism we were examining (e.g. HOCOMOCOv11_core_HUMAN_mono_homer_format_0.0001.motif for human datasets and HOCOMOCOv11_core_MOUSE_mono_homer_format_0.0001.motif for mouse datasets). Each of them contained roughly 100-2000 motifs, not nearly as many as in your case (~300k), so we will need to investigate this further.
However, two suggestions that come to my mind are:
Reduce the number of motifs to scan. For this you could:
a) focus on the motifs of the organism you study (rather than using motifs found in all organisms). This might also improve the specificity of your results.
b) remove motifs that are highly similar to each other. Rcistarget should provide similarity metrics (e.g. motif_similarity_qvalue) to help you keep one representative motif per cluster.
Our current Dictys version uses the older homer package to run findMotifsGenome.pl. However, the newer homer2 tool offers additional parameters, including "-p <#> (Number of processors to use, default: 1)", which could possibly accelerate processing (http://homer.ucsd.edu/homer/ngs/peakMotifs.html). We will include this modification in our next version of Dictys (@lingfeiwang @nikostrasan). Thank you for your questions and feedback.
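Suggestion (b) above could be sketched as follows. This is a hypothetical example that assumes you have already extracted pairwise similarity q-values (e.g. motif_similarity_qvalue) from rcistarget into (motif_a, motif_b, q) tuples; it simply clusters motifs connected by a significant similarity and keeps one representative per cluster.

```python
def pick_representatives(motifs, similar_pairs, q_cut=0.05):
    """Union-find clustering of motifs by pairwise similarity;
    keeps one (alphabetically first) representative per cluster."""
    parent = {m: m for m in motifs}

    def find(m):
        # path-halving root lookup
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b, q in similar_pairs:
        if q <= q_cut:  # significantly similar -> same cluster
            parent[find(a)] = find(b)
    clusters = {}
    for m in motifs:
        clusters.setdefault(find(m), []).append(m)
    return sorted(min(members) for members in clusters.values())

reps = pick_representatives(
    ["M1", "M2", "M3", "M4"],
    [("M1", "M2", 0.001), ("M3", "M4", 0.5)],
)
print(reps)  # → ['M1', 'M3', 'M4']: M1/M2 merge, M3 and M4 stay separate
```

The q-value cutoff and the choice of representative are up to you; a more careful version might keep the most information-rich motif of each cluster instead of the first one.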
Thanks, @pchiang5 and @nikostrasan. Let me add to the discussion.
We can definitely improve speed as you suggested, based on unique motifs. We can work on that and let you know when it is ready, hopefully in a few days. Could you help test it?
Still, 11,000 unique motifs appears to be a very large number. It is about 1/3 of all 8-mers and will probably map everywhere in the genome. Is there any way to shrink it on your side, as @nikostrasan suggested?
Also, is the log odds detection threshold computed correctly in your motif file? It is the third column of the header line. See http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html.
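For reference, my reading of that page is that a site's score is the sum over positions of log(p/0.25) against a uniform background, and the third column is the minimum score for a site to be called. A tiny sketch to sanity-check a threshold against a PWM (the 0.001 probability floor is an assumption, not Homer's exact code):

```python
import math

def log_odds(pwm, seq, floor=0.001):
    """Log-odds score of seq against pwm (rows of A,C,G,T probabilities)
    relative to a uniform 0.25 background."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    return sum(math.log(max(row[idx[b]], floor) / 0.25) for row, b in zip(pwm, seq))

pwm = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.0, 0.0, 1.0],
       [0.0, 0.0, 1.0, 0.0]]
# "ATG" picks the most probable base at every position,
# so this is the best achievable score for this PWM
print(round(log_odds(pwm, "ATG"), 3))  # → 3.466
```

If your file's thresholds exceed the best achievable score of their own matrices, no site can ever be called.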
Btw, if you are running the pipeline to infer several GRNs, such as cell-type specific or dynamic GRNs, we do not suggest setting n_threads > 4 (in your case 30). Parallelism is far more efficient between GRN inference tasks than within one.
I believe we tested homer's -p parameter here, but it did not work as well as intended @nikostrasan.
If that sounds good to you, we can start working on it. So please let us know!
Lingfei
Hello @lingfeiwang and @nikostrasan,
Certainly I am more than willing to do the tests!
This database contains ~11,000 motifs and might be collapsible to ~4,000 by metafamilies, which is still a big number. Further, a major bottleneck is the association of multiple gene names with each motif: if findMotifsGenome.pl needs to run identical motifs separately for each gene name, it would take a tremendous amount of time (growing from ~11,000 to ~330,000 motifs in this case)...
The log odds detection threshold can be specified in write_homer() at the last step (please see below), if my understanding is correct.
Below are my scripts for generating motifs.motif in Homer format, for your reference. Hopefully the slowdown is not due to issues during the format conversion.
# Create the 'modified' directory if it doesn't exist
mkdir -p ./modified
for file in *.cb; do
base=$(basename "$file" .cb) # Get the filename without extension
modified_filename="./modified/${base}.cb" # Path for the modified file
sed "s/^>.*/>$base/" "$file" > "$modified_filename" # Modify and save the file
done
cd modified
cat *.cb > combined.csv  # concatenate all modified .cb files
awk '/^>/ && ++count>1 { print "==" } 1' combined.csv > marked.csv  # insert '==' separators between motifs for read_matrix()
library(universalmotif)
# read motif
cisbp <- read_matrix('/v10nr_clust_public/singletons/modified/marked.csv',
                     skip = 0, 'PPM', positions = "rows",
                     alphabet = "DNA", headers = "^>", sep = '==',
                     rownames = FALSE, comment = NULL)
length(cisbp)
#[1] 17995
ref = data.table::fread('/v10nr_clust_public/snapshots/motifs-v10-nr.hgnc-m0.00001-o0.0.tbl')
dim(ref)
#[1] 253096 13
filtered_cisbp <- list()
for (i in seq_along(cisbp)) {
  motif_name <- cisbp[[i]]@name
  # the column name starts with '#', so backticks are required
  if (motif_name %in% ref$`#motif_id`) {
    filtered_cisbp <- c(filtered_cisbp, cisbp[[i]])
  }
}
length(filtered_cisbp)  # unique motifs (including families)
# motif_gene_dict (built beforehand from ref) maps each motif ID to its gene names
named_cisbp <- lapply(seq_along(filtered_cisbp), function(i) {
  motif_name <- filtered_cisbp[[i]]@name
  matching_gene_names <- motif_gene_dict[[motif_name]]
  if (!is.null(matching_gene_names)) {
    lapply(matching_gene_names, function(gene_name) {
      new_motif_name <- paste(gene_name, '_HUMAN.H11MO.', i, '.B', sep = '')
      new_motif <- filtered_cisbp[[i]]  # note: was cisbp[[i]], which indexes the unfiltered list
      new_motif@name <- new_motif_name
      new_motif
    })
  }
})
length(named_cisbp)
flattened_cisbp <- unlist(named_cisbp, recursive = FALSE)
length(flattened_cisbp)
write_homer(flattened_cisbp, 'data/motifs.motif', threshold = 0.0001, threshold.type = "pvalue", overwrite = T, append = FALSE)
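Before kicking off the long homer run, a small validation sketch may help catch conversion issues early. This is a hypothetical check, assuming the output follows the HOMER layout of '>' header lines followed by four-column probability rows:

```python
def check_homer_file(lines):
    """Count motifs and flag probability rows that do not have
    exactly four columns summing to ~1."""
    n_motifs, bad_rows = 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_motifs += 1
        else:
            vals = [float(x) for x in line.split()]
            if len(vals) != 4 or abs(sum(vals) - 1.0) > 0.01:
                bad_rows += 1
    return n_motifs, bad_rows

sample = [">RTGRGAR\tRBPJ_HUMAN.H11MO.1.B\t10.932",
          "0.500 0.000 0.500 0.000",
          "0.000 0.000 0.000 1.000"]
print(check_homer_file(sample))  # → (1, 0)
```

Running it over the full motifs.motif (e.g. check_homer_file(open('data/motifs.motif'))) should report the expected motif count and zero bad rows.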
Thank you, pchiang5.
Actually, it would be easier to see several motifs from your converted file. Could you paste one motif from the beginning, one from the middle, and one from the end of your converted HOMER motif file? I am hoping to have all other aspects sorted, especially the log odds detection threshold, before adapting Dictys to unique motifs.
Lingfei
@lingfeiwang
Please see below for the clips and the zipped motifs file.
first 2:
>RTGRGAR RBPJ_HUMAN.H11MO.5.B 10.932
0.500 0.000 0.500 0.000
0.000 0.000 0.000 1.000
0.000 0.000 1.000 0.000
0.500 0.000 0.500 0.000
0.000 0.000 1.000 0.000
1.000 0.000 0.000 0.000
0.500 0.000 0.500 0.000
>SNNCTAATCCN ETV2_HUMAN.H11MO.8.B 10.604
0.208 0.458 0.292 0.042
0.208 0.417 0.292 0.083
0.250 0.333 0.292 0.125
0.083 0.542 0.208 0.167
0.042 0.083 0.000 0.875
1.000 0.000 0.000 0.000
1.000 0.000 0.000 0.000
0.083 0.000 0.125 0.792
0.042 0.833 0.042 0.083
0.042 0.833 0.042 0.083
0.208 0.292 0.250 0.250
middle 2:
>ATGCGGGY MAX_HUMAN.H11MO.8758.B 10.477
0.867 0.022 0.089 0.022
0.000 0.000 0.000 1.000
0.062 0.000 0.917 0.021
0.000 0.979 0.021 0.000
0.021 0.000 0.875 0.104
0.000 0.000 1.000 0.000
0.021 0.000 0.979 0.000
0.000 0.323 0.000 0.677
>NTRNGGGYNN GCM1_HUMAN.H11MO.8759.B 8.868
0.465 0.213 0.222 0.101
0.123 0.125 0.085 0.667
0.302 0.193 0.498 0.008
0.167 0.497 0.202 0.134
0.109 0.084 0.619 0.188
0.016 0.008 0.936 0.041
0.028 0.008 0.937 0.027
0.110 0.337 0.124 0.429
0.299 0.219 0.257 0.226
0.224 0.287 0.263 0.226
last 2:
>TTAYRTAA TEF_HUMAN.H11MO.17980.B 9.868
0.047 0.050 0.035 0.868
0.018 0.016 0.059 0.907
0.921 0.038 0.026 0.015
0.026 0.608 0.042 0.324
0.326 0.069 0.556 0.049
0.017 0.019 0.032 0.932
0.832 0.079 0.036 0.053
0.861 0.040 0.037 0.062
>GGGRSGCNNW GTF2B_HUMAN.H11MO.17995.B 11.175
0.000 0.000 0.999 0.000
0.000 0.000 0.999 0.000
0.175 0.000 0.825 0.000
0.330 0.000 0.669 0.000
0.000 0.625 0.374 0.000
0.000 0.000 0.999 0.000
0.000 0.874 0.125 0.000
0.149 0.303 0.101 0.446
0.426 0.193 0.097 0.285
0.251 0.000 0.000 0.749
Thank you, pchiang5.
They look mostly fine to me. A minor difference from http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html is that they seem to use a minimum probability of 0.001. An exact 0 might penalize the odds ratio harshly, but given that your motif file seems to tolerate 0 mismatches, that shouldn't make a difference.
@nikostrasan, if you catch anything odd, could you let us know please?
@pchiang5, I will adapt Dictys for unique motifs. I'm considering the unique motif name format GATA1,GATA2,GATA3_HUMAN.H11MO... instead of the duplicated GATA1_HUMAN.H11MO..., GATA2_HUMAN.H11MO..., GATA3_HUMAN.H11MO.... If that looks good to you, could you also switch your unique motif names to this format please?
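A sketch of what that renaming could look like, purely for illustration: it assumes names follow the GENE_HUMAN.H11MO.<i>.B pattern (splitting at the first underscore) and that motifs sharing the same matrix and suffix should be merged.

```python
def merge_duplicate_motifs(motifs):
    """motifs: list of (name, matrix) with names like 'GATA1_HUMAN.H11MO.5.B'.
    Groups entries with identical matrix and suffix, and joins the
    gene parts into one comma-separated name."""
    groups = {}
    for name, matrix in motifs:
        gene, _, suffix = name.partition("_")        # gene vs rest of the name
        key = tuple(map(tuple, matrix))              # hashable matrix identity
        groups.setdefault((key, suffix), []).append(gene)
    return [("%s_%s" % (",".join(sorted(genes)), suffix),
             [list(r) for r in key])
            for (key, suffix), genes in groups.items()]

m = [[0.5, 0.0, 0.5, 0.0]]
merged = merge_duplicate_motifs([("GATA1_HUMAN.H11MO.5.B", m),
                                 ("GATA2_HUMAN.H11MO.5.B", m)])
print(merged[0][0])  # → GATA1,GATA2_HUMAN.H11MO.5.B
```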
I will let you know when the feature is ready.
Lingfei
@lingfeiwang and @nikostrasan,
Thank you for your prompt response.
Please see below and let me know if any modification is required.
>RTGRGAR RBPJ_HUMAN.H11MO.1.B 10.932
0.500 0.000 0.500 0.000
0.000 0.000 0.000 1.000
0.000 0.000 1.000 0.000
0.500 0.000 0.500 0.000
0.000 0.000 1.000 0.000
1.000 0.000 0.000 0.000
0.500 0.000 0.500 0.000
>SNNCTAATCCN ETV2,GSC2,PITX3_HUMAN.H11MO.2.B 10.604
0.208 0.458 0.292 0.042
0.208 0.417 0.292 0.083
0.250 0.333 0.292 0.125
0.083 0.542 0.208 0.167
0.042 0.083 0.000 0.875
1.000 0.000 0.000 0.000
1.000 0.000 0.000 0.000
0.083 0.000 0.125 0.792
0.042 0.833 0.042 0.083
0.042 0.833 0.042 0.083
0.208 0.292 0.250 0.250
>RTAAAA FOXC2_HUMAN.H11MO.3.B 11.037
0.667 0.000 0.333 0.000
0.000 0.000 0.000 1.000
1.000 0.000 0.000 0.000
1.000 0.000 0.000 0.000
0.800 0.200 0.000 0.000
1.000 0.000 0.000 0.000
>GGGGAWWYCYS NFKB1,NFKB2,REL,RELA,RELB_HUMAN.H11MO.4.B 10.579
0.130 0.040 0.600 0.230
0.000 0.030 0.970 0.000
0.000 0.000 1.000 0.000
0.000 0.000 1.000 0.000
0.830 0.000 0.000 0.170
0.700 0.030 0.000 0.270
0.270 0.000 0.000 0.730
0.200 0.270 0.000 0.530
0.030 0.940 0.000 0.030
0.000 0.560 0.070 0.370
0.170 0.470 0.300 0.060
Hi pchiang5,
The multi-TF motif functionality is now ready in this commit. You can first install Dictys 1.0.0 and then upgrade in the conda environment with pip3 install --no-deps --force-reinstall git+https://github.com/pinellolab/dictys@ef1436c
Could you try this version and let us know how it works out? I also suggest trying different motif files, either from other databases or your current one after further refinement, to see which gives the most relevant biology. I feel the best GRN is inferred with a motif file that strikes the right balance between motif quantity, quality, and specificity.
Lingfei
Hi Lingfei,
This database yielded reasonable output with the SCENIC+ workflow, so I wonder if Dictys could complement those findings using the same database.
Still, the database seems so big that the run has remained at the chromatin homer step for more than 24 hours. Let us see if it can get to the next step soon.
Hi @pchiang5,
I tested your motif file with the short-multiome tutorial, using only the first 10,000 lines. The chromatin homer command finished in two minutes. You may want to double check the log odds detection threshold against http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html. The values you obtained from universalmotif do not seem to match the official computation method by sequence or PWM. Thresholds that are too strong can give no result and finish the run right away, while thresholds that are too weak can make the run extremely slow. You may have a combination of both.
Lingfei
Thank you @lingfeiwang!
Yes, the step has been stuck for 48 hours already. I will try to find out what is going wrong with my conversion.
Hello,
Thank you for the great tool.
Can Dictys use the motif database (https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/) of SCENIC+? Their database seems to be in cluster-buster format. Is there a way to convert it to the HOMER one?