pinellolab / dictys

Context specific and dynamic gene regulatory network reconstruction and analysis
GNU Affero General Public License v3.0

Use of v10nr_clust_public in SCENIC+ #26

Closed pchiang5 closed 10 months ago

pchiang5 commented 10 months ago

Hello,

Thank you for the great tool.

Can dictys use the motif database (https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/) of SCENIC+? Their database seems to be in Cluster-Buster format. Is there a way to convert it to the HOMER one?

lingfeiwang commented 10 months ago

Thanks, pchiang5. That is a great question. @nikostrasan Any idea?

nikostrasan commented 10 months ago

Hello pchiang5, that is an interesting question, indeed. Dictys uses HOMER to scan for motifs, using a reference motif input file (e.g. for human data, we used "HOCOMOCOv11_full_HUMAN_mono_homer_format_0.0001.motif"). One can use any custom ".motif" file as input, as long as it is in HOMER format (http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html).

We haven't tried to use the SCENIC+ (rcistarget) ".cb" motifs, yet. However, one suggestion would be to merge the individual ".cb" files of interest (from https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/singletons/ ) into one file, and then convert this to Homer format (http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html).
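For readers attempting the same merge-and-convert route, the idea can be sketched roughly as below. This is an illustrative Python sketch, not part of Dictys or SCENIC+: the assumed .cb layout (a ">name" header over rows of A/C/G/T counts) and the placeholder detection threshold (60% of the maximum achievable log-odds score) are assumptions to be checked against the HOMER documentation.

```python
# Hypothetical sketch: convert one Cluster-Buster (.cb) count matrix to a
# HOMER-format motif entry. NOT the official conversion -- the threshold
# below is a crude placeholder, chosen as 60% of the best possible score.
import math

def cb_to_homer(cb_text, pseudocount=0.5):
    lines = [l.strip() for l in cb_text.strip().splitlines()]
    name = lines[0].lstrip(">")
    rows = []
    for l in lines[1:]:
        counts = [float(x) for x in l.split()]
        total = sum(counts) + 4 * pseudocount
        rows.append([(c + pseudocount) / total for c in counts])
    # Consensus letter = most probable base at each position.
    consensus = "".join("ACGT"[max(range(4), key=r.__getitem__)] for r in rows)
    # Max log-odds score vs. a uniform background (natural log; an
    # assumption about HOMER's scoring convention).
    max_score = sum(math.log(max(r) / 0.25) for r in rows)
    threshold = 0.6 * max_score  # placeholder -- tune per the HOMER docs
    out = [f">{consensus}\t{name}\t{threshold:.3f}"]
    for r in rows:
        out.append("\t".join(f"{p:.3f}" for p in r))
    return "\n".join(out)

example_cb = """>RBPJ_example
10 0 10 0
0 0 0 20
0 0 20 0
"""
print(cb_to_homer(example_cb))
```

Looping this over the singleton .cb files and concatenating the results would give one combined .motif file.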

pchiang5 commented 10 months ago

@lingfeiwang & @nikostrasan

I managed to make motifs.motif following your suggestions. Thanks.

However, the resultant motifs.motif contains 313,894 motifs (~11,000 unique ones across ~1,400 unique gene names, which means each motif is duplicated under multiple gene names). The following step then took forever to run:

OPENBLAS_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_MAX_THREADS=1 NUMEXPR_MAX_THREADS=1 MKL_MAX_THREADS=1 python3 -m dictys  chromatin homer --nth 30 tmp_static/Subset0/footprints.bed data/motifs.motif data/genome tmp_static/Subset0/expression.tsv.gz tmp_static/Subset0/motifs.bed tmp_static/Subset0/wellington.tsv.gz tmp_static/Subset0/homer.tsv.gz

Is there any way to speed it up (e.g., run with only the ~11,000 unique motifs and assign the multiple gene names to them afterwards)?

nikostrasan commented 10 months ago

Hello @pchiang5, thanks for your question. Dictys uses HOMER's findMotifsGenome.pl to scan for motifs from a reference motif file; it performs a series of steps for each motif (e.g. background corrections, normalizations) and takes a reasonable amount of time (http://homer.ucsd.edu/homer/ngs/peakMotifs.html). During our benchmarking we used alternative, curated motif databases (including HOMER motifs, JASPAR, HOCOMOCO v11, etc.) that were specific to the organism under study (e.g. HOCOMOCOv11_core_HUMAN_mono_homer_format_0.0001.motif for human datasets and HOCOMOCOv11_core_MOUSE_mono_homer_format_0.0001.motif for mouse datasets). Each of them contained roughly 100-2,000 motifs, far fewer than in your case (~300k), so we will need to investigate this further.

However, two suggestions that come to my mind are:

  1. Reduce the number of motifs to scan. For this you could:
     a) focus on the motifs of the organism you study (rather than motifs found across all organisms), which might also improve the specificity of your results;
     b) remove motifs that are highly similar to each other. RcisTarget should provide similarity metrics (e.g. motif_similarity_qvalue) to help you keep one representative motif per cluster.

  2. Our current Dictys version uses the older homer package to run findMotifsGenome.pl. However, the newer homer2 tool offers additional user parameters, including "-p <#> (Number of processors to use, default: 1)", which could possibly accelerate processing (http://homer.ucsd.edu/homer/ngs/peakMotifs.html). We will include this modification in the next version of Dictys (@lingfeiwang @nikostrasan). Thank you for your questions and feedback.
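Suggestion 1b can be approximated even without similarity metrics when many entries are exact duplicates. The sketch below (illustrative only, not Dictys' actual code) shrinks a HOMER-format motif file by keeping one entry per unique matrix; it assumes the standard layout of a ">consensus\tname\tthreshold" header followed by probability rows.

```python
# Illustrative sketch: collapse motifs with byte-identical matrices in a
# HOMER-format motif file, keeping the first header for each matrix.
def dedup_homer_motifs(text):
    motifs, cur = [], None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            cur = (line, [])        # (header line, matrix rows)
            motifs.append(cur)
        elif cur is not None:
            cur[1].append(line)
    seen = {}
    for header, rows in motifs:
        seen.setdefault(tuple(rows), (header, rows))  # first header wins
    return "\n".join(
        "\n".join([header] + rows) for header, rows in seen.values()
    )

# Two entries with identical matrices but different names:
two_dups = (">AC\tTF1\t1.0\n0.9\t0.1\t0.0\t0.0\n0.0\t0.9\t0.0\t0.1\n"
            ">AC\tTF2\t1.0\n0.9\t0.1\t0.0\t0.0\n0.0\t0.9\t0.0\t0.1")
print(dedup_homer_motifs(two_dups).count(">"))  # one motif survives
```

Near-duplicates (slightly perturbed matrices) would still need a similarity-based clustering step, as nikostrasan suggests.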

lingfeiwang commented 10 months ago

Thanks, @pchiang5 and @nikostrasan. Let me add to the discussion.

We can definitely improve speed as you suggested, by scanning unique motifs only. We will work on that and let you know when it is finished, hopefully in a few days. Can you help test it?

Still, 11,000 unique motifs appears to be a very large number. It is about a third of all 8-mers and will probably map everywhere. Is there any way to shrink it on your side, as @nikostrasan suggested?

Also, are the log odds detection thresholds computed correctly in your motif file? That is the third column of each header line. See http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html.

Btw, if you are running the pipeline to infer several GRNs, such as cell-type-specific or dynamic GRNs, we do not suggest setting n_threads > 4 (in your case 30). Parallelism is far more efficient between GRN inference tasks than within them.

I believe we tested homer's -p parameter, but it didn't work as well as intended @nikostrasan.

If that sounds good to you, we can start working on it, so please let us know!

Lingfei

pchiang5 commented 10 months ago

Hello @lingfeiwang and @nikostrasan,

Certainly I am more than willing to do the tests!

This database contains ~11,000 motifs and might collapse into ~4,000 by metafamilies, which is still a big number. Further, a major bottleneck is the association of multiple gene names with each motif: if findMotifsGenome.pl has to scan identical motifs separately for each gene name, it takes a tremendous amount of time (an expansion from 11,000 to ~330k in this case)...

If my understanding is right, the log odds detection threshold can be specified in write_homer() at the last step (please see below).

Below are my scripts for generating motifs.motif in HOMER format, for your reference. Hopefully the problem is not due to issues during the format conversion.

  1. concatenate the .cb files, separated by '==':

    # Create the 'modified' directory if it doesn't exist
    mkdir -p ./modified
    for file in *.cb; do
        base=$(basename "$file" .cb)               # filename without extension
        modified_filename="./modified/${base}.cb"  # path for the modified file
        sed "s/^>.*/>$base/" "$file" > "$modified_filename"  # rename the header to the filename
    done
    cd modified
    cat *.cb > combined.csv
    # insert a '==' separator before every motif header except the first
    awk '/^>/ && ++count>1 { print "==" } 1' combined.csv > marked.csv
  2. make motifs.motif in HOMER format (input downloaded from the SCENIC+ cisTarget database):

    library(universalmotif)
    # read motifs
    cisbp <- read_matrix('/v10nr_clust_public/singletons/modified/marked.csv',
                         skip = 0, 'PPM', positions = "rows",
                         alphabet = "DNA", headers = "^>", sep = '==',
                         rownames = FALSE, comment = NULL)
    length(cisbp)
    #[1] 17995
    ref <- data.table::fread('/v10nr_clust_public/snapshots/motifs-v10-nr.hgnc-m0.00001-o0.0.tbl')
    dim(ref)
    #[1] 253096     13

    filtered_cisbp <- list()
    for (i in seq_along(cisbp)) {
      motif_name <- cisbp[[i]]@name
      if (motif_name %in% ref$`#motif_id`) {
        filtered_cisbp <- c(filtered_cisbp, cisbp[[i]])
      }
    }
    length(filtered_cisbp)  # unique motifs (including families)
    #[1] 11466

    named_cisbp <- lapply(seq_along(filtered_cisbp), function(i) {
      motif_name <- filtered_cisbp[[i]]@name
      matching_gene_names <- motif_gene_dict[[motif_name]]
      if (!is.null(matching_gene_names)) {
        lapply(matching_gene_names, function(gene_name) {
          new_motif_name <- paste(gene_name, '_HUMAN.H11MO.', i, '.B', sep = '')
          new_motif <- cisbp[[i]]  # NB: indexes the unfiltered list; filtered_cisbp[[i]] may be intended
          new_motif@name <- new_motif_name
          new_motif
        })
      }
    })
    length(named_cisbp)
    #[1] 11466

    flattened_cisbp <- unlist(named_cisbp, recursive = FALSE)
    length(flattened_cisbp)
    #[1] 313894  <== massive expansion at this step; the same motifs are represented under different gene names

    write_homer(flattened_cisbp, 'data/motifs.motif', threshold = 0.0001,
                threshold.type = "pvalue", overwrite = TRUE, append = FALSE)
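Before handing a converted file of this size to a long HOMER run, a cheap sanity check on the output can catch format problems early. The sketch below is illustrative (not part of either pipeline) and assumes the standard HOMER layout of a ">consensus\tname\tthreshold" header over four-column probability rows.

```python
# Illustrative sanity check on a converted HOMER-format motifs.motif:
# count motifs and flag rows that do not have four columns summing to ~1.
def check_homer_file(text, tol=0.02):
    n_motifs, bad, name = 0, [], None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            n_motifs += 1
            fields = line.lstrip(">").split("\t")
            name = fields[1] if len(fields) > 1 else fields[0]
        elif line.strip():
            vals = [float(x) for x in line.split()]
            if len(vals) != 4 or abs(sum(vals) - 1.0) > tol:
                bad.append(name)
    return n_motifs, sorted(set(bad))

sample = (">RTGA\tRBPJ_HUMAN.H11MO.5.B\t10.932\n"
          "0.5\t0.0\t0.5\t0.0\n"
          "0.0\t0.0\t0.0\t1.0\n"
          ">BAD\tBROKEN\t1.0\n"
          "0.5\t0.5\t0.5\t0.5\n")
print(check_homer_file(sample))  # -> (2, ['BROKEN'])
```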

lingfeiwang commented 10 months ago

Thank you, pchiang5.

Actually, it would be easier to look at a few motifs from your converted file. Could you paste one motif from the beginning, one from the middle, and one from the end of your converted HOMER motif file? I hope to have all the other aspects sorted out, especially the log odds detection threshold, before adapting Dictys to unique motifs.

Lingfei

pchiang5 commented 10 months ago

@lingfeiwang

Please see below for the clips and the zipped motifs file.

motifs.zip

first 2:

>RTGRGAR    RBPJ_HUMAN.H11MO.5.B    10.932
0.500   0.000   0.500   0.000
0.000   0.000   0.000   1.000
0.000   0.000   1.000   0.000
0.500   0.000   0.500   0.000
0.000   0.000   1.000   0.000
1.000   0.000   0.000   0.000
0.500   0.000   0.500   0.000
>SNNCTAATCCN    ETV2_HUMAN.H11MO.8.B    10.604
0.208   0.458   0.292   0.042
0.208   0.417   0.292   0.083
0.250   0.333   0.292   0.125
0.083   0.542   0.208   0.167
0.042   0.083   0.000   0.875
1.000   0.000   0.000   0.000
1.000   0.000   0.000   0.000
0.083   0.000   0.125   0.792
0.042   0.833   0.042   0.083
0.042   0.833   0.042   0.083
0.208   0.292   0.250   0.250

middle 2:

>ATGCGGGY   MAX_HUMAN.H11MO.8758.B  10.477
0.867   0.022   0.089   0.022
0.000   0.000   0.000   1.000
0.062   0.000   0.917   0.021
0.000   0.979   0.021   0.000
0.021   0.000   0.875   0.104
0.000   0.000   1.000   0.000
0.021   0.000   0.979   0.000
0.000   0.323   0.000   0.677
>NTRNGGGYNN GCM1_HUMAN.H11MO.8759.B 8.868
0.465   0.213   0.222   0.101
0.123   0.125   0.085   0.667
0.302   0.193   0.498   0.008
0.167   0.497   0.202   0.134
0.109   0.084   0.619   0.188
0.016   0.008   0.936   0.041
0.028   0.008   0.937   0.027
0.110   0.337   0.124   0.429
0.299   0.219   0.257   0.226
0.224   0.287   0.263   0.226

last 2:

>TTAYRTAA   TEF_HUMAN.H11MO.17980.B 9.868
0.047   0.050   0.035   0.868
0.018   0.016   0.059   0.907
0.921   0.038   0.026   0.015
0.026   0.608   0.042   0.324
0.326   0.069   0.556   0.049
0.017   0.019   0.032   0.932
0.832   0.079   0.036   0.053
0.861   0.040   0.037   0.062
>GGGRSGCNNW GTF2B_HUMAN.H11MO.17995.B   11.175
0.000   0.000   0.999   0.000
0.000   0.000   0.999   0.000
0.175   0.000   0.825   0.000
0.330   0.000   0.669   0.000
0.000   0.625   0.374   0.000
0.000   0.000   0.999   0.000
0.000   0.874   0.125   0.000
0.149   0.303   0.101   0.446
0.426   0.193   0.097   0.285
0.251   0.000   0.000   0.749
lingfeiwang commented 10 months ago

Thank you, pchiang5.

They look mostly fine to me. A minor difference from http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html is that HOMER seems to use a minimum probability of 0.001. Exactly 0 might penalize the odds ratio harshly, but given that your motif file seems to tolerate 0 mismatches, that shouldn't make a difference.

@nikostrasan, if you catch anything odd, could you let us know please?

@pchiang5, I will adapt Dictys for unique motifs. I'm considering the unique motif name format GATA1,GATA2,GATA3_HUMAN.H11MO... instead of the duplicated GATA1_HUMAN.H11MO..., GATA2_HUMAN.H11MO..., GATA3_HUMAN.H11MO.... If that looks good to you, could you also switch your unique motif names to this format?
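The proposed renaming can be scripted mechanically: group entries whose matrices are identical and comma-join their TF prefixes. A minimal illustrative sketch (not Dictys code) is below; it assumes names of the form TF_HUMAN.H11MO.<i>.B and that duplicates share a byte-identical matrix, both of which are assumptions about the file.

```python
# Illustrative sketch: merge duplicated motifs under one comma-joined
# multi-TF name, e.g. GATA1,GATA2_HUMAN.H11MO.1.B.
def merge_duplicate_motifs(motifs):
    # motifs: list of (consensus, name, threshold, matrix_rows_as_tuple)
    merged = {}
    for consensus, name, thr, rows in motifs:
        tf, _, suffix = name.partition("_")  # "GATA1", "HUMAN.H11MO.5.B"
        if rows in merged:
            merged[rows][1].append(tf)
        else:
            merged[rows] = [(consensus, thr, suffix), [tf]]
    out = []
    for (consensus, thr, suffix), tfs in merged.values():
        out.append((consensus, ",".join(sorted(set(tfs))) + "_" + suffix, thr))
    return out

motifs = [("RTGR", "GATA1_HUMAN.H11MO.1.B", 10.0, ("0.5 0 0.5 0",)),
          ("RTGR", "GATA2_HUMAN.H11MO.2.B", 10.0, ("0.5 0 0.5 0",))]
print(merge_duplicate_motifs(motifs))  # one merged entry, "GATA1,GATA2_..."
```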

I will let you know when the feature is ready.

Lingfei

pchiang5 commented 10 months ago

@lingfeiwang and @nikostrasan,

Thank you for your prompt response.

Please see below and let me know if any modification is required.

motifs.zip

>RTGRGAR    RBPJ_HUMAN.H11MO.1.B    10.932
0.500   0.000   0.500   0.000
0.000   0.000   0.000   1.000
0.000   0.000   1.000   0.000
0.500   0.000   0.500   0.000
0.000   0.000   1.000   0.000
1.000   0.000   0.000   0.000
0.500   0.000   0.500   0.000
>SNNCTAATCCN    ETV2,GSC2,PITX3_HUMAN.H11MO.2.B 10.604
0.208   0.458   0.292   0.042
0.208   0.417   0.292   0.083
0.250   0.333   0.292   0.125
0.083   0.542   0.208   0.167
0.042   0.083   0.000   0.875
1.000   0.000   0.000   0.000
1.000   0.000   0.000   0.000
0.083   0.000   0.125   0.792
0.042   0.833   0.042   0.083
0.042   0.833   0.042   0.083
0.208   0.292   0.250   0.250
>RTAAAA FOXC2_HUMAN.H11MO.3.B   11.037
0.667   0.000   0.333   0.000
0.000   0.000   0.000   1.000
1.000   0.000   0.000   0.000
1.000   0.000   0.000   0.000
0.800   0.200   0.000   0.000
1.000   0.000   0.000   0.000
>GGGGAWWYCYS    NFKB1,NFKB2,REL,RELA,RELB_HUMAN.H11MO.4.B   10.579
0.130   0.040   0.600   0.230
0.000   0.030   0.970   0.000
0.000   0.000   1.000   0.000
0.000   0.000   1.000   0.000
0.830   0.000   0.000   0.170
0.700   0.030   0.000   0.270
0.270   0.000   0.000   0.730
0.200   0.270   0.000   0.530
0.030   0.940   0.000   0.030
0.000   0.560   0.070   0.370
0.170   0.470   0.300   0.060
lingfeiwang commented 10 months ago

Hi pchiang5,

The multi-TF motif functionality is now ready in this commit. You can first install Dictys 1.0.0 and then upgrade with pip3 install --no-deps --force-reinstall git+https://github.com/pinellolab/dictys@ef1436c in the conda environment.

Could you try this version and let us know how it works out? I also suggest trying different motif files, either from other databases or your current one after further refinement, to see which gives the most relevant biology. I feel the best GRN is inferred from a motif file with the right balance of motif quantity, quality, and specificity.

Lingfei

pchiang5 commented 10 months ago

Hi Lingfei,

This database yielded reasonable output with the SCENIC+ workflow, so I wonder whether Dictys could complement my analysis using the same database.

Still, the database seems so big that the run has remained at chromatin homer for over 24 hours. Let us see if it gets to the next step soon.

lingfeiwang commented 10 months ago

Hi @pchiang5,

I tested your motif file with the short-multiome tutorial, using only the first 10,000 lines. The chromatin homer command finished in two minutes. You may want to double-check the log odds detection thresholds against http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html. The values you obtained from universalmotif do not seem to match the official computation method, whether by sequence or by PWM. Thresholds that are too strong can return no results and finish the run right away, while thresholds that are too weak can make the run extremely slow; you may have a mix of both.
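One quick way to spot an unattainable threshold is to compare it with the best score the matrix can possibly achieve. Under a natural-log, uniform-background scoring convention (an assumption here; HOMER's exact convention may differ), the maximum is the sum over positions of ln(p_max / 0.25), so a 7-bp motif can never score above 7·ln(4) ≈ 9.70. Using the RBPJ matrix pasted earlier in this thread:

```python
# Illustrative check: the maximum attainable log-odds score of a PWM
# against a uniform background (natural log -- an assumed convention).
# A stated threshold above this value can never fire.
import math

def max_log_odds(matrix):  # matrix: list of [pA, pC, pG, pT] rows
    return sum(math.log(max(row) / 0.25) for row in matrix)

# RBPJ matrix from the converted file (stated threshold: 10.932)
rbpj = [[0.5, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.5, 0.0, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.5, 0.0]]
print(round(max_log_odds(rbpj), 3))  # -> 7.625
```

If this convention applies, even a perfect match scores well below the stated 10.932, consistent with Lingfei's suspicion that the thresholds are miscomputed.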

Lingfei

pchiang5 commented 10 months ago

Thank you @lingfeiwang!

Yes, the step has been stuck for 48 hours already. I will try to find out what is going wrong with my conversion.