jhkorhonen / MOODS

MOODS: Motif Occurrence Detection Suite

mood-score threshold choice for a given p-value and PWM is different across chromosomes #44

Open osyafinkelberg opened 1 year ago

osyafinkelberg commented 1 year ago

Hello!

I am scanning the human genome for CTCF motifs using a PWM-format matrix:

moods-dna.py --sep ";" -s hg38.fna --p-value 0.0001 -S MA0139.1_pwm -o ctcf_scan

The file hg38.fna contains the sequences for all chromosomes, each starting with a >chr...\n header line. I plotted the MOODS score distribution for each chromosome separately and found that the score threshold differs between chromosomes. For example, for chr1 it is -13.209, while for chr17 it is -12.677.

As far as I understand the threshold selection procedure from the MOODS wiki page, this threshold should depend only on the given PWM and p-value. How can the score thresholds differ between chromosomes?

Thank you very much!

jhkorhonen commented 1 year ago

The log-odds and p-value computation also depends on the background distribution, which intuitively describes what the sequence looks like if there is no coding information in it. See the corresponding wiki page for more information on this.
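To illustrate why the background matters, here is a minimal brute-force sketch in Python. It is not how MOODS computes thresholds internally; it simply enumerates every sequence of the motif length for a tiny made-up score matrix and picks the lowest score whose tail probability under the background stays at or below the p-value. The matrix, the two backgrounds and the function name are all hypothetical.

```python
from itertools import product

def threshold_for_pvalue(matrix, bg, p):
    """Lowest score T with P(score >= T) <= p under an i.i.d. background.

    matrix: 4 rows (A, C, G, T), each a list of per-position scores.
    bg:     4 background probabilities in the order A, C, G, T.
    Brute force over all 4**m sequences, so only usable for tiny motifs.
    """
    m = len(matrix[0])
    dist = {}  # score -> total probability of sequences with that score
    for seq in product(range(4), repeat=m):
        score = round(sum(matrix[b][i] for i, b in enumerate(seq)), 9)
        prob = 1.0
        for b in seq:
            prob *= bg[b]
        dist[score] = dist.get(score, 0.0) + prob

    tail = 0.0
    threshold = float("inf")  # stays inf if even the best score is too likely
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p:
            break
        threshold = score
    return threshold

# Hypothetical 2-column score matrix (rows A, C, G, T), not a real motif.
toy = [[2, 1], [0, -1], [-1, 0], [-2, -2]]

uniform = threshold_for_pvalue(toy, (0.25, 0.25, 0.25, 0.25), 0.1)
gc_rich = threshold_for_pvalue(toy, (0.1, 0.4, 0.4, 0.1), 0.1)
print(uniform, gc_rich)
```

The same matrix and the same p-value give a different threshold for each background, which is the effect seen when the background is estimated separately per chromosome.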

By default, MOODS just looks at the current input sequence at hand and estimates this distribution from the frequencies of the different symbols in the input. If you want to use a consistent background for all sequences, you can use the parameters

--batch --bg pA pC pG pT --lo-bg pA pC pG pT

where pA pC pG pT is the background distribution you want to use (and --batch just optimises the process somewhat). You can in principle estimate this distribution by computing the nucleotide frequencies from the whole human genome, but you may want to consult an actual biologist on what the correct assumption is here.
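As a rough sketch of that estimate, the following Python function counts A, C, G and T over all records of a FASTA file and returns their relative frequencies in the order pA pC pG pT. It assumes that soft-masked (lowercase) bases should count and that N and other ambiguity codes should be ignored; whether that is the right choice is part of the biological question above. The function name is hypothetical.

```python
from collections import Counter

def background_from_fasta(lines):
    """Return (pA, pC, pG, pT) estimated from an iterable of FASTA lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and record headers such as ">chr1".
        if not line or line.startswith(">"):
            continue
        counts.update(line.upper())
    # Only A, C, G, T contribute; N and other symbols are ignored.
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("no A/C/G/T symbols found")
    return tuple(counts[b] / total for b in "ACGT")

# Usage with the file from this thread (reads the whole genome once):
# with open("hg38.fna") as fh:
#     print(*background_from_fasta(fh))
```

The four printed numbers can then be passed after --bg.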

osyafinkelberg commented 1 year ago

Thank you so much for your answer!

So is the --lo-bg flag used both for PWM construction and later, "independently", for the choice of the threshold T?

Did I understand correctly, that if I provide the already computed PWM and --lo-bg parameter, the latter will influence only the threshold T choice but not the PWM (that is all individual scores will be computed using the provided PWM without normalization by the --lo-bg frequencies)?

jhkorhonen commented 1 year ago

With pre-computed PWM and -S, the parameter --lo-bg doesn't do anything, as it is used only for log-odds conversion if that is done.

You want to set --batch --bg in your use case if I understood it correctly. Then the input matrices will not be converted, and the threshold is computed from the given p-value once, using the given background distribution, and that is used for all sequences.
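Put together with the original command from this thread, the invocation could be assembled as in the Python sketch below. Only flags that appear in this thread are used; the helper name and the background values are placeholders, not measured hg38 frequencies.

```python
def build_moods_command(bg, fasta="hg38.fna", matrix="MA0139.1_pwm",
                        p_value="0.0001", out="ctcf_scan"):
    """Build the moods-dna.py argument list with one fixed background.

    bg is (pA, pC, pG, pT). The same background is then used for the
    threshold computation on every sequence in the FASTA file.
    """
    bg_args = [format(x, ".4f") for x in bg]
    return [
        "moods-dna.py", "--sep", ";",
        "-s", fasta,
        "--p-value", p_value,
        "-S", matrix,
        "--batch", "--bg", *bg_args,
        "-o", out,
    ]

# Placeholder background, for illustration only.
cmd = build_moods_command((0.3, 0.2, 0.2, 0.3))
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run if the scan should be launched from Python; --lo-bg is left out because the matrix is supplied with -S.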

Now that I look at this, there is some illogical and poorly documented behaviour regarding this in the moods-dna.py script. Marking this for improvement.

osyafinkelberg commented 1 year ago

Thank you very much!