getting mostly NaN values for affinity calculations

chrisclarkson commented 5 years ago

Hi, I am interested in using your package to calculate the affinity of predicted binding sites and, subsequently, the significance of the affinity values.

My pipeline (using your instructions from https://rdrr.io/github/matthuska/tRap/man/) is as follows:

pfm
   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A  2  3  5  1  0 13  1  3 18  0  6  1  0  1  1  6  1  2  6  2
C  8  3  2 17 19  0 12 10  1  0  0  1  0  1 18  0 12  7  6  5
G  2 10 12  1  0  5  7  1  1 20 14 14 20 18  0 14  5  3  7  7
T  8  4  2  1  0  2  0  6  0  0  0  5  0  1  1  0  1  8  1  6

pwm <- toPWM(pfm)
pwm=PWMatrix(ID="Unknown", name=tf, matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=pwm, pseudocounts=numeric())
peaks = searchSeq(pwm, seq, min.score = "80%",mc.cores=10L)
peaks_bed = as(peaks, "GRanges")

head(as.data.frame(peaks_bed$siteSeqs)$x)
[1] "AGCCCACTAGGGTGCAGTCC" "ATACCAGAAGAAGGCATCAG" "ACACCAGAAGAGGGCGTCAG"
[4] "ATGCCACGAGGTGGAGATAA" "GACTCACTAGAGGGCACAGG" "TCTACAGCAGGTGGCAACAC"

af=affinity(normalize.pwm(pwm@profileMatrix), as.data.frame(peaks_bed$siteSeqs)$x)

However, this results in many NaN values:

sum(af=='NaN')/length(af)
[1] 0.4785195

I was advised that tRap is used on long sequences rather than short ones so I extended the sequences:

start(peaks_bed) <- start(peaks_bed) - 30
end(peaks_bed) <- end(peaks_bed) + 30
extended_seqs <- getSeq(Mmusculus, peaks_bed)
af_ext=affinity(normalize.pwm(pwm@profileMatrix),as.data.frame(extended_seqs)$x)

However, this results in 100% NaNs.

I'm wondering if I am doing something wrong?

matthiasheinig commented 5 years ago

Hi Chris,

Do you have zeros in your normalized PWM? We recommend adding small pseudocounts to your frequency matrix.

Best, Matthias

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDirig'in Petra Steiner-Hoffmann Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671

chrisclarkson commented 5 years ago

Hi Matthias, thank you for your quick reply. There are not many zeros in the normalised matrix:

normalize.pwm(pwm@profileMatrix)
           1          2           3          4          5          6
A  1.0626965  0.9503723  0.04448988  0.4578354  0.3879632 -0.2923233
C -0.5626965  0.9503723  0.85949647 -0.3735062 -0.1638895  1.0223918
G  1.0626965 -1.3188119 -0.76348283  0.4578354  0.3879632  0.0000000
T -0.5626965  0.4180672  0.85949647  0.4578354  0.3879632  0.2699315
            7          8          9         10          11          12
A  0.41349136  0.4404797 -0.2536981  0.3870730 -0.03296475  0.71519479
C -0.24047408 -0.6112445  0.2969491  0.3870730  0.61061995  0.71519479
G -0.09176563  1.3303426  0.2969491 -0.1612191 -0.18827516 -0.45258182
T  0.91874835 -0.1595778  0.6597998  0.3870730  0.61061995  0.02219224
          13         14         15          16          17         18
A  0.3870730  0.4538871  0.2969491 -0.03296475  0.75263253  1.5229892
C  0.3870730  0.4538871 -0.2536981  0.61061995 -0.47909620 -0.5761614
G -0.1612191 -0.3616612  0.6597998 -0.18827516 -0.02616885  0.8595932
T  0.3870730  0.4538871  0.2969491  0.61061995  0.75263253 -0.8064209
          19         20
A -0.2228909  2.3968502
C -0.2228909  0.0000000
G -0.4123795 -0.9067515
T  1.8581614 -0.4900988

When I try the following, I still get NaNs:

affinity(normalize.pwm(pwm@profileMatrix)+0.0000000000000001, as.data.frame(tmp)$x)
[1] NaN
Warning messages:
1: In log(maxCG/p[3]) : NaNs produced
2: In log(maxAT/p) : NaNs produced

matthiasheinig commented 5 years ago

The PWM you pass to the affinity function should be a probability matrix, with all values in the range [0, 1].
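A small base-R illustration of why negative entries are a problem. It assumes, based only on the warnings quoted in the thread (`log(maxCG/p[3])` and `log(maxAT/p)`), that the affinity computation takes logarithms of ratios of matrix entries. The helper `is_probability_matrix` is hypothetical and not part of tRap; the numbers are the first two columns of the matrices shown in the thread.

```r
# Hypothetical helper (not part of tRap): TRUE when every entry lies in [0, 1]
# and every column sums to 1.
is_probability_matrix <- function(m, tol = 1e-8) {
  all(m >= 0 & m <= 1) && all(abs(colSums(m) - 1) < tol)
}

# First two columns of the matrix printed in the thread (contains negatives).
printed <- matrix(c( 1.0626965,  0.9503723,
                    -0.5626965,  0.9503723,
                     1.0626965, -1.3188119,
                    -0.5626965,  0.4180672),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("A", "C", "G", "T"), NULL))

# First two columns of the raw PFM, turned into per-column frequencies.
counts <- matrix(c(2,  3,
                   8,  3,
                   2, 10,
                   8,  4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("A", "C", "G", "T"), NULL))
probs <- sweep(counts, 2, colSums(counts), "/")

is_probability_matrix(printed)  # FALSE
is_probability_matrix(probs)    # TRUE

# A ratio involving a negative entry has no real logarithm, so R returns NaN.
ratio_log <- suppressWarnings(log(max(printed[, 1]) / printed[2, 1]))
is.nan(ratio_log)               # TRUE
```

Running a check like this before calling the affinity function would flag a log-odds style matrix immediately, instead of surfacing later as NaN results.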

chrisclarkson commented 5 years ago

Hi, sorry for the delay. Hmm, very strange. I just took the PFM from the JASPAR 2018 database and ran TFBSTools::toPWM on it, which results in a matrix like the one shown above. Can you recommend a package or command that performs the PFM-to-PWM conversion in the way your package needs?

matthiasheinig commented 5 years ago

Hi Chris,

I would recommend using normalize.pwm from the tRap package directly on your PFM (and adding a pseudocount of 0.25):

pwm.for.trap <- normalize.pwm(pfm + 0.25)
affinity(pwm.for.trap)
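To make the effect of the pseudocount concrete, here is a base-R sketch on the PFM from the first post. The column normalisation with `sweep` is an assumption used purely for illustration; it is not claimed to be what `normalize.pwm` does internally.

```r
# The PFM from the first post (rows A, C, G, T; 20 positions).
pfm <- matrix(c(
  2,  3,  5,  1,  0, 13,  1,  3, 18,  0,  6,  1,  0,  1,  1,  6,  1,  2,  6,  2,
  8,  3,  2, 17, 19,  0, 12, 10,  1,  0,  0,  1,  0,  1, 18,  0, 12,  7,  6,  5,
  2, 10, 12,  1,  0,  5,  7,  1,  1, 20, 14, 14, 20, 18,  0, 14,  5,  3,  7,  7,
  8,  4,  2,  1,  0,  2,  0,  6,  0,  0,  0,  5,  0,  1,  1,  0,  1,  8,  1,  6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), as.character(1:20)))

sum(pfm == 0)             # number of zero cells in the raw counts

smoothed <- pfm + 0.25    # pseudocount suggested in the thread
probs <- sweep(smoothed, 2, colSums(smoothed), "/")  # assumed column normalisation

min(probs) > 0            # TRUE: no zero probabilities remain
range(colSums(probs))     # every column sums to 1
```

Because every cell receives the same 0.25, positions with few total counts are smoothed more strongly than positions with many.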

Best, Matthias

chrisclarkson commented 5 years ago

Hi again Matthias,

Thank you for this fantastic help. It works now. Just two last questions:

1. I would also like to apply this analysis to more than one transcription factor. Suppose I download a list of PFMs from JASPAR:

ARNT
  [,1] [,2] [,3] [,4] [,5] [,6]
A    4   19    0    0    0    0
C   16    0   20    0    0    0
G    0    1    0   20    0   20
T    0    0    0    0   20    0

AHR
  [,1] [,2] [,3] [,4] [,5] [,6]
A    3    0    0    0    0    0
C    8    0   23    0    0    0
G    2   23    0   23    0   24
T   11    1    1    1   24    0

Ddit3::Cebpa
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
A   14   11   18    0    0    4   38   36    0    14     4     0
C    7    7    3    1    0   33    1    2    6    17    23    26
G   12   14   15    0   38    0    0    1    0     5     9     6
T    6    7    3   38    1    2    0    0   33     3     3     7
...

Is it a suitable strategy to add 0.25 to all of these PFMs and then run your analysis on them?
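The strategy described here can be sketched in base R by keeping the PFMs in a named list and applying the same step to each. Only the first two matrices from the post are used; `add_pseudocount` is a hypothetical helper, and the tRap calls themselves are left out because their exact arguments are not confirmed in this thread.

```r
acgt <- c("A", "C", "G", "T")

# Two of the PFMs listed above, stored in a named list.
pfms <- list(
  ARNT = matrix(c( 4, 19,  0,  0,  0,  0,
                  16,  0, 20,  0,  0,  0,
                   0,  1,  0, 20,  0, 20,
                   0,  0,  0,  0, 20,  0),
                nrow = 4, byrow = TRUE, dimnames = list(acgt, NULL)),
  AHR  = matrix(c( 3,  0,  0,  0,  0,  0,
                   8,  0, 23,  0,  0,  0,
                   2, 23,  0, 23,  0, 24,
                  11,  1,  1,  1, 24,  0),
                nrow = 4, byrow = TRUE, dimnames = list(acgt, NULL))
)

# Hypothetical helper: add the same pseudocount to every cell of one PFM.
add_pseudocount <- function(pfm, pseudo = 0.25) pfm + pseudo

smoothed <- lapply(pfms, add_pseudocount)

names(smoothed)          # the factor names are preserved
sapply(smoothed, min)    # 0.25 for both: no zero cells remain
```

Keeping the names on the list makes it easy to trace each downstream result back to its transcription factor.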

2. As for local.paffinity, I tried it on the calculated affinity values:

seqs=as.data.frame(extended_seqs)$x
head(extended_seqs)
  A DNAStringSet instance of length 6
    width seq
[1]    79 TACGTAAGTACACTGTAGCTGTCTTCAGACACAC...TCAGATCTCATTATGGGTAGTTGTGAGCTACCA
[2]    79 TTTTACTTTCTCTCTCCCTCTTATTGCTAGATGC...ATAAACAGCTTGCTTCTGCCATGTTCTGCAGAA
[3]    79 GACATCTGAGTACCTTCCCTGTAAGAGAGCTTGC...CTGAGCACTGAAACTCAGAGGAGAGAATCTGTC

af_ext=affinity(normalize.pwm(pwm.for.trap@profileMatrix),seqs)

head(af_ext)
[1] 10.586463 12.458601 10.153033  7.571788  9.838501 10.966423

for(i in seq_along(af_ext)){
         print(local.paffinity(af_ext[i],pwm.for.trap,seqs[i]))
}

[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
........

The p-value is 0.01612903 in every case. Can I take these values as correct, or am I not applying this function correctly?

matthiasheinig commented 5 years ago

Hi Chris,

Re 1: I looked at the code again (I should have done this first!). Actually, you can also pass the unnormalized PWM (counts >= 0) to the affinity function. The pseudocount is an argument of the affinity function, so there is no need to call normalize.pwm or to add pseudocounts yourself.

Re 2: the local.paffinity function takes these arguments:

- the actual affinity computed from the actual sequences
- a long background sequence from which to compute the background affinities
- a window size, which should be set to the size of the sequence used to compute the actual affinity
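The general idea behind these three arguments can be sketched in base R as an empirical p-value: score every window of a long background sequence, then ask how often the background scores reach the observed one. Everything below is an assumption for illustration and not tRap's implementation: `toy_score` (a GC count) stands in for the affinity, and `windows`, `empirical_p`, the random background, and the target sequence are all made up.

```r
# Hypothetical stand-in for an affinity score: GC count of a sequence.
toy_score <- function(s) sum(strsplit(s, "")[[1]] %in% c("G", "C"))

# Cut a long background sequence into all overlapping windows of width w.
windows <- function(bg, w) {
  starts <- seq_len(nchar(bg) - w + 1)
  substring(bg, starts, starts + w - 1)
}

# Empirical p-value with a +1 correction (assumed, not taken from tRap).
empirical_p <- function(observed, background_scores) {
  (1 + sum(background_scores >= observed)) / (1 + length(background_scores))
}

set.seed(1)
background <- paste(sample(c("A", "C", "G", "T"), 2000, replace = TRUE),
                    collapse = "")
target <- "GCGCGGCCGCGGGCCGCGCC"   # made-up 20 bp sequence

# Window size equals the length of the sequence that produced the observed score.
bg_scores <- sapply(windows(background, nchar(target)), toy_score)
empirical_p(toy_score(target), bg_scores)

# With only 61 background values, the smallest attainable p-value is 1/62.
empirical_p(Inf, seq_len(61))      # 0.01612903
```

Under this kind of scheme, a short background caps how small the p-value can get. The floor of 1/62 happens to equal the constant value reported earlier in the thread, which would be consistent with too few background windows, but whether tRap computes its p-value this way is not confirmed here.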

Best, Matthias

chrisclarkson commented 5 years ago

Do you mean that I should use the unnormalised PFMs, and not the PWMs, since those have values < 0 (as shown above)?

matthuska / tRap

getting mostly NaN values for affinity calculations #3