matthuska / tRap

R package implementing the TRAP method to predict transcription factor binding affinity to DNA sequences
GNU Lesser General Public License v2.1
8 stars 5 forks source link

getting mostly NaN values for affinity calculations #3

Open chrisclarkson opened 5 years ago

chrisclarkson commented 5 years ago

Hi, I am interested in using your package to calculating the affinity of predicted binding sites and subsequently the significance of the affinity values.

My pipeline (using your instructions from https://rdrr.io/github/matthuska/tRap/man/) is as follows:

pfm
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A 2  3  5  1  0 13  1  3 18  0  6  1  0  1  1  6  1  2  6  2
C 8  3  2 17 19  0 12 10  1  0  0  1  0  1 18  0 12  7  6  5
G 2 10 12  1  0  5  7  1  1 20 14 14 20 18  0 14  5  3  7  7
T 8  4  2  1  0  2  0  6  0  0  0  5  0  1  1  0  1  8  1  6

pwm <- toPWM(pfm)
pwm=PWMatrix(ID="Unknown", name=tf, matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=pwm, pseudocounts=numeric())
peaks = searchSeq(pwm, seq, min.score = "80%",mc.cores=10L)
peaks_bed = as(peaks, "GRanges")

head(as.data.frame(peaks_bed$siteSeqs)$x)
[1] "AGCCCACTAGGGTGCAGTCC" "ATACCAGAAGAAGGCATCAG" "ACACCAGAAGAGGGCGTCAG"
[4] "ATGCCACGAGGTGGAGATAA" "GACTCACTAGAGGGCACAGG" "TCTACAGCAGGTGGCAACAC"

af=affinity(normalize.pwm(pwm@profileMatrix), as.data.frame(peaks_bed$siteSeqs)$x)

However this results in a many NaN values....

sum(af=='NaN')/length(af)
[1] 0.4785195

I was advised that tRap is used on long sequences rather than short ones so I extended the sequences:

start(peaks_bed) <- start(peaks_bed) - 30
end(peaks_bed) <- end(peaks_bed) + 30
extended_seqs <- getSeq(Mmusculus, peaks_bed)
af_ext=affinity(normalize.pwm(pwm@profileMatrix),as.data.frame(extended_seqs)$x)

However this results in 100% NaNs....

I'm wondering if I am doing something wrong?

matthiasheinig commented 5 years ago

Hi Chris,

do you have zeros in your normalized PWM? We recommend adding small pseudo counts to your frequency matrix..

Best, Matthias

Am 10.01.2019 um 11:36 schrieb Chris Clarkson notifications@github.com:

Hi, I am interested in using your package to calculating the affinity of predicted binding sites and subsequently the significance of the affinity values.

My pipeline (using your instructions from https://rdrr.io/github/matthuska/tRap/man/ https://rdrr.io/github/matthuska/tRap/man/) is as follows:

pfm 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 A 2 3 5 1 0 13 1 3 18 0 6 1 0 1 1 6 1 2 6 2 C 8 3 2 17 19 0 12 10 1 0 0 1 0 1 18 0 12 7 6 5 G 2 10 12 1 0 5 7 1 1 20 14 14 20 18 0 14 5 3 7 7 T 8 4 2 1 0 2 0 6 0 0 0 5 0 1 1 0 1 8 1 6

pwm <- toPWM(pfm) pwm=PWMatrix(ID="Unknown", name=tf, matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=pwm, pseudocounts=numeric()) peaks = searchSeq(pwm, seq, min.score = "80%",mc.cores=10L) peaks_bed = as(peaks, "GRanges")

head(as.data.frame(peaks_bed$siteSeqs)$x) [1] "AGCCCACTAGGGTGCAGTCC" "ATACCAGAAGAAGGCATCAG" "ACACCAGAAGAGGGCGTCAG" [4] "ATGCCACGAGGTGGAGATAA" "GACTCACTAGAGGGCACAGG" "TCTACAGCAGGTGGCAACAC"

af=affinity(normalize.pwm(pwm@profileMatrix), as.data.frame(peaks_bed$siteSeqs)$x) However this results in a many NaN values....

sum(af=='NaN')/length(af) [1] 0.4785195 I was advised that tRap is used on long sequences rather than short ones so I extended the sequences:

start(peaks_bed) <- start(peaks_bed) - 30 end(peaks_bed) <- end(peaks_bed) + 30 extended_seqs <- getSeq(Mmusculus, peaks_bed) af_ext=affinity(normalize.pwm(pwm@profileMatrix),as.data.frame(extended_seqs)$x)

However this results in 100% NaNs....

I'm wondering if I am doing something wrong?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/matthuska/tRap/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/AIHrHuHLhvR5OL5vxXhUlIcxqhTEVLXQks5vBxfAgaJpZM4Z5OGv.

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDirig'in Petra Steiner-Hoffmann Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671

chrisclarkson commented 5 years ago

Hi Matthias, Thank you for your quick reply: There are not many zeros in the normalised matrix:

normalize.pwm(pwm@profileMatrix)
           1          2           3          4          5          6
A  1.0626965  0.9503723  0.04448988  0.4578354  0.3879632 -0.2923233
C -0.5626965  0.9503723  0.85949647 -0.3735062 -0.1638895  1.0223918
G  1.0626965 -1.3188119 -0.76348283  0.4578354  0.3879632  0.0000000
T -0.5626965  0.4180672  0.85949647  0.4578354  0.3879632  0.2699315
            7          8          9         10          11          12
A  0.41349136  0.4404797 -0.2536981  0.3870730 -0.03296475  0.71519479
C -0.24047408 -0.6112445  0.2969491  0.3870730  0.61061995  0.71519479
G -0.09176563  1.3303426  0.2969491 -0.1612191 -0.18827516 -0.45258182
T  0.91874835 -0.1595778  0.6597998  0.3870730  0.61061995  0.02219224
          13         14         15          16          17         18
A  0.3870730  0.4538871  0.2969491 -0.03296475  0.75263253  1.5229892
C  0.3870730  0.4538871 -0.2536981  0.61061995 -0.47909620 -0.5761614
G -0.1612191 -0.3616612  0.6597998 -0.18827516 -0.02616885  0.8595932
T  0.3870730  0.4538871  0.2969491  0.61061995  0.75263253 -0.8064209
          19         20
A -0.2228909  2.3968502
C -0.2228909  0.0000000
G -0.4123795 -0.9067515
T  1.8581614 -0.4900988

when I try the following, I still get NaNs:

affinity(normalize.pwm(pwm@profileMatrix)+0.0000000000000001, as.data.frame(tmp)$x)
[1] NaN
Warning messages:
1: In log(maxCG/p[3]) : NaNs produced
2: In log(maxAT/p) : NaNs produced
matthiasheinig commented 5 years ago

The PWM you put into the affinity function should be a probability matrix with values between [0-1]

Am 10.01.2019 um 12:50 schrieb Chris Clarkson notifications@github.com:

Hi Matthias, Thank you for your quick reply: There are no zeros in the normalised matrix:

normalize.pwm(pwm@profileMatrix) 1 2 3 4 5 6 A 1.0626965 0.9503723 0.04448988 0.4578354 0.3879632 -0.2923233 C -0.5626965 0.9503723 0.85949647 -0.3735062 -0.1638895 1.0223918 G 1.0626965 -1.3188119 -0.76348283 0.4578354 0.3879632 0.0000000 T -0.5626965 0.4180672 0.85949647 0.4578354 0.3879632 0.2699315 7 8 9 10 11 12 A 0.41349136 0.4404797 -0.2536981 0.3870730 -0.03296475 0.71519479 C -0.24047408 -0.6112445 0.2969491 0.3870730 0.61061995 0.71519479 G -0.09176563 1.3303426 0.2969491 -0.1612191 -0.18827516 -0.45258182 T 0.91874835 -0.1595778 0.6597998 0.3870730 0.61061995 0.02219224 13 14 15 16 17 18 A 0.3870730 0.4538871 0.2969491 -0.03296475 0.75263253 1.5229892 C 0.3870730 0.4538871 -0.2536981 0.61061995 -0.47909620 -0.5761614 G -0.1612191 -0.3616612 0.6597998 -0.18827516 -0.02616885 0.8595932 T 0.3870730 0.4538871 0.2969491 0.61061995 0.75263253 -0.8064209 19 20 A -0.2228909 2.3968502 C -0.2228909 0.0000000 G -0.4123795 -0.9067515 T 1.8581614 -0.4900988 — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/matthuska/tRap/issues/3#issuecomment-453069242, or mute the thread https://github.com/notifications/unsubscribe-auth/AIHrHiWrbvdHBb8gUD9wpxql8GFE3g69ks5vBykKgaJpZM4Z5OGv.

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDirig'in Petra Steiner-Hoffmann Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671

chrisclarkson commented 5 years ago

Hi sorry for delay, Hmm very strange.... I just got the PFM from the jaspar 2018 database and then TFBSTools::toPWM command which results in a matrix like the one seen above... Can you recommend a package/ command that could perform the conversion (PFM>PWM) in the way that is necessary for your package to work?

matthiasheinig commented 5 years ago

Hi Chris,

I would recommend to use normalize.pwm from the tRap package directly on your PFM (and add a pseudo count of 0.25)

pwm.for.trap <- normalize.pwm(pfm + 0.25) affinity(pwm.for.trap)

Best, Matthias

Am 10.01.2019 um 23:26 schrieb Chris Clarkson notifications@github.com:

Hi sorry for delay, Hmm very strange.... I just got the PFM from the jaspar 2018 database and then TFBSTools::toPWM command which results in a matrix like the one seen above... Can you recommend a package/ command that could perform the conversion (PFM>PWM) in the way that is necessary for your package to work?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/matthuska/tRap/issues/3#issuecomment-453279092, or mute the thread https://github.com/notifications/unsubscribe-auth/AIHrHmF1snaakiE0i4svmEBNDrjRdcqIks5vB74IgaJpZM4Z5OGv.

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDirig'in Petra Steiner-Hoffmann Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671

chrisclarkson commented 5 years ago

Hi again Matthias, Thank you for this fantastic help. It works now Just 2 last questions: 1. I would also like to apply this analysis to more than one transcription factor- hence if I download a list of PFMs from Jaspar:

ARNT
  [,1] [,2] [,3] [,4] [,5] [,6]
A    4   19    0    0    0    0
C   16    0   20    0    0    0
G    0    1    0   20    0   20
T    0    0    0    0   20    0

AHR
  [,1] [,2] [,3] [,4] [,5] [,6]
A    3    0    0    0    0    0
C    8    0   23    0    0    0
G    2   23    0   23    0   24
T   11    1    1    1   24    0

Ddit3::Cebpa
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
A   14   11   18    0    0    4   38   36    0    14     4     0
C    7    7    3    1    0   33    1    2    6    17    23    26
G   12   14   15    0   38    0    0    1    0     5     9     6
T    6    7    3   38    1    2    0    0   33     3     3     7
...

Can I take it as a suitable strategy to add 0.25 to all of these PFMs and then implement your analysis on them?

  1. As for the command local.paffinity, I tried it on the calculated affinity values:
seqs=as.data.frame(extended_seqs)$x
head(seqs)
  A DNAStringSet instance of length 6
    width seq
[1]    79 TACGTAAGTACACTGTAGCTGTCTTCAGACACAC...TCAGATCTCATTATGGGTAGTTGTGAGCTACCA
[2]    79 TTTTACTTTCTCTCTCCCTCTTATTGCTAGATGC...ATAAACAGCTTGCTTCTGCCATGTTCTGCAGAA
[3]    79 GACATCTGAGTACCTTCCCTGTAAGAGAGCTTGC...CTGAGCACTGAAACTCAGAGGAGAGAATCTGTC

head(af_ext)
[1] 10.586463 12.458601 10.153033  7.571788  9.838501 10.966423

af_ext=affinity(normalize.pwm(pwm.for.trap@profileMatrix),seqs)

for(i in c(1:length(af_ext))){
         print(local.paffinity(af_ext[i],pwm.for.trap,seqs[i]))
}

[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
........

The p-values is 0.01612903 in every case.... Can I take these values as correct or am I not applying this function correctly?

matthiasheinig commented 5 years ago

Hi Chris,

re1: I looked at the code again - should have done this first! Actually you can also pass the unnormalized PWM (counts >= 0) to the affinity function. The pseudo count is an argument of the affinity function. So no need to call normalize.pwm and add pseudo counts..

re2: the local.paffinity function takes arguments:

Best, Matthias

Am 11.01.2019 um 11:55 schrieb Chris Clarkson notifications@github.com:

Hi again Matthias, Thank you for this fantastic help. It works now Just 2 last questions: 1. I would also like to apply this analysis to more than one transcription factor- hence if I download a list of PFMs from Jaspar:

ARNT [,1] [,2] [,3] [,4] [,5] [,6] A 4 19 0 0 0 0 C 16 0 20 0 0 0 G 0 1 0 20 0 20 T 0 0 0 0 20 0

AHR [,1] [,2] [,3] [,4] [,5] [,6] A 3 0 0 0 0 0 C 8 0 23 0 0 0 G 2 23 0 23 0 24 T 11 1 1 1 24 0

Ddit3::Cebpa [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] A 14 11 18 0 0 4 38 36 0 14 4 0 C 7 7 3 1 0 33 1 2 6 17 23 26 G 12 14 15 0 38 0 0 1 0 5 9 6 T 6 7 3 38 1 2 0 0 33 3 3 7 ... Can I take it as a suitable strategy to add 0.25 to all of these PFMs and then implement your analysis on them?

As for the command local.paffinity, I tried it on the calculated affinity values: seqs=as.data.frame(extended_seqs)$x head(extended_seqs) A DNAStringSet instance of length 6 width seq [1] 79 TACGTAAGTACACTGTAGCTGTCTTCAGACACAC...TCAGATCTCATTATGGGTAGTTGTGAGCTACCA [2] 79 TTTTACTTTCTCTCTCCCTCTTATTGCTAGATGC...ATAAACAGCTTGCTTCTGCCATGTTCTGCAGAA [3] 79 GACATCTGAGTACCTTCCCTGTAAGAGAGCTTGC...CTGAGCACTGAAACTCAGAGGAGAGAATCTGTC

head(af_ext) [1] 10.586463 12.458601 10.153033 7.571788 9.838501 10.966423

af_ext=affinity(normalize.pwm(pwm.fro.trap@profileMatrix),seqs)

for(i in c(1:length(af_ext))){ print(local.paffinity(af_ext[i],pwm.for.trap,seqs[i])) }

[1] 0.01612903 [1] 0.01612903 [1] 0.01612903 [1] 0.01612903 [1] 0.01612903 ........ The p-values is 0.01612903 in every case.... Can I take these values as correct or am I not applying this function correctly?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/matthuska/tRap/issues/3#issuecomment-453480519, or mute the thread https://github.com/notifications/unsubscribe-auth/AIHrHnucBozWW_wPyYaiob9H5rQ7bjqyks5vCG26gaJpZM4Z5OGv.

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDirig'in Petra Steiner-Hoffmann Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671

chrisclarkson commented 5 years ago

Do you mean that I should use the unnormalised PFMs? Not the PWMs as they have values < 0 (as shown above)...