leonjessen / tensorflow_rstudio_example


sequence oriented model? #3

Open jjallaire opened 6 years ago

jjallaire commented 6 years ago

@leonjessen I tried this variation of the model that treats the peptides as sequences (this is under the assumption that the order of the residues within the peptides actually matters; this may or may not be the case, you would know, not I!).


library(keras)
library(tidyverse)
library(ggseqlogo)
library(PepTools)

pep_file = get_file("ran_peps_netMHCpan40_predicted_A0201_reduced_cleaned_balanced.tsv", 
                    origin = "https://git.io/vb3Xa") 
pep_dat  = read_tsv(file = pep_file)

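# Encode each peptide as a vector of integer codes, offset from "A"
# ("A" -> 0, ..., "Y" -> 24), one row per peptide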
pep_encode_sequence <- function(peptides) {
  A <- as.integer(charToRaw("A"))
  t(sapply(peptides, function(peptide) as.integer(charToRaw(peptide)) - A))
}

x_train = pep_dat %>% filter(data_type == 'train') %>% pull(peptide) %>% pep_encode_sequence
y_train = pep_dat %>% filter(data_type == 'train') %>% pull(label_num) %>% array
x_test  = pep_dat %>% filter(data_type == 'test')  %>% pull(peptide)   %>% pep_encode_sequence
y_test  = pep_dat %>% filter(data_type == 'test')  %>% pull(label_num) %>% array

y_train = to_categorical(y_train, num_classes = 3)
y_test  = to_categorical(y_test,  num_classes = 3)

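# Embed the 9 integer-coded positions (values 0-24, hence input_dim = 25),
# then run an LSTM over the embedded sequence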
model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = 25, input_length = 9, output_dim = 32) %>% 
  layer_lstm(units = 32, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 3, activation   = 'softmax')

model %>% compile(
  loss      = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics   = c('accuracy')
)

# gets to ~ 94.5% validation accuracy
history = model %>% fit(
  x_train, y_train, 
  epochs = 500, batch_size = 50, validation_split = 0.2
)

model %>% evaluate(x_test, y_test)
## $loss
## [1] 0.1397115

## $acc
## [1] 0.9393939

It looks to me like a slight improvement; however, this could just be the sequence-oriented model reducing to the computational equivalent of the MLP model.

I also don't know whether the embedding layer + LSTM is the "right" way to approach sequence data of this size and nature (I am a complete amateur in these matters, so you should defer to your more experienced colleagues as to which sequence-based architectures are worth exploring, if in fact the sequence has any meaning in the first place).
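For reference, here is a rough sketch of the kind of flattened-input MLP I'm comparing against (assuming pep_encode() from PepTools returns an n x 9 x 20 array as described in the post; the layer sizes here are illustrative rather than the exact ones from the post):

x_train_flat <- pep_dat %>% filter(data_type == 'train') %>% pull(peptide) %>% pep_encode()
x_train_flat <- array_reshape(x_train_flat, c(nrow(x_train_flat), 9 * 20))

# Plain MLP on the flattened 9 x 20 = 180-dimensional encoding
mlp <- keras_model_sequential() %>%
  layer_dense(units = 180, activation = 'relu', input_shape = 180) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 90, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 3, activation = 'softmax')

mlp %>% compile(
  loss      = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics   = c('accuracy')
)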

leonjessen commented 6 years ago

Super!

jjallaire commented 6 years ago

Some nice reaction so far on Twitter: https://twitter.com/search?l=&q=%22https%3A%2F%2Ftensorflow.rstudio.com%2Fblog%2Fdl-for-cancer-immunotherapy.html%22&src=typd&lang=en

leonjessen commented 6 years ago

Very nice!

Do you think that perhaps we could add some sort of link to my twitter profile to the top of the post? I use twitter for academic networking and to stay up to date on R and data science.

...and thank you for this unique opportunity @jjallaire - I really enjoyed the collaboration!

leonjessen commented 6 years ago

Perfect!

jjallaire commented 6 years ago

Link added!


leonjessen commented 6 years ago

Will your keynote on Saturday be recorded?

jjallaire commented 6 years ago

Yup, it will be live streamed and then available on YouTube after that.

In preparing the presentation I found that I had so much content to get through that I couldn't dwell on any given example for too long (other than the MNIST "hello, world"). I have your article as a slide amongst a handful of other examples for people to explore further:

[Screenshot of the slide]

leonjessen commented 6 years ago

Very nice - I will be looking forward to seeing your talk! Since I am 9 hours ahead of San Diego time, I think I am going for the YouTube version though... I just think it's cool that you'll mention our work, even if only briefly. Btw. I just taught my first university class in data analysis in R yesterday (Monday) - naturally, the first thing I had the students do was install RStudio :-)

leonjessen commented 6 years ago

Managed to move some things around, so I can watch your keynote LIVE tomorrow @jjallaire - Looking forward!

leonjessen commented 6 years ago

Thanks for a great talk @jjallaire ! And I am definitely looking forward to the deep learning workshops on #RStudioconf 2019! And once again, thanks for giving me this opportunity - Much appreciated! :-)

jjallaire commented 6 years ago

You bet, thank you very much for your contribution!!!

leonjessen commented 6 years ago

Sure thing - Anytime!

leonjessen commented 4 years ago

Hi @jjallaire ,

People have been contacting me regarding a shape error. I think we need to add a small sentence on altering the reshaping when running the CNN mentioned at the end of the post. The error people are getting can be fixed like so:

From:

x_train <- array_reshape(x_train, c(nrow(x_train), 9 * 20))
x_test  <- array_reshape(x_test,  c(nrow(x_test), 9 * 20))

To:

x_train <- array_reshape(x_train, c(nrow(x_train), 9, 20, 1))
x_test  <- array_reshape(x_test, c(nrow(x_test), 9, 20, 1))
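
For context: a 2-D convolution expects a 4-D input of shape (samples, rows, cols, channels), which is why the trailing 1 is needed. A rough sketch of such a CNN input (the filter count and kernel size are illustrative, not necessarily those used in the post):

cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = 'relu',
                input_shape = c(9, 20, 1)) %>%   # the 9 x 20 peptide "image", 1 channel
  layer_flatten() %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 3, activation = 'softmax')
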
jjallaire commented 4 years ago

Thanks for letting me know, change made! If you have further tweaks at any point feel free to send a PR here: https://github.com/rstudio/tensorflow-blog/tree/master/_posts/2018-01-29-dl-for-cancer-immunotherapy

leonjessen commented 4 years ago

Ah, I see... will do. Just quickly for now, so it works again: the original reshaping for the FFN was fine; it's when we compare with the CNN at the end of the post that we need to add the alternate reshaping.

So for the FFN:

x_train <- array_reshape(x_train, c(nrow(x_train), 9 * 20))
x_test  <- array_reshape(x_test,  c(nrow(x_test), 9 * 20))

and then add at the end for the CNN:

x_train <- array_reshape(x_train, c(nrow(x_train), 9, 20, 1))
x_test  <- array_reshape(x_test, c(nrow(x_test), 9, 20, 1))
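
A quick dimension check shows the difference (assuming pep_encode() returns an n x 9 x 20 array, as in the post):

x_raw <- pep_dat %>% filter(data_type == 'train') %>% pull(peptide) %>% pep_encode()
dim(x_raw)                                           # n x 9 x 20
dim(array_reshape(x_raw, c(nrow(x_raw), 9 * 20)))    # n x 180        -> FFN input
dim(array_reshape(x_raw, c(nrow(x_raw), 9, 20, 1)))  # n x 9 x 20 x 1 -> CNN input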

Cheers

jjallaire commented 4 years ago

Not sure I completely follow the additional change required. Could you give me a PR for that just to make sure we get it right?
