leonjessen / tensorflow_rstudio_example


sequence oriented model? #3

Open jjallaire opened 6 years ago

jjallaire commented 6 years ago

@leonjessen I tried this variation of the model that treats the peptides as sequences (this is under the assumption that the order of the amino acids within each peptide actually matters; this may or may not be the case, you would know, not I!).


library(keras)
library(tidyverse)
library(ggseqlogo)
library(PepTools)

pep_file = get_file("ran_peps_netMHCpan40_predicted_A0201_reduced_cleaned_balanced.tsv", 
                    origin = "https://git.io/vb3Xa") 
pep_dat  = read_tsv(file = pep_file)

# Encode each peptide as a vector of 0-based integer residue codes
# (offset from 'A') suitable for an embedding layer
pep_encode_sequence <- function(peptides) {
  A <- as.integer(charToRaw("A"))
  t(sapply(peptides, function(peptide) as.integer(charToRaw(peptide)) - A))
}

x_train = pep_dat %>% filter(data_type == 'train') %>% pull(peptide) %>% pep_encode_sequence
y_train = pep_dat %>% filter(data_type == 'train') %>% pull(label_num) %>% array
x_test  = pep_dat %>% filter(data_type == 'test')  %>% pull(peptide)   %>% pep_encode_sequence
y_test  = pep_dat %>% filter(data_type == 'test')  %>% pull(label_num) %>% array

y_train = to_categorical(y_train, num_classes = 3)
y_test  = to_categorical(y_test,  num_classes = 3)

model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = 25, input_length = 9, output_dim = 32) %>% 
  layer_lstm(units = 32, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 3, activation   = 'softmax')

model %>% compile(
  loss      = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics   = c('accuracy')
)

# gets to ~ 94.5% validation accuracy
history = model %>% fit(
  x_train, y_train, 
  epochs = 500, batch_size = 50, validation_split = 0.2
)

model %>% evaluate(x_test, y_test)
## $loss
## [1] 0.1397115

## $acc
## [1] 0.9393939

It looks to me like a slight improvement; however, this could just be the sequence-oriented model reducing to the computational equivalent of the MLP model.

I also don't know whether the embedding layer + LSTM is the "right" way to approach sequence data of this size and nature (I am a complete amateur in these matters, so you should defer to your more experienced colleagues as to which sequence-based architectures are worth exploring, if in fact the sequence has any meaning in the first place).
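
For instance, one sequence-oriented variant that might be worth a try is a bidirectional LSTM, which reads the peptide in both directions. A minimal sketch (the hyper-parameters below are untuned placeholders, not recommendations):

# Hedged sketch: same pipeline as above, but the LSTM reads the
# sequence in both directions; all hyper-parameters are placeholders
model <- keras_model_sequential() %>% 
  layer_embedding(input_dim = 25, input_length = 9, output_dim = 32) %>% 
  bidirectional(layer_lstm(units = 32, dropout = 0.2, recurrent_dropout = 0.2)) %>% 
  layer_dense(units = 3, activation = 'softmax')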

leonjessen commented 6 years ago

I have worked on updating the PepTools package with documentation and additional functions. PepTools now contains a function pep_plot_images(); e.g. pep_plot_images(pep_ran(n = 100, k = 9)) will plot 100 random 9-mer peptides encoded as 'images':

[screenshot: 100 random 9-mer peptides rendered as encoded 'images' by pep_plot_images()]

It is basically a visualisation of the pep_encode() function.

The encoding relies on evolution at the protein level, where the biochemical differences between amino acids make it more or less likely that the protein will "survive" a given mutation. E.g. 'W' very rarely mutates ('W'=>'W' is black in the images). So the darker a cell in the 'images', the less likely the corresponding mutation is. This information is captured in the above "images"; I suppose it is comparable to a QR code. A sparse encoding, where e.g. 'E' simply becomes the number 4, therefore does not capture the essential evolutionary/biochemical information.
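
For contrast, a sparse one-hot encoding of the kind dismissed above might look like the following throwaway sketch (pep_encode_sparse is purely illustrative, not a PepTools function):

# Hypothetical helper for illustration only: each residue becomes a 0/1
# indicator row, discarding all biochemical similarity information
aa_alphabet <- strsplit("ARNDCQEGHILKMFPSTWYV", "")[[1]]
pep_encode_sparse <- function(peptide) {
  residues <- strsplit(peptide, "")[[1]]
  m <- matrix(0, nrow = length(residues), ncol = length(aa_alphabet),
              dimnames = list(residues, aa_alphabet))
  m[cbind(seq_along(residues), match(residues, aa_alphabet))] <- 1
  m
}
pep_encode_sparse("LLTDAQRIV")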

I hope that made sense @jjallaire ?

I will work on creating a CNN in keras to capture the information in the 'images'. I expect that it will improve the accuracy.

...and thanks for your interest, I'm quite enjoying this 👍

jjallaire commented 6 years ago

Awesome!! We should definitely add a call to your pep_plot_images function to allow the reader to visualize what pep_encode is doing.

You could also add an S3 class to the return value of pep_encode and then create a plot method for it. So we could write:

plot(pep_encode("STDLCNKAR"))
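
Something like the following minimal sketch could work (pep_encode_s3 and the class name are hypothetical; it assumes pep_plot_images() accepts the original peptide vector, as in your example above):

# Hypothetical sketch: keep the raw peptides on the encoded array so a
# plot() method can hand them to pep_plot_images()
pep_encode_s3 <- function(peptides) {
  x <- pep_encode(peptides)
  attr(x, "peptides") <- peptides
  class(x) <- c("pep_encoding", class(x))
  x
}

plot.pep_encoding <- function(x, ...) {
  pep_plot_images(attr(x, "peptides"), ...)
}

plot(pep_encode_s3("STDLCNKAR"))
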
jjallaire commented 6 years ago

One other thing we should do at the outset is give a nod to what methods might traditionally be used here (e.g. a random forest?) before exploring DL approaches. I think this will help address the (rightly) skeptical viewpoint that DL isn't always beneficial. What we want to say is, yes, you can use other approaches to get good results but we have reason to believe that with enough work we can get better results with DL (even if this particular post doesn't get all the way there).
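
A hedged sketch of what such a baseline could look like, using the randomForest package and, purely for illustration, the integer sequence encoding defined above rather than pep_encode():

# Illustrative baseline only, under the assumptions stated above
library(randomForest)

rf_train_x <- pep_dat %>% filter(data_type == 'train') %>% pull(peptide) %>% pep_encode_sequence
rf_train_y <- pep_dat %>% filter(data_type == 'train') %>% pull(label_num) %>% factor
rf_test_x  <- pep_dat %>% filter(data_type == 'test')  %>% pull(peptide)   %>% pep_encode_sequence
rf_test_y  <- pep_dat %>% filter(data_type == 'test')  %>% pull(label_num) %>% factor

rf_model <- randomForest(x = rf_train_x, y = rf_train_y, ntree = 100)
mean(predict(rf_model, rf_test_x) == rf_test_y)  # test-set accuracy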

leonjessen commented 6 years ago

I think that it's perfectly valid to include RF. I am thinking that I will compare RF with FFWD and CNN and then hope that perf(RF) < perf(FFWD) < perf(CNN) - we'll see. I have completed documenting PepTools, both the functions and the included data, and caught a couple of bugs in the process. I will look into the CNN and RF now and return asap.

jjallaire commented 6 years ago

Excellent! Even if the CNN has worse performance than the RF or the FFWD, I think you could make the case that the CNN has the potential (in principle) to be better if it were architected correctly.

jjallaire commented 6 years ago

Checking back in to see if you've made any more progress here and have interest in publishing the post on the TensorFlow for R blog.

leonjessen commented 6 years ago

Absolutely @jjallaire - apologies for the delay. December was spent wrapping up work before Christmas, and I am teaching immunological bioinformatics full-time in January. I will continue to work on this as much as I possibly can! I think it is a very good and relevant use case. Be back asap!

jjallaire commented 6 years ago

Okay, great to hear!

leonjessen commented 6 years ago

Almost done with teaching @jjallaire ... Exams Friday and then I will allocate some time to this!

jjallaire commented 6 years ago

Awesome, that's great to hear!

jjallaire commented 6 years ago

I'm working on a Gallery of featured examples on the TensorFlow for R site here:

https://tensorflow.rstudio.com/learn/gallery.html

Yours would be the first bio related article so would be a very welcome addition! Let me know what I can do to help (hoping to have the article in place before my talk at rstudio::conf in a couple of weeks).

leonjessen commented 6 years ago

I will return with more within this week @jjallaire

leonjessen commented 6 years ago

...and the gallery looks very nice @jjallaire

leonjessen commented 6 years ago

Running a random forest on the same set of data as the std. feed-forward fully connected deep network yields ~81% accuracy vs. ~94% for the ANN, @jjallaire. Code in R/. Will look into the CNN now...

[figure: random forest performance]

Versus

[figure: ANN performance]

jjallaire commented 6 years ago

Okay, great to hear.

I have a feeling that the key to getting the CNN to work well will be tuning the architecture and hyper-parameters to find the combination that is "just right". In case you weren't aware of them, here are a couple of tools to automate hyper-parameter tuning:

1) tfruns package: https://tensorflow.rstudio.com/tools/tfruns/articles/tuning.html (see especially the tuning_run function which will automatically try a grid of hyper-parameters).

2) CloudML: https://tensorflow.rstudio.com/tools/cloudml/articles/tuning.html

CloudML is also a great way to get access to a GPU for training if you don't already have one: https://tensorflow.rstudio.com/tools/cloudml/articles/getting_started.html#training-with-a-gpu

Note that Google will give you a $500 credit ($300 for a new account, plus a $200 special credit for R users) for CloudML, so I'm guessing that using it for this project wouldn't cost you anything.

If you can beat the ANN with the CNN, and hyper-parameter tuning ends up helping you find the right model, it would be interesting to add that to the write-up.
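
A minimal sketch of that workflow (the train.R script, flag names, and values here are all hypothetical):

library(tfruns)

# train.R would read its hyper-parameters via flags(), e.g.:
#   FLAGS <- flags(flag_numeric("dropout", 0.2), flag_integer("units", 32))
runs <- tuning_run("train.R", flags = list(
  dropout = c(0.1, 0.2, 0.3),
  units   = c(32, 64, 128)
))

# tuning_run() returns a data frame of runs; sort by validation accuracy
# (the exact column name depends on the metrics logged by the model)
head(runs[order(runs$metric_val_acc, decreasing = TRUE), ])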

jjallaire commented 6 years ago

If you get to a CNN architecture that gets close to the FFN in accuracy let me know and I can give you a PR that demonstrates how to use tfruns and/or cloudml to do a hyperparameter search.

As I said before, even if we get the CNN performing about the same as the FFN I think we could link to the paper you referenced (http://www.cbs.dtu.dk/services/NetMHCpan/) and say that we expect additional investigation would very likely lead to improved accuracy.

leonjessen commented 6 years ago

Hi @jjallaire,

I have now included CNN code in R/ and the performance is comparable with the FFN:

[figure: CNN performance]

netMHCpan is the absolute state-of-the-art with respect to modelling this system. As I see it, this example is a simplified illustration of how deep learning is being used to predict molecular interactions essential for medical research. For this example, we will not achieve the same performance as netMHCpan, as its architecture is much more complicated and it also incorporates data from different sources.

Thanks for the tip on CloudML - I am running this project on my laptop.

Further thoughts?

jjallaire commented 6 years ago

I agree, we don't need to beat the FFN for this article. I think it's enough if we just link to the netMHCpan paper and say that with additional effort there's a good chance we could improve things.

Let's add the CNN to the post (along with the link to netMHCpan) and then I think we are ready to publish! (let me know when it's ready from your standpoint and I'll integrate it into our blog's hugo site and publish a draft for one final review).

BTW, I am planning for my keynote at rstudio::conf and am strongly considering using this as a case study. I actually think it's especially important to illustrate that using CNNs or other more complex NN architectures typically doesn't give you any wins for free (especially as you get outside of traditional computer vision), but that at the same time there is a frontier that can be / will be discovered where you do get a much better model.

leonjessen commented 6 years ago

I think it'd be awesome should you choose to present this use case - Naturally, I would be honoured!

netMHCpan is the result of many years' work by my professor and is unmatched in its particular niche. It incorporates custom algorithms like NNAlign, but the gist is that it is an example of successfully applying deep learning to model complex molecular interactions. DL is well suited to this due to the high degree of non-linearity and context-dependent responses.

I will add the CNN to the post and write it up as soon as possible!

jjallaire commented 6 years ago

Awesome! I think if you just cite the paper and describe why DL is, as you said, well suited to this problem and likely to yield improved results with more effort, that will be great.

Let me know when it's good to go and I'll publish!

leonjessen commented 6 years ago

I have cleaned up the scripts in R/. I will edit the markdown post now...

jjallaire commented 6 years ago

Awesome, looking forward to publishing this! (will likely be on Monday assuming your markdown tweaks land before then).

leonjessen commented 6 years ago

Great! What do you think: should I include the entire CNN approach? It's 99% identical to the FFN, except for an initial conv-layer in the model architecture (see code in R/; the difference is sketched below). I am thinking that we could describe that we repeated the analysis using a CNN with one conv-layer and then show the results, to keep the post relatively short?
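
To give a feel for the difference, here is a hedged sketch of the conv-layer variant, assuming the 9x20 pep_encode() 'images' as input and illustrative layer sizes (the actual model definition lives in R/):

# Sketch only: FFN-style dense stack preceded by a single conv layer
model <- keras_model_sequential() %>% 
  layer_conv_1d(filters = 32, kernel_size = 3, activation = 'relu',
                input_shape = c(9, 20)) %>% 
  layer_flatten() %>% 
  layer_dense(units = 180, activation = 'relu') %>% 
  layer_dense(units = 3, activation = 'softmax')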

jjallaire commented 6 years ago

Yes, I think just showing the model definition with the extra CNN layer and then referencing the results is fine. We should make the point (with reference to the aforementioned paper) that we expect that with iteration we can get the CNN to work better (and why we expect this -- e.g. what in the nature of the data makes us think that convolutional filters will add predictive capacity).

leonjessen commented 6 years ago

The thing is, I am not confident that CNNs will necessarily increase the performance in our case. As I see it, CNNs are extremely powerful at edge detection in images, and hence at extracting "collections" of features forming e.g. a mouth. However, in our case we don't really have this structure in the image, e.g.:

[figure: encoded peptide 'image']

As you also stated, CNNs are mainly applicable to computer vision?

jjallaire commented 6 years ago

Okay, let's drop the CNN bit then, as we don't want to falsely assert that this might be a good direction when it might in fact be a dead end!

Perhaps instead just some speculation about what methods (including changes to FF network architecture) might improve things?

jjallaire commented 6 years ago

It might also be interesting to show that we tried a CNN and then discuss why this might not work at all (important for us to demonstrate that there are many blind alleys in deep learning!)


leonjessen commented 6 years ago

Exactly! As I see it, the major obstacle in DL is people getting lost in architecture/hyperparameter space! Of course, knowing what it is you are modeling helps a lot (prior knowledge, that is), but currently you need experience (and luck) to get the last bit of performance out of your model.

leonjessen commented 6 years ago

...and you can easily spend months pursuing dead ends!

leonjessen commented 6 years ago

Will put some time into finishing up tonight (CET)...

jjallaire commented 6 years ago

Awesome, I am working on my talk today and was hoping I'd have your contribution to reference! Once you are done let me know and I'll publish it.


leonjessen commented 6 years ago

I'll ping you as soon as I have something for you... Family dinner now and then DL! :-)

leonjessen commented 6 years ago

Btw., when we build models, a huge amount of work goes into creating balanced training and test sets and training the model in a cross-validation scenario, usually 5-fold. We then save each of the five models and create an ensemble prediction: wisdom of the crowd. We are very careful about avoiding overfitting, as it of course decreases the model's extrapolation performance. Should I touch upon this briefly also?
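
A minimal sketch of the idea (illustrative only: build_model, the input shape, and the epoch count are placeholder assumptions, taking a flattened 9x20 encoding as input):

# Hedged sketch of 5-fold training plus a wisdom-of-the-crowd ensemble
build_model <- function() {
  keras_model_sequential() %>% 
    layer_dense(units = 180, activation = 'relu', input_shape = c(180)) %>% 
    layer_dense(units = 3, activation = 'softmax') %>% 
    compile(loss = 'categorical_crossentropy',
            optimizer = optimizer_rmsprop(), metrics = 'accuracy')
}

folds  <- sample(rep(1:5, length.out = nrow(x_train)))
models <- lapply(1:5, function(k) {
  m <- build_model()
  m %>% fit(x_train[folds != k, ], y_train[folds != k, ],
            validation_data = list(x_train[folds == k, ], y_train[folds == k, ]),
            epochs = 100, batch_size = 50, verbose = 0)
  m
})

# Ensemble: average the five models' predicted class probabilities
ens_probs <- Reduce(`+`, lapply(models, function(m) predict(m, x_test))) / 5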

jjallaire commented 6 years ago

Yeah, I think it's worth at least describing the general practice (I don't think you need to show it in code though)


leonjessen commented 6 years ago

Ok

leonjessen commented 6 years ago

Almost there...

leonjessen commented 6 years ago

Now

leonjessen commented 6 years ago

Ok @jjallaire, so I changed quite a lot. I also tried to include the things we have discussed in this thread. Please do let me know how you think it turned out, and please feel free to edit and move things around as you see fit, so the post complies with your original intent.

I'm not sure if the post is now too long?

jjallaire commented 6 years ago

Excellent! I'll do a detailed review tomorrow morning first thing. I think it's okay if these posts are fairly long, as the people reading more than a few paragraphs are typically also motivated to read on. Length = more exploration of all of the subtleties of the problem space, which is a good thing!

leonjessen commented 6 years ago

Agreed, it is always difficult to limit the content while maintaining the essentials. Let me know when I should take a look at it again!

jjallaire commented 6 years ago

Okay, a draft of the post is published here: https://broker-crocodile-46084.netlify.com/blog/deep-learning-cancer-immunotherapy

I made only minor changes and corrections. Take a look and let me know if it's okay to publish.

I also provided a summary of the post on the gallery here (let me know if it looks okay): https://broker-crocodile-46084.netlify.com/learn/gallery

leonjessen commented 6 years ago

I think it looks quite nice now, @jjallaire? Please feel free to check the language; English is not my first language. I still think the post is a bit long, but if you're ok with it, I think we should move ahead and publish it :-)

jjallaire commented 6 years ago

Okay, great. I did a language and spelling check and I think it's all good. Will publish-away!


leonjessen commented 6 years ago

Fantastic! Enjoy rstudio::conf!

jjallaire commented 6 years ago

Thanks, will do! Published the post and announced on Twitter here: https://twitter.com/rstudio/status/958056290261094405


leonjessen commented 6 years ago

Amazing @jjallaire!!! Please feel free to add my twitter account @jessenleon to the announcement (if it's not too late).

leonjessen commented 6 years ago

One last thing @jjallaire - could you merge your updates, so that the repo reflects the final version?

jjallaire commented 6 years ago

The only hangup is that posts on the TensorFlow blog have eval = FALSE (so we don't have to constantly re-run all of the R code when publishing), so all of the images/figures are static.

How about if I create a branch with the revised Rmd source + images and then you can decide how you want to merge this (preserving R code execution vs. going with eval = FALSE).
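
For reference, the switch is just the standard knitr chunk option, set once in the post's setup chunk (a generic illustration, not the blog's actual source):

```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)  # render the code, but never run it
```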

leonjessen commented 6 years ago

Ah I see... Yes, that would be fine and then I can have a look at it as you suggest

jjallaire commented 6 years ago

Okay, branch is here: https://github.com/leonjessen/tensorflow_rstudio_example/tree/jj-blog-edits
