vkola-lab / peds2019

Quantifying the nativeness of antibody sequences using long short-term memory networks

Filter for unique sequences? #5

Closed wjs20 closed 3 years ago

wjs20 commented 3 years ago

Hi vkola-lab

Thanks for providing the code you used to implement your model! I have a question about the data preparation: before using your sequences as input to the model, do you filter for unique sequences or retain duplicates? The OAS database seems to contain lots of duplicates in the paired section, which is what I'm interested in modelling.

Thanks!

xf3227 commented 3 years ago

Hi wjs20,

Thank you for bringing this up! Apologies in advance, as the author who prepared the data may not respond to issues on this repo. I have personally checked the data and found no duplicates. Please let me know if you find any, and I will add a note to the README file.

From a machine learning perspective, filtering out duplicates is always the default option, since duplicated samples contribute more than others to parameter updates and bias model training. Of course, if duplication is not a severe problem, training with or without filtering won't affect model performance too much.
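
For concreteness, here is a minimal sketch of the de-duplication step being described, assuming a pandas table of OAS-style sequences (the file and column names are just placeholders, not part of this repo):

```python
import pandas as pd

# Hypothetical OAS-style table, one antibody per row; file and column names
# are assumptions for illustration only.
df = pd.read_csv("oas_paired.csv")

# Drop exact duplicate VH/VL pairs so no sample is over-represented in training.
n_before = len(df)
df = df.drop_duplicates(subset=["sequence_alignment_aa_heavy", "sequence_alignment_aa_light"])
print(f"removed {n_before - len(df)} duplicate pairs; {len(df)} unique pairs remain")
```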

wjs20 commented 3 years ago

Ok, thank you! Also, is there any reason the paper focuses on just the VH and does not use the paired-sequence database? I understand that there are more sequences available in the unpaired database, but don't you lose some important information about the distribution of heavy-light pairing if you just look at VH or VL independently?

Thanks.

SenyorDrew commented 3 years ago

Hi wjs20, I'm one of the authors on the paper and the one who prepared the data. I considered using paired sequences as you suggested, but ultimately decided to treat VH and VL independently for the following reasons:

  1. As you indicated, there are far fewer paired sequences publicly available (even more so at the time of sequence collection), and initially it was unclear how many sequences would be required for training.
  2. One of the things we were interested in was comparing human antibody sequences to those from other species, such as mice (my main intent was to use this tool to aid humanization efforts by better scoring "humanness"). There are far fewer paired antibody sequences available for other species than for humans.
  3. It was unclear whether the LSTM architecture would be able to detect some of the longer range interactions such as those between heavy and light chains. My own personal thought is that this architecture is not likely to be as effective in this scenario. I looked at linkages between CDRs in heavy chains (unpublished) to see if we could detect coupling preferences, and at best the coupling was very weak.
  4. My understanding of the general consensus in the field is that while there may be some preference for pairing of specific germlines (VH+VL), at best this is weak. Perhaps this observation will change as the number of paired sequences continues to increase.

With all that said, I don't believe there's any reason you can't treat VH and VL as paired sequences. In addition to looking for pairing preferences, it would be interesting to look at the coupling of HCDR3 and LCDR3, as they are structurally proximal. If I were doing this, one thing I'd consider is adding a constant "linker" between the VH and VL, essentially treating the antibody as an scFv.
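
A rough sketch of that linker idea is below; the (G4S)x3 linker and the placeholder sequences are purely illustrative, not something taken from the paper:

```python
# Illustrative only: join a paired VH/VL into a single scFv-like string with a
# constant linker, so a single-chain model sees both chains in one input.
LINKER = "GGGGS" * 3  # a common (G4S)x3 linker; any fixed token string would do

def to_scfv(vh: str, vl: str, linker: str = LINKER) -> str:
    """Concatenate heavy and light variable domains around a constant linker."""
    return vh + linker + vl

# Toy placeholder fragments, not real antibody sequences.
print(to_scfv("EVQLVESGGGLVQPG", "DIQMTQSPSSLSASV"))
```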

Cheers

wjs20 commented 3 years ago

Hi SenyorDrew

At the moment I am looking at using a transformer architecture to assess nativeness and aid humanization of antibodies. I have done some unsupervised pretraining on paired sequences from the OAS database, which I simply concatenated.

I might see if the species classifier I build on top of it can differentiate natively paired sequences in my test set from randomly paired heavy and light chains taken from the unpaired section of OAS. If it scored the random pairs as "less human", that would suggest some important pairing information is lost with random pairing.

...Although the paired database does contain lots of samples from diseased patients, so it may not be that representative of a natural human repertoire.

Have you done any work looking at how the nativeness score correlates with biophysical properties of antibodies (aggregation potential, stability, etc.)?

Thanks

SenyorDrew commented 3 years ago

Hi wjs20, I think it would be really interesting to use a transformer, as it likely has the ability to capture the long-range effects you're looking for. I like your idea of comparing nativeness scores of randomly paired chains vs. natively paired chains; perhaps shuffling the chains within your natively paired dataset to generate the randomly paired chains would help alleviate your concern about bias in the OAS database.
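
A quick sketch of that shuffling idea, assuming `pairs` is a list of (VH, VL) tuples from the natively paired set (the function name and structure are just illustrative):

```python
import random

def shuffle_pairs(pairs, seed=0):
    """Build a randomly paired negative set by shuffling light chains across
    the natively paired dataset, so both sets share the same sequence pool
    and only the pairing differs."""
    random.seed(seed)
    lights = [vl for _, vl in pairs]
    random.shuffle(lights)
    shuffled = [(vh, new_vl) for (vh, _), new_vl in zip(pairs, lights)]
    # Drop any pair that happens to keep its original partner.
    return [(vh, vl) for (vh, vl), (_, orig_vl) in zip(shuffled, pairs) if vl != orig_vl]
```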

We've thought quite a bit about finding in silico scores that might correlate with biophysical properties such as aggregation, Tm, or self-association. Unfortunately, most of the publicly available datasets are quite small, and from what I've heard from multiple people at conferences, different labs will get quite different "results" even for the same antibody sequences, suggesting that experimental conditions (buffers, etc.) and protocols matter. I do think it's likely that, overall, more "native-like" antibodies are better behaved (see the heavily mutated anti-HIV antibodies, which are difficult to work with), but I think the sequence-function landscape will be very rough, so nativeness scores are not likely to be good predictors of biophysical properties.

Cheers

wjs20 commented 3 years ago

Yes, the largest dataset I could find with ground-truth labels for antibody developability profiles was the Jain et al. 2017 dataset, with 137 antibodies. I had a go at training a classifier on features extracted from my pretrained transformer, but it just ended up overfitting the training data. The sequences didn't look linearly separable when I ran a PCA, so I wasn't sure how to proceed.

During pretraining the model was very accurate at infilling masked tokens and seemed to capture the biophysical characteristics of amino acids (e.g. mistaking small amino acids for other small amino acids, or hydrophobic for hydrophobic), so it's unclear why that doesn't translate into an ability to differentiate unstable from stable sequences.
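
For what it's worth, a minimal sketch of that kind of check, using a rough (assumed) grouping of residues by property rather than anything from the paper:

```python
# Rough, assumed grouping of amino acids by biophysical property, used to check
# whether a masked-token mistake at least stays within the same group.
GROUPS = [set("AGST"), set("CVLIMFWY"), set("DEKRH"), set("NQ"), set("P")]

def same_group(true_aa: str, pred_aa: str) -> bool:
    """True if the predicted residue falls in the same property group as the true one."""
    return any(true_aa in g and pred_aa in g for g in GROUPS)

print(same_group("V", "I"))  # True: hydrophobic mistaken for hydrophobic
print(same_group("V", "D"))  # False: hydrophobic vs. charged
```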

But you're probably right: some measure of nativeness is probably the best we will be able to do for now, until more high-throughput methods come along to give us bigger datasets to train on.

Cheers