vkola-lab / peds2019

Quantifying the nativeness of antibody sequences using long short-term memory networks
MIT License

AbLSTM.py returns scores in different order #4

Closed prihoda closed 3 years ago

prihoda commented 3 years ago

Hi all, I noticed a very important issue - the ablstm.py script returns scores in a different order than the order of the input sequences.

I tried processing a diverse set of sequences in one file (human, humanized and murine therapeutic sequences) and got scores that were not consistent with your published distributions:

image

At first I thought it was an overfitting issue, but then I found that I got a different result when processing just the first few sequences. When I processed the sequences one by one, the scores fell into the expected ranges:

image

xf3227 commented 3 years ago

Hi prihoda. Thank you for the comment! I have encountered similar issues caused by inconsistent random number generation across different environments. Since we were also processing the sequences one by one during the testing stage, we failed to notice this bug. I will try fixing it and get back to you soon.

xf3227 commented 3 years ago

In the eval() function, I accidentally made the dataloader shuffle the sequences. Thank you for pointing this out. It would also be greatly appreciated if you could help us test the code again to see whether the issue has been resolved.
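
To illustrate the kind of bug described here, below is a minimal sketch in plain Python. It is not the repository's actual code: `score`, `batches`, `evaluate` and `evaluate_ordered` are hypothetical names, and `score` is just a stand-in for the model. It shows why a shuffling loader returns scores that no longer line up with the input, and one way to keep them aligned by carrying the original index along with each sequence.

```python
import random


def score(seq):
    # Hypothetical stand-in for the model's per-sequence score.
    return len(seq)


def batches(items, batch_size, shuffle, seed=0):
    """Yield batches of (original_index, item), optionally in shuffled order."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [(i, items[i]) for i in order[start:start + batch_size]]


def evaluate(seqs, batch_size=2, shuffle=False):
    """Return scores in loader order; this matches the input only if shuffle=False."""
    return [score(s) for batch in batches(seqs, batch_size, shuffle) for _, s in batch]


def evaluate_ordered(seqs, batch_size=2, shuffle=False):
    """Return scores aligned with the input, whatever order the loader used."""
    out = [None] * len(seqs)
    for batch in batches(seqs, batch_size, shuffle):
        for i, s in batch:
            out[i] = score(s)  # write each score back to its original slot
    return out
```

If the loader is a PyTorch `DataLoader`, the simplest equivalent of the first variant is to construct it with `shuffle=False` for evaluation. Comparing batch output against one-by-one output, as was done above, is a cheap regression check for this class of bug.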

prihoda commented 3 years ago

Hi @xf3227, thanks for the quick fix. I am now getting the same result when running one by one as when running the whole file 👍 You can close this issue.

Btw, a side note: in terms of usability, I think users might find it useful to have some instructions on producing the AHo-aligned input files. You could even include a script, since it takes a few steps (running ANARCI to produce an aligned CSV and then converting that CSV to txt while making sure that the same positions as in your input files are present).

ANARCI will only include positions that exist within your processed set of sequences, so here's what I got from the ANARCI CSV on my set of sequences:

QVQLKES-GPGLVAPSQSLSITCTVSG-FSVTN-----YGVHWVRQPPGKGLEWLGVIWA----GGITNYNSAFMSRLSISKDNSKSQVFLKMNSLQIDDTAMYYCASRGGHY-------------------GYALDYWGQGTSVTVSS

I then needed to insert the gaps at the correct positions:

-QVQLKES-GPGLVAPSQSLSITCTVSG-FSVTN-----YGVHWVRQPPGKGLEWLGVIWA----GGITNYNSAFMSRLSISKDNSKSQVFLKMNSLQIDDTAMYYCASRGGHY-------------------GYALDYWGQGTSVTVSS
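
The gap-insertion step can be sketched roughly as follows. This is only an illustration under assumptions: `pad_to_template` is a hypothetical helper, the position labels are assumed to be strings taken from the column headers of the aligned CSV, and `template_positions` is assumed to be the full ordered list of positions that the model's input files use.

```python
def pad_to_template(aligned, present_positions, template_positions, gap="-"):
    """Place each residue at its position label; fill missing positions with gaps.

    aligned            -- one aligned sequence, one character per present position
    present_positions  -- position labels actually present in the alignment (assumed strings)
    template_positions -- full ordered list of positions expected by the model (assumed)
    """
    if len(aligned) != len(present_positions):
        raise ValueError("sequence length must match the number of position labels")
    unknown = set(present_positions) - set(template_positions)
    if unknown:
        raise ValueError(f"positions not in template: {sorted(unknown)}")
    by_position = dict(zip(present_positions, aligned))
    # Walk the template in order, emitting a gap wherever the alignment had no column.
    return "".join(by_position.get(p, gap) for p in template_positions)
```

For example, with a toy template of positions `"1"` to `"4"` and an alignment that only contains positions `"2"` to `"4"`, `pad_to_template("QVQ", ["2", "3", "4"], ["1", "2", "3", "4"])` gives `"-QVQ"`, which mirrors the leading gap added above.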

xf3227 commented 3 years ago

Hi @prihoda, thank you for locating this bug. I just closed this thread.

As for sequence alignment, sorry, I was not the one handling that part, and I am not experienced with sequence alignment tools. Two possible solutions could be:

  1. Simply remove the gaps from all sequences. The model can run in two modes, one of which handles unaligned sequences, although performance may be somewhat poorer.

  2. Create your own training dataset, aligned in whatever format you choose.
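
For the first option, removing gaps is a one-liner. A minimal sketch, assuming gaps are written as `-` (some tools also use `.`); `strip_gaps` is a hypothetical helper, not part of the repository:

```python
def strip_gaps(seq, gap_chars="-."):
    """Return the sequence with all gap characters removed."""
    return "".join(c for c in seq if c not in gap_chars)
```

Applied to the start of the aligned sequence above, `strip_gaps("-QVQLKES-GPG")` returns `"QVQLKESGPG"`.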

Thank you again for bringing this up! I hope this repo helps with your research and projects!