tianshilu / pMTnet

Deep Learning the T Cell Receptor Binding Specificity of Neoantigen

GNU General Public License v2.0

Slow encoding #1

Closed Akazhiel closed 2 years ago

Akazhiel commented 2 years ago

Greetings!

This is a great tool for predicting TCR-pMHC binding, but is there any way to speed up the encoding step? As I understand it, the aim of the tool is to predict how well your TCR repertoire binds to the predicted pMHCs, and the encoding is far slower than I'd expect. Since you'd pair each TCR with the whole list of pMHCs to test for binding, the input files run to millions of lines. I'm currently running it on a file with 2M lines; it has been running for almost 3 days and the encoding is not even close to being done. Maybe the input isn't meant to be all possible combinations, just some of them? In that case, how would you select them?

Best regards,

Jonatan

tianshilu commented 2 years ago

Hi @Akazhiel ,

Thanks for your interest!

One selection step you can do before encoding is to run all pMHCs through netMHCpan and keep only those with a satisfactory rank (e.g. 2% or better). Then pair the remaining pMHCs with the TCRs as input to pMTnet. We are also working on speeding up the encoding process computationally.

Best, Tianshi
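
A minimal sketch of the pre-filtering step suggested above, assuming the pMHC predictions are already loaded as records with a percentile rank. The field names (`antigen`, `hla`, `rank`), the rank values, and the helper functions are hypothetical and not part of pMTnet or netMHCpan; only the 2% cutoff comes from the reply. It shows how dropping pMHCs before pairing shrinks the number of TCR-pMHC rows, since the row count is the product of the two list lengths.

```python
from itertools import product

# Hypothetical inputs: field names and rank values are invented for illustration.
tcrs = ["CASSLGQAYEQYF", "CASSIRSSYEQYF", "CASSPGTGGNEQFF"]
pmhcs = [
    {"antigen": "GILGFVFTL", "hla": "A*02:01", "rank": 0.05},
    {"antigen": "NLVPMVATV", "hla": "A*02:01", "rank": 0.40},
    {"antigen": "AAAAAAAAA", "hla": "A*02:01", "rank": 35.0},
]

RANK_CUTOFF = 2.0  # percent rank; 2% is the example threshold from the reply


def filter_pmhcs(records, cutoff=RANK_CUTOFF):
    """Keep only pMHCs whose percentile rank is at or below the cutoff."""
    return [r for r in records if r["rank"] <= cutoff]


def pair_rows(tcr_list, pmhc_list):
    """Yield one (CDR3, antigen, HLA) row per TCR-pMHC combination."""
    for cdr3, p in product(tcr_list, pmhc_list):
        yield (cdr3, p["antigen"], p["hla"])


kept = filter_pmhcs(pmhcs)
rows = list(pair_rows(tcrs, kept))
print(len(tcrs) * len(pmhcs), "rows before filtering")
print(len(rows), "rows after filtering")
```

Because `pair_rows` is a generator, the pairs can also be streamed straight to a file instead of being held in memory, which matters once the product reaches millions of rows.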

Akazhiel commented 2 years ago

Hello @tianshilu ,

Yes, we do run the pMHCs through an algorithm other than netMHCpan and filter them by affinity percentile. My question was more about how (if possible) to reduce the number of candidate TCRs, since you'd want to screen each TCR against all the pMHCs.

Cheers,

Jonatan

tianshilu commented 2 years ago

Hi @Akazhiel ,

Sorry, we don't have a pre-selection step for TCRs. We are working on speeding up the encoding and prediction. Thanks very much for your feedback!

Tianshi

Akazhiel commented 2 years ago

Hi @tianshilu

That's totally understandable; subsetting the TCRs might indeed be really hard to achieve. I've been tinkering with the code and sped up the encoding steps that run before the autoencoder. My knowledge of machine learning is pretty limited, so I wouldn't know how to speed up the autoencoder itself or the predictions.

If it's okay with you, I'll open a pull request so that you can review the code. I've done some testing, and TCRmap, antigenMap and HLAMap together now take less than one minute for a dataset of 2M rows. For large datasets the bottleneck is now, as I mentioned, the prediction step, since it needs to loop through each value.

Cheers,

Jonatan
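
One common way to speed up this kind of pre-encoding, sketched here as an assumption and not as a description of what the pull request does: in a file of TCR-pMHC pairs the same sequence appears on many rows, so each distinct sequence can be encoded once and the result reused. The encoder below is a toy stand-in (integer index per residue, padded to a fixed length), and `encode_sequence` and `encode_column` are hypothetical names, not pMTnet functions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(seq, max_len=15):
    """Toy stand-in encoder: one integer per residue, padded with -1 to max_len."""
    codes = [AA_INDEX[aa] for aa in seq[:max_len]]
    return tuple(codes + [-1] * (max_len - len(codes)))


def encode_column(sequences, encoder=encode_sequence):
    """Encode each distinct sequence once, then reuse the result for repeats."""
    cache = {}
    encoded = []
    for seq in sequences:
        if seq not in cache:
            cache[seq] = encoder(seq)  # the expensive call runs once per unique sequence
        encoded.append(cache[seq])
    return encoded, len(cache)


# Eight rows, but only two distinct antigens to encode.
column = ["GILGFVFTL", "NLVPMVATV"] * 4
encoded, n_unique = encode_column(column)
print(len(encoded), "rows encoded with", n_unique, "encoder calls")
```

With this pattern the encoding cost scales with the number of unique sequences instead of the number of rows, which is where a full TCR-by-pMHC cross product gains the most.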

tianshilu commented 2 years ago

Hi @Akazhiel ,

Thanks for your effort on this. Please feel free to open a pull request!

Thanks!

Tianshi

tianshilu commented 2 years ago

The encoding code has been updated and is now faster.