google-research / adapter-bert

Efficient/low-latency serving with multiple adapters? #3

Closed by wassname 5 years ago

wassname commented 5 years ago

Thanks for releasing this paper. I must say that I've had similar results to yours: adapters beat linear combinations, conv heads, and every other head I can think of. As a bonus, they nearly match fully fine-tuned BERT's performance!

A couple of broad questions if that's ok:

The only downside is that with adapter models you can't use a single BERT service to provide features to multiple heads (e.g. bert-as-service). Do you have any ideas about efficiently serving multiple adapters? The best approach I can think of is keeping the main model in memory and quickly switching the adapter parameters based on the request, as in the sketch below.
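A rough toy sketch of that swapping idea (nothing here is the real adapter-bert code; the shapes, names, and single "encoder" matrix are made up, and in the real model the adapters sit inside every transformer layer):

```python
# Sketch: one frozen BERT kept in memory, tiny per-task adapter weights swapped per request.
import numpy as np

HIDDEN = 768
BOTTLENECK = 64  # adapter bottleneck size (illustrative)

# Loaded once at server start-up and shared by every request.
frozen_bert_weights = {"encoder": np.random.randn(HIDDEN, HIDDEN)}

# One small set of adapter weights per task, also resident in memory.
adapter_store = {
    task: {
        "down": np.random.randn(HIDDEN, BOTTLENECK) * 0.01,
        "up": np.random.randn(BOTTLENECK, HIDDEN) * 0.01,
    }
    for task in ("sentiment", "ner")
}

def handle_request(features, task):
    """Pick the requested task's adapters; the frozen weights never move."""
    adapters = adapter_store[task]  # cheap dict lookup, no model reload
    hidden = features @ frozen_bert_weights["encoder"]
    # Bottleneck adapter with a residual connection (nonlinearity simplified to ReLU here).
    hidden = hidden + np.maximum(hidden @ adapters["down"], 0.0) @ adapters["up"]
    return hidden

features = np.random.randn(1, HIDDEN)  # stand-in for tokenised input features
print(handle_request(features, "sentiment").shape)  # (1, 768)
```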

Also, have you experimented with online/active learning with adapters? It seems like a fruitful area, since we don't know how to do online learning well with transformers, but adapters let you train with high LRs and fewer parameters. Something like the toy sketch below.
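Here's roughly the kind of loop I have in mind (a toy numpy example, not BERT-specific; the "frozen" features just stand in for the BERT body, and the shapes and learning rate are made up):

```python
# Toy sketch of the online-learning idea: the big model stays frozen, and each
# incoming example triggers one aggressive SGD step on the tiny adapter weights only.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, BOTTLENECK = 768, 16   # made-up sizes
LR = 0.1                       # far higher than you'd dare use on the full model

adapter = {
    "down": rng.normal(0.0, 0.01, (HIDDEN, BOTTLENECK)),
    "up": rng.normal(0.0, 0.01, (BOTTLENECK, 1)),
}

def online_step(x, y):
    """See an example once, take one high-LR step on the adapter only, then discard it."""
    h_pre = x @ adapter["down"]                       # (1, BOTTLENECK)
    h = np.maximum(h_pre, 0.0)                        # ReLU bottleneck
    pred = h @ adapter["up"]                          # (1, 1) prediction
    err = pred - y                                    # gradient of 0.5 * err**2 w.r.t. pred
    grad_up = h.T @ err                               # (BOTTLENECK, 1)
    grad_h = (err @ adapter["up"].T) * (h_pre > 0)    # back through the ReLU
    grad_down = x.T @ grad_h                          # (HIDDEN, BOTTLENECK)
    adapter["up"] -= LR * grad_up
    adapter["down"] -= LR * grad_down
    return abs(err.item())

# Simulated stream: each (x, y) arrives once; x stands in for frozen BERT features.
for t in range(100):
    x = rng.normal(size=(1, HIDDEN))
    y = float(x[:, :BOTTLENECK].sum() > 0)            # made-up target
    online_step(x, y)
```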

ghost commented 5 years ago

@wassname Nice paper!

Re. serving: yes, the server must know about the relevant adapter parameters. Two possibilities I can think of:

1) Store adapters for the various tasks on the server (like a word embedding matrix), and have each request specify which adapters it wants to use. One needs to own the service in order to do this.
2) Pass the adapter parameters to the server alongside the input text with the request. This would probably only be effective with very small adapters (which can work well for some tasks).
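To make option 2) concrete, it could look something like this on the wire (purely illustrative; the field names, the JSON/base64 encoding, and the tiny bottleneck size are my own assumptions):

```python
# Sketch of option 2: ship tiny adapter weights inside the request itself.
import base64
import json
import numpy as np

HIDDEN, BOTTLENECK = 768, 8  # a very small bottleneck keeps the payload tiny

def pack_adapter(down, up):
    """Client side: encode adapter weights into a JSON request payload."""
    return json.dumps({
        "down": base64.b64encode(down.astype(np.float32).tobytes()).decode(),
        "up": base64.b64encode(up.astype(np.float32).tobytes()).decode(),
        "shape": [HIDDEN, BOTTLENECK],
    })

def unpack_adapter(payload):
    """Server side: decode the adapter weights sent with the request."""
    msg = json.loads(payload)
    h, b = msg["shape"]
    return {
        "down": np.frombuffer(base64.b64decode(msg["down"]), np.float32).reshape(h, b),
        "up": np.frombuffer(base64.b64decode(msg["up"]), np.float32).reshape(b, h),
    }

payload = pack_adapter(np.zeros((HIDDEN, BOTTLENECK)), np.zeros((BOTTLENECK, HIDDEN)))
print(len(payload), "characters on the wire")   # roughly 2 * 768 * 8 * 4 bytes, base64-inflated
print(unpack_adapter(payload)["down"].shape)    # (768, 8)
```

The payload grows linearly with the adapter size, which is why this only makes sense for very small bottlenecks.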

Re. online learning: are you talking about the setting where each training example is seen only once, then discarded? If so, this is an interesting idea that we have not tried. Indeed, the fact that one can get away with a more aggressive learning rate might help.

wassname commented 5 years ago

Re: online learning: something similar. The data might be stored, but you want results right away, i.e. online learning. Later you could probably do full retraining (as suggested in your paper) with weighted sampling. I haven't tried it either, but will let you know if I do.

Thanks!