jstremme opened 2 years ago
Thoughts on a strategy like this with T5? Since you're supposed to tip off the model that it's a translation task via the prompt, I'm curious how it'd perform for our pseudo-translation.
Hopping off for the day now, but I plan to play with this soon.
There might be a way to do this with prompts, but I think the most straightforward way will be to treat this as a classification task where either:

1. we run inference once per token in the input sequence and classify which vocab token should replace it, or
2. we predict the whole output sequence in one pass, with a distribution over the vocab at every position.
For both approaches, I think we'd want to modify the last layer of the network directly, sorta like this. Instead of 2 output features we'd have an output of shape (batch_size, seq_len, vocab_size) where for each sample, and each item in the sequence, we predict a token from the vocab (approach #2). Approach #1 would involve running inference for each token in the input sequence.
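Something like this for the output head (just a shape sketch; all the sizes are placeholders, not anything we've settled on):

```python
import torch
import torch.nn as nn

# Approach #2 sketch: swap the 2-feature classification head for a layer that
# maps every position's hidden state to a distribution over the full vocab.
# All of these sizes are made up for illustration.
batch_size, seq_len, hidden_size, vocab_size = 4, 128, 768, 30522

hidden_states = torch.randn(batch_size, seq_len, hidden_size)  # pretend encoder output
output_head = nn.Linear(hidden_size, vocab_size)

logits = output_head(hidden_states)
print(logits.shape)  # torch.Size([4, 128, 30522]) i.e. (batch_size, seq_len, vocab_size)
```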
My concern with actually framing this as a translation problem (in the way that is typically done with LMs pretrained on multi-language datasets) is that we're just using English. I'm imagining an architecture inspired by translation but not actually translation. Just my current thought on the best way of going about it. No doubt other things could work.
I think we could still use T5, and we may need some sort of seq2seq model instead of BERT or a BERT-like model, where the output is usually a single feature rather than a sequence.
Rad! Will take a look at this stuff this week.
Also realized I forgot the link I was referencing. Here it is.
> My concern with actually framing this as a translation problem (in the way that is typically done with LMs pretrained on multi-language datasets) is that we're just using English. I'm imagining an architecture inspired by translation but not actually translation.
Right, I was thinking the same. At a glance, it didn't look like the Hugging Face T5 + seq2seq model above was doing anything specific to translating between languages, but I could be missing something?
One thing is that prefix they're recommending (i.e. "translate English to French: ") that T5 was presumably trained with. Would you assume that'd cause funky behavior if we instead prefixed it with "translate a neuroscience paper to a developmental biology paper: "?
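For reference, this is roughly how that prefix gets plugged in with the Hugging Face API (a quick sketch; t5-small and the example sentences are just placeholders, and the second prefix is our made-up one):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# A prefix T5 actually saw during training:
inputs = tokenizer("translate English to French: The cortex develops in layers.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Our made-up prefix -- T5 has never seen it, so this is where the
# "funky behavior" question comes in:
inputs = tokenizer("translate a neuroscience paper to a developmental biology paper: "
                   "Dopaminergic neurons modulate reward signaling.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```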
I think so, yeah. We probably need to do light surgery on the actual model architecture of an LM trained on English-only text so that the output layer predicts a sequence of tokens. It would be similar to a summarization model, but without the constraint that the output is much shorter than the input.
A lot of these translation models are pretrained on datasets consisting of multiple languages. Since we're just dealing with English, I think we can start with BERT, PubMedBERT, or a similar LM, chop off the layer used for predicting masked tokens, then add a layer that predicts, for each token in the input sequence, the corresponding token in a new sequence. Most tokens will stay the same, and with labelled (input, target) sequence pairs the model will learn the right replacements.
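A very rough sketch of what I mean, using bert-base-uncased as a stand-in for PubMedBERT and assuming the input and target sequences are padded to the same length so positions line up:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TokenRewriter(nn.Module):
    """Encoder with the MLM head dropped and a fresh per-token vocab head added."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # no MLM head attached
        hidden = self.encoder.config.hidden_size
        vocab = self.encoder.config.vocab_size
        # For every input token, predict the corresponding token in the target sequence.
        self.head = nn.Linear(hidden, vocab)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden_states = self.encoder(input_ids=input_ids,
                                     attention_mask=attention_mask).last_hidden_state
        logits = self.head(hidden_states)  # (batch, seq_len, vocab_size)
        loss = None
        if labels is not None:
            # labels = token ids of the target sequence, position-aligned with the
            # input; most positions stay identical, only the domain terms change.
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TokenRewriter()

# Toy (input, target) pair, padded to the same length so positions align.
enc = tokenizer("synaptic plasticity drives learning", padding="max_length",
                max_length=16, truncation=True, return_tensors="pt")
tgt = tokenizer("morphogen gradients drive patterning", padding="max_length",
                max_length=16, truncation=True, return_tensors="pt")["input_ids"]

loss, logits = model(enc["input_ids"], enc["attention_mask"], labels=tgt)
print(loss.item(), logits.shape)  # scalar loss, (1, 16, vocab_size)
```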