jstremme opened 2 years ago
Thoughts on a strategy like this with T5? Since you're supposed to tip off the model that it's a translation task via the prompt, I'm curious how it'd perform for our pseudo-translation.
Hopping off for the day now, but I plan to play with this soon.
There might be a way to do this with prompts, but I think the most straightforward way will be to treat this as a classification task where either:

1. we run inference once per token in the input sequence and classify which vocab token should replace it, or
2. we predict the whole output sequence in one pass, with a distribution over the vocab at every position.
For both approaches, I think we'd want to modify the last layer of the network directly, sorta like this. Instead of 2 output features we'd have an output of shape (batch_size, seq_len, vocab_size) where for each sample, and each item in the sequence, we predict a token from the vocab (approach #2). Approach #1 would involve running inference for each token in the input sequence.
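Something like this for the output head (just a shape sketch; all the sizes are placeholders, not anything we've settled on):

```python
import torch
import torch.nn as nn

# Approach #2 sketch: swap the 2-feature classification head for a layer that
# maps every position's hidden state to a distribution over the full vocab.
# All of these sizes are made up for illustration.
batch_size, seq_len, hidden_size, vocab_size = 4, 128, 768, 30522

hidden_states = torch.randn(batch_size, seq_len, hidden_size)  # pretend encoder output
output_head = nn.Linear(hidden_size, vocab_size)

logits = output_head(hidden_states)
print(logits.shape)  # torch.Size([4, 128, 30522]) i.e. (batch_size, seq_len, vocab_size)
```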
My concern with actually framing this as a translation problem (in the way that is typically done with LMs pretrained on multi-language datasets) is that we're just using English. I'm imagining an architecture inspired by translation but not actually translation. Just my current thought on the best way of going about it. No doubt other things could work.
I think we could still use T5, and we may need some sort of seq2seq model instead of BERT or a BERT-like model, where the output is usually a single feature rather than a sequence.
Rad! Will take a look at this stuff this week.
Also realized I forgot the link I was referencing. Here it is.
> My concern with actually framing this as a translation problem (in the way that is typically done with LMs pretrained on multi-language datasets) is that we're just using English. I'm imagining an architecture inspired by translation but not actually translation.
Right, I was thinking the same. At a glance, it didn't look like the Hugging Face T5 + seq2seq model above was doing anything specific to translating between languages, but I could be missing something?
One thing is that prefix they're recommending (i.e. "translate English to French: ") that T5 was presumably trained with. Would you assume that'd cause funky behavior if we instead prefixed it with "translate a neuroscience paper to a developmental biology paper: "?
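For reference, this is roughly how that prefix gets plugged in with the Hugging Face API (a quick sketch; t5-small and the example sentences are just placeholders, and the second prefix is our made-up one):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# A prefix T5 actually saw during training:
inputs = tokenizer("translate English to French: The cortex develops in layers.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Our made-up prefix -- T5 has never seen it, so this is where the
# "funky behavior" question comes in:
inputs = tokenizer("translate a neuroscience paper to a developmental biology paper: "
                   "Dopaminergic neurons modulate reward signaling.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```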
I think so, yeah. We probably need to do light surgery on the actual model architecture of an LM trained on English-only text so that the output layer predicts a sequence of tokens. It would be similar to a summarization model, but without the constraint that the output is much shorter than the input.
A lot of these translation models are pretrained on datasets consisting of multiple languages. Since we're just dealing with English, I think we can start with BERT, PubMedBERT, or a similar LM, chop off the layer used for predicting masked tokens, then add a layer that predicts, for each token in the input sequence, the corresponding token in a new sequence. Most tokens will stay the same, and with labelled (input, target) sequence pairs the model will learn the right replacements.
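A very rough sketch of what I mean, using bert-base-uncased as a stand-in for PubMedBERT and assuming the input and target sequences are padded to the same length so positions line up:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TokenRewriter(nn.Module):
    """Encoder with the MLM head dropped and a fresh per-token vocab head added."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # no MLM head attached
        hidden = self.encoder.config.hidden_size
        vocab = self.encoder.config.vocab_size
        # For every input token, predict the corresponding token in the target sequence.
        self.head = nn.Linear(hidden, vocab)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden_states = self.encoder(input_ids=input_ids,
                                     attention_mask=attention_mask).last_hidden_state
        logits = self.head(hidden_states)  # (batch, seq_len, vocab_size)
        loss = None
        if labels is not None:
            # labels = token ids of the target sequence, position-aligned with the
            # input; most positions stay identical, only the domain terms change.
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TokenRewriter()

# Toy (input, target) pair, padded to the same length so positions align.
enc = tokenizer("synaptic plasticity drives learning", padding="max_length",
                max_length=16, truncation=True, return_tensors="pt")
tgt = tokenizer("morphogen gradients drive patterning", padding="max_length",
                max_length=16, truncation=True, return_tensors="pt")["input_ids"]

loss, logits = model(enc["input_ids"], enc["attention_mask"], labels=tgt)
print(loss.item(), logits.shape)  # scalar loss, (1, 16, vocab_size)
```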