tbepler / prose

Multi-task and masked language model-based protein sequence embedding models.

Controlling length of output sequences #3

Open nprasadmm opened 2 years ago

nprasadmm commented 2 years ago

Thank you for your great work! I was just wondering how I would modify the code to control the output length, as running the SkipLSTM model produces a Tensor of dimensions [386, 6165] (no pooling) when run on my pre-aligned sequences. I would like to produce much shorter representations for each of the 386 sequence components. Thank you.

tbepler commented 2 years ago

The model produces an embedding for each position of the sequence, so a sequence of length L will always produce an LxD (where D is the dimension of the embedding) embedding matrix for the sequence.

If you want a fixed-size embedding, I've found that simply averaging the embeddings over the length of the sequence to produce a single D-dimensional embedding works well for many downstream applications.
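A minimal sketch of that pooling, assuming `z` is the `L x D` per-position embedding tensor returned by the model (the names and shapes here are illustrative, not part of the library's API):

```python
import torch

# z: per-position embeddings for one sequence, shape (L, D)
# (a random stand-in here for the actual model output)
z = torch.randn(386, 6165)

# average over the length dimension to get one D-dimensional
# embedding for the whole sequence
z_pooled = z.mean(dim=0)  # shape (6165,)
```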

nprasadmm commented 2 years ago

Right, I'm asking how to reduce D.

tbepler commented 2 years ago

I recommend PCA.
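For example, a minimal sketch of reducing D with scikit-learn's PCA, assuming you have stacked one pooled embedding per sequence into a matrix (the array and target dimensionality below are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# X: one pooled embedding per sequence, shape (n_sequences, 6165)
# (random stand-in; in practice, stack the mean-pooled embeddings)
X = np.random.randn(386, 6165)

# fit PCA on the embeddings and project down to, e.g., 100 dimensions
# (n_components must be <= min(n_samples, n_features))
pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X)  # shape (386, 100)
```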

nprasadmm commented 2 years ago

I understand that I can use PCA, I'm just wondering if there's a parameter in the code that I can adjust to control the value of D from the outset.

tbepler commented 2 years ago

If you are training a new model from scratch, you can adjust the hidden dimension and the number of layers to change D. These are already fixed for the pre-trained model though.
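For context, a rough sketch of how those choices set D, assuming the `SkipLSTM` constructor exposes `hidden_dim` and `num_layers` and that the output is the concatenation of the one-hot input with the hidden states of every bidirectional layer; treat the import path and argument names as assumptions rather than the documented API:

```python
from prose.models.lstm import SkipLSTM  # import path is an assumption

# Under these assumptions, D = nin + 2 * hidden_dim * num_layers;
# for the pretrained model that is 21 + 2 * 1024 * 3 = 6165.
# A smaller model trained from scratch, e.g.
model = SkipLSTM(nin=21, nout=21, hidden_dim=256, num_layers=2)
# would give per-position embeddings of size 21 + 2 * 256 * 2 = 1045.
```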

irleader commented 1 year ago

> I recommend PCA.

Hi,

In the manuscript, you also talk about using PCA to do dimensionality reduction. "We then apply dimensionality reduction to these vectors using PCA down to the minimum of 1000 PCs or the number of data points in the dataset in order to improve the runtime of the learning algorithm."

Did you use torch.pca_lowrank or sklearn.decomposition.PCA?

You also talk about t-SNE, but only for visualization, so I assume you prefer PCA over t-SNE for downstream tasks.

There is also a popular approach of using a linear neural network layer to reduce dimensionality, e.g. from 6165 to 512. Do you think this could be as effective as PCA? (A rough sketch of what I mean is below.)

Thanks a lot.
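A minimal sketch of that linear-projection alternative, assuming the pooled embeddings are used as input; unlike PCA, this layer has to be trained (e.g. jointly with a downstream task head), so at initialization it is just a random projection (all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

# learned projection from the 6165-dimensional embeddings down to 512;
# its weights must be optimized together with the downstream model
project = nn.Linear(6165, 512)

X = torch.randn(386, 6165)  # stand-in for pooled embeddings
X_reduced = project(X)      # shape (386, 512)
```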