Open nprasadmm opened 2 years ago
The model produces an embedding for each position of the sequence, so a sequence of length L will always produce an LxD (where D is the dimension of the embedding) embedding matrix for the sequence.
If you want a fixed-size embedding, I've found that simply averaging the embeddings over the sequence length to produce a single D-dimensional embedding works well for many downstream applications.
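A minimal sketch of that mean pooling, using numpy with random stand-in values (the actual embeddings would come from the model):

```python
import numpy as np

# Hypothetical per-position output: an L x D matrix
# (L = sequence length, D = embedding dimension).
L, D = 120, 6165
embeddings = np.random.default_rng(0).standard_normal((L, D))

# Average over the length axis to get one fixed-size
# D-dimensional vector, regardless of L.
pooled = embeddings.mean(axis=0)
print(pooled.shape)  # (6165,)
```

The same one-liner works on a torch Tensor as `embeddings.mean(dim=0)`.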
Right, I'm asking how to reduce D.
I recommend PCA.
I understand that I can use PCA, I'm just wondering if there's a parameter in the code that I can adjust to control the value of D from the outset.
If you are training a new model from scratch, you can adjust the hidden dimension and the number of layers to change D. These are already fixed for the pre-trained model though.
I recommend PCA.
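A minimal PCA sketch via numpy's SVD; `sklearn.decomposition.PCA` or `torch.pca_lowrank` would do the same job. `X` is a hypothetical stack of N pooled embeddings, and the target dimension `d` is an assumed value, not from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d = 50, 6165, 32                # d = assumed target dimension
X = rng.standard_normal((N, D))       # stand-in for N pooled embeddings

Xc = X - X.mean(axis=0)               # center the data before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:d].T             # project onto top-d principal components
print(X_reduced.shape)                # (50, 32)
```

Note that PCA can recover at most min(N, D) components, which is why the manuscript caps the number of PCs at the number of data points.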
Hi,
In the manuscript, you also talk about using PCA to do dimensionality reduction. "We then apply dimensionality reduction to these vectors using PCA down to the minimum of 1000 PCs or the number of data points in the dataset in order to improve the runtime of the learning algorithm."
Did you use torch.pca_lowrank or sklearn.decomposition.PCA?
You also talk about t-SNE, but only for visualization; I assume you prefer PCA over t-SNE for downstream tasks.
There is also a popular approach of using a linear neural-network layer to reduce dimensionality, e.g. from 6165 to 512. Do you think this could be as effective as PCA?
Thanks a lot.
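For reference, at inference time such a linear layer (e.g. `nn.Linear(6165, 512)`) is just a learned matrix multiply plus bias. A sketch with numpy, with random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 6165)) * 0.01   # weight of a 6165 -> 512 layer
b = np.zeros(512)                             # bias term

x = rng.standard_normal((386, 6165))          # e.g. 386 pooled embeddings
projected = x @ W.T + b                       # equivalent of nn.Linear forward
print(projected.shape)                        # (386, 512)
```

Unlike PCA, the weights `W` and `b` are not determined by the embeddings alone; they have to be trained against some downstream objective.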
Thank you for your great work! I was just wondering how I would modify the code to control the embedding dimension, as running the SkipLSTM model produces a Tensor of shape [386, 6165] (no pooling) when run on my pre-aligned sequences. I would like to produce much shorter representations for each of the 386 sequence components. Thank you.