havakv / pycox

Survival analysis with PyTorch
BSD 2-Clause "Simplified" License
781 stars 180 forks

Question about defining the embedding sizes #84

Closed hanxiaozhen2017 closed 3 years ago

hanxiaozhen2017 commented 3 years ago

Hello everyone, I am new to pycox and currently working through the notebook 02_introduction.ipynb. I don't understand why we set the embedding dimensions to half the number of categories. This means that each category is represented by a vector half the size of the number of categories.

As I understand it, if we have 3 categories, each can clearly be represented by a vector of size 3: (1,0,0), (0,1,0), and (0,0,1). How and why should we make each category be represented by a vector half the size of the number of categories?
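The encoding described above can be sketched in a few lines of plain Python (the helper name `one_hot` is just for illustration, not part of pycox):

```python
def one_hot(index, num_classes):
    """Return a one-hot vector: 1 at position `index`, 0 elsewhere."""
    return [1 if i == index else 0 for i in range(num_classes)]

# The three categories from the question, each a vector of size 3.
vectors = [one_hot(i, 3) for i in range(3)]
print(vectors)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Note that with this scheme the vector length always equals the number of categories, which is exactly what the embedding approach relaxes.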

Thanks for helping answer this.

havakv commented 3 years ago

Hi. So, you can choose whatever embedding size you want; it is completely arbitrary that the example uses half the number of categories. What you're suggesting is called one-hot encoding, which is the standard way to encode categorical variables. Entity embeddings, on the other hand, are embeddings that are learned together with the rest of the network. This has been found to (sometimes) be very beneficial, partly because it can vastly reduce the size of the input space. They are a good tool to have available when building neural networks, so I'd advise you to check them out. There are plenty of resources online (such as https://towardsdatascience.com/entity-embeddings-for-ml-2387eb68e49).
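The idea above can be sketched with plain PyTorch `nn.Embedding` layers (the feature cardinalities below are made up, and this is a simplified stand-in for what the pycox notebook builds, not its exact code):

```python
import torch
import torch.nn as nn

# Hypothetical cardinalities for three categorical features.
num_categories = [4, 10, 7]
# The "half the size" rule from the notebook; any positive sizes would work.
embedding_dims = [max(n // 2, 1) for n in num_categories]  # [2, 5, 3]

# One learnable lookup table per categorical feature.
embeddings = nn.ModuleList(
    nn.Embedding(n, d) for n, d in zip(num_categories, embedding_dims)
)

# A batch of two samples, one integer category code per feature.
x_cat = torch.tensor([[0, 3, 6], [2, 9, 1]])

# Look up each feature's embedding and concatenate into one dense input,
# which the rest of the network consumes; the tables are trained jointly.
dense = torch.cat([emb(x_cat[:, i]) for i, emb in enumerate(embeddings)], dim=1)
print(dense.shape)  # torch.Size([2, 10]) since 2 + 5 + 3 = 10
```

Compared with one-hot vectors (which would give 4 + 10 + 7 = 21 input columns here), the learned embeddings both shrink the input and let similar categories end up with similar vectors.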

Hope this answers your question.

hanxiaozhen2017 commented 3 years ago

@havakv Many thanks! I am not familiar with this field, and your reply is very clear and helpful. Thanks again!