gmichalo / UmlsBERT

MIT License

Is it possible to have access to the NER part directly? #5

Open alexjout opened 3 years ago

alexjout commented 3 years ago

Hello and thank you for your great work.

I would like to test UmlsBERT in a self-supervised model to learn sentence embeddings, simply by using your checkpoint. But is it possible to have access to the NER part, which, given a tokenized sentence, classifies each token into the 42 UMLS categories, so that I can then use the UMLS embeddings of BERT that you propose? I saw in the paper that you mention cTAKES for this step, but I don't see where it appears in the repository. Ideally I would simply download your checkpoint (which I did) and then incorporate part of your code into my other repo to test it. I looked at the NER Jupyter notebook, but I don't see an inference step that tags a tokenized sentence with cTAKES.

I'll be so grateful for any advice on this.

Update: I have seen that vocab_updated.txt is in the repo and corresponds to the UMLS tags. So if I understand correctly, sentences don't go through a custom NER network (an LSTM or similar) to get the classes; instead, if a sentence contains tokens classified with a TUI, those tokens receive a special embedding, right?

Alexandre

gmichalo commented 2 years ago

Thank you for your interest in UmlsBERT.

Yes, you are correct. In this version of UmlsBERT, we identified the medical words in the vocabulary of the BERT model. When you pass in a sentence, if it contains a specific medical token, the model will provide both the token embedding and the TUI embedding.
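A minimal sketch of the mechanism described above, as I understand it: medical tokens listed in vocab_updated.txt are mapped to a UMLS semantic type (TUI), and the "NER" step is just this vocabulary lookup rather than a separate tagging network. The `token_to_tui` table, function names, and example TUIs below are illustrative assumptions, not UmlsBERT's actual code.

```python
# Hypothetical illustration of the lookup-based tagging described in this
# thread. In UmlsBERT the token-to-TUI mapping would come from
# vocab_updated.txt; here we hard-code a tiny illustrative subset.
token_to_tui = {
    "fever": "T184",    # Sign or Symptom
    "aspirin": "T121",  # Pharmacologic Substance
}

def annotate(tokens):
    """Pair each token with its TUI tag, or None for non-medical tokens.

    No separate NER network (LSTM, etc.) runs: the lookup against the
    augmented vocabulary is the whole "classification" step. Tokens that
    receive a TUI would then get the extra TUI embedding added on top of
    their ordinary token embedding inside the model.
    """
    return [(t, token_to_tui.get(t)) for t in tokens]

sentence = ["the", "patient", "has", "a", "fever"]
print(annotate(sentence))
# [('the', None), ('patient', None), ('has', None), ('a', None), ('fever', 'T184')]
```

Only "fever" is found in the mapping, so only that token would carry a TUI embedding; all other tokens pass through with their usual embeddings unchanged.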