kamalkraj / BioELECTRA


Similarity task usage #3

Open saeedehkhoshgoftar opened 3 years ago

saeedehkhoshgoftar commented 3 years ago

I'm interested in applying this promising pretrained model to a similarity task. It is unclear from the example provided on HuggingFace how it can be used for similarity tasks: the example defines two sentences, but only one is passed as input to the discriminator. However, the original publication states that a pair of sentences is specified as "[CLS]sentence1[SEP]sentence2[SEP]". Could you please provide further information, or an example, showing how the architecture can be used for similarity tasks?

import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("kamalkraj/bioelectra-base-discriminator-pubmed")
tokenizer = ElectraTokenizerFast.from_pretrained("kamalkraj/bioelectra-base-discriminator-pubmed")

sentence = "The quick brown fox jumps over the lazy dog"
fake_sentence = "The quick brown fox fake over the lazy dog"

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)

# Per-token prediction: 1 = token flagged as replaced, 0 = original
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
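For reference, the tokenizer itself will build the "[CLS] sentence1 [SEP] sentence2 [SEP]" layout described in the paper when it is given a pair of sentences. A minimal sketch, assuming the same checkpoint names as above (the example sentences are placeholders, not from the repo):

import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("kamalkraj/bioelectra-base-discriminator-pubmed")
discriminator = ElectraForPreTraining.from_pretrained("kamalkraj/bioelectra-base-discriminator-pubmed")

sentence1 = "The patient was treated with metformin."
sentence2 = "Metformin was administered to the patient."

# Passing both sentences yields input_ids of the form [CLS] s1 [SEP] s2 [SEP],
# plus token_type_ids distinguishing the two segments.
inputs = tokenizer(sentence1, sentence2, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

with torch.no_grad():
    outputs = discriminator(**inputs)

# Note: the pre-training head only scores each token as replaced/original;
# it does not output a similarity score, which is why fine-tuning is needed
# for similarity tasks.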

kamalkraj commented 3 years ago

For similarity tasks it is better to use Sentence Transformers. You can fine-tune BioELECTRA, or any transformer model from Hugging Face, with https://www.sbert.net/
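A minimal sketch of that route, assuming the sentence-transformers package and the checkpoint name used above (the pooling setup and example sentences are placeholders, not a recipe from this repo):

from sentence_transformers import SentenceTransformer, models, util

# Wrap BioELECTRA as a sentence encoder: transformer + mean pooling.
word_embedding_model = models.Transformer(
    "kamalkraj/bioelectra-base-discriminator-pubmed", max_seq_length=256
)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Without fine-tuning this gives mean-pooled BioELECTRA embeddings; fine-tune on
# labelled sentence pairs (e.g. with losses.CosineSimilarityLoss) for better scores.
embeddings = model.encode(
    ["The patient was treated with metformin.",
     "Metformin was administered to the patient."]
)
print(util.cos_sim(embeddings[0], embeddings[1]))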