nsndimt / CatalysisIE

Active learning #1

Open DianaCH1 opened 6 months ago

DianaCH1 commented 6 months ago

Good day,

Thank you for this great model!

I am currently trying to reproduce your code and train the model on an extended dataset, creating a new one with your techniques. My question is how you performed the active learning part. I could not find the corresponding source code in the repository. Did I miss it somewhere, or was it intentionally not uploaded? Could you help me with that, or show how I can run it with your model?

Thank you in advance!

nsndimt commented 6 months ago

Active learning is not part of model training or testing. It is a technique I used to collect training data more efficiently. It prioritizes the annotation of certain text and is useful when one kind of entity is rare in your text. You can look it up on wiki if you are not familiar with it.

First, prepare some unannotated text and run the model on it. Treat the model's output for each span as a probability distribution over all possible span types. Then rank the spans by the probability of one class, e.g. catalyst. This gives you a list where the top has high model probability and is likely to contain true positives, and the bottom has low model probability and is likely to contain true negatives. The model is least certain about the middle of the list, so we prioritize annotating that part first.

https://github.com/nsndimt/CatalysisIE/blob/e476bbf4d1faee614f92cd08161ecf38f66baddf/model.py#L155
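
The ranking step described above could be sketched roughly as follows. This is not code from the repository: the function name `rank_for_annotation`, the input format (span text paired with a dict of per-type probabilities), and the example spans and label names are all hypothetical. It also assumes that "the middle part" means the spans whose probability for the chosen type is closest to 0.5; taking the middle slice of the probability-sorted list would be another reasonable reading.

```python
def rank_for_annotation(spans, target_type, midpoint=0.5):
    """Order spans so the ones the model is least sure about come first.

    spans: list of (span_text, {span_type: probability}) pairs
           (a hypothetical format, not the repository's actual output).
    target_type: the span type to rank by, e.g. "catalyst".
    midpoint: probability treated as "most uncertain" (assumed to be 0.5).
    """
    scored = [(text, probs.get(target_type, 0.0)) for text, probs in spans]
    # Smallest distance from the midpoint = most uncertain = annotate first.
    return sorted(scored, key=lambda item: abs(item[1] - midpoint))


# Made-up model outputs for illustration only.
spans = [
    ("Pt/Al2O3", {"catalyst": 0.97, "reactant": 0.02, "O": 0.01}),
    ("zeolite", {"catalyst": 0.55, "reactant": 0.05, "O": 0.40}),
    ("was heated", {"catalyst": 0.01, "reactant": 0.01, "O": 0.98}),
    ("glycerol", {"catalyst": 0.30, "reactant": 0.65, "O": 0.05}),
]

queue = rank_for_annotation(spans, "catalyst")
for text, p in queue:
    print(f"{p:.2f}  {text}")
```

With these made-up numbers, the uncertain spans ("zeolite", then "glycerol") land at the front of the annotation queue, while the confident positive and the confident negative fall to the back.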