stanford-crfm / BioMedLM

zero-shot keyword extraction #7

Open uyaseen opened 1 year ago

uyaseen commented 1 year ago

Hello, I am planning to use pubmedgpt for zero-shot keyword extraction on biomedical text. On my (proprietary) dataset, GPT-3 has shown pretty decent performance for keyword extraction. I wanted to get your thoughts on the zero-shot generalization capabilities of pubmedgpt, especially for tasks such as keyword extraction. Also, can you point me to helpful prompt format(s) optimized for pubmedgpt?

Many thanks!

J38 commented 1 year ago

I am not sure if this is going to work zero-shot. But I will note that in the training corpus I do see examples at the end of abstracts and articles with "Keywords: ..." ... So you could try appending "Keywords: " to the end of your example and seeing what happens. I think the key for this to work is finding a format that appears repeatedly in the PubMed training data ... another direction is to briefly fine-tune ... I think if you came up with several hundred examples and fine-tuned on those it should produce reasonable results. We could help if you have trouble getting access to compute for the fine-tuning ... I'll try to explore this for keyword extraction as well and let you know what I see!
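A minimal sketch of the "append `Keywords: `" idea, assuming the input is laid out like a PubMed record (title, newline, abstract) and that the model continues with a semicolon-separated list. The function names `build_keyword_prompt` and `parse_keywords` are hypothetical helpers, not part of this repo, and the actual text-generation call is deliberately left out:

```python
import re


def build_keyword_prompt(title: str, abstract: str, cue: str = "Keywords: ") -> str:
    """Lay the text out like a PubMed record and end it with the keyword cue."""
    return f"{title.strip()}\n{abstract.strip()} {cue}"


def parse_keywords(continuation: str) -> list[str]:
    """Turn the model's continuation into a de-duplicated keyword list.

    Assumption: the keywords are on the first generated line, separated by
    ';' or ',', and possibly terminated by a period.
    """
    first_line = continuation.strip().split("\n", 1)[0].rstrip(". ")
    seen, keywords = set(), []
    for part in re.split(r"[;,]", first_line):
        keyword = part.strip()
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords
```

The prompt goes to whatever generation setup you already use; only the text after the cue is passed to `parse_keywords`. If the model tends to run on into the next "article", cutting at the first newline as done here is one simple stopping rule.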

J38 commented 1 year ago

I am not sure how many training examples are needed for fine-tuning, but on the MeQSum task there are only 500 examples, so it is possible that with a relatively small training set the model could be fine-tuned to do the right thing ...
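If you go the fine-tuning route, one option is to write each annotated example in the same "Keywords: ..." style that already shows up in PubMed, so the fine-tuning text looks like the pretraining text. This is only a sketch: the helper names and the one-JSON-object-per-line layout with a `"text"` field are assumptions, not a format this repo is known to require.

```python
import json


def to_training_text(text: str, keywords: list[str]) -> str:
    """Join a passage and its gold keywords into one PubMed-style string."""
    joined = "; ".join(k.strip() for k in keywords if k.strip())
    return f"{text.strip()} Keywords: {joined}."


def to_jsonl(examples: list[tuple[str, list[str]]]) -> str:
    """Serialize (text, keywords) pairs as JSON Lines with a 'text' field (assumed schema)."""
    return "\n".join(
        json.dumps({"text": to_training_text(text, keywords)})
        for text, keywords in examples
    )
```

Adapt the output to whatever your fine-tuning script expects; the point is just to keep the cue and separator consistent across all examples.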

J38 commented 1 year ago

My advice would be to look at the PubMed abstracts and articles and look at what patterns involve "Keywords" ... I see things like "Keywords used", "Keywords: ", "Keywords included " ... I think if this works zero-shot it would be because there is a common pattern in the PubMed abstracts and articles.
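A rough way to survey those patterns, assuming you have some PubMed abstracts or articles loaded as plain strings. The regex and the `count_keyword_patterns` name are illustrative only; it tallies "Keywords" together with either a following colon or the next word:

```python
import re
from collections import Counter

# "Keywords" / "Key words" followed by an optional colon or one following word.
KEYWORD_PATTERN = re.compile(r"\bKey\s?words?\b(?:\s*:|\s+[A-Za-z]+)?", re.IGNORECASE)


def count_keyword_patterns(documents: list[str]) -> Counter:
    """Count normalized 'Keywords ...' phrasings across a list of documents."""
    counts = Counter()
    for document in documents:
        for match in KEYWORD_PATTERN.finditer(document):
            # Lowercase and collapse whitespace so variants group together.
            counts[" ".join(match.group(0).lower().split())] += 1
    return counts
```

Calling `count_keyword_patterns(docs).most_common(10)` then gives the most frequent phrasings, which are the natural candidates to try as prompt cues.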

J38 commented 1 year ago

Here is a real PubMed example for instance:

Identification of the experimental herbaceous host range of the Apscaviroids infecting citrus species.\nCitrus viroid V (CVd-V), citrus dwarfing viroid (CDVd) and citrus bent leaf viroid (CBLVd) (the genus Apscaviroid, the family Pospiviroidae) have been reported to be restricted to citrus species naturally. The herbaceous host range of these viroids was identified using the viroids infectious clones. Several herbaceous plants from the Cucurbitaceae, Solanaceae, Fabaceae and Asteraceae families were found to be susceptible to CVd-V, CDVd and CBLVd. Also, the viroids could be transferred to these hosts through rubbing of monomeric DNA plasmids and through mechanical inoculation of infected sap. Keywords: citrus; viroid; host range; CVd-V; CBLVd; CDVd.

J38 commented 1 year ago

And remember our model only has a context length of 1024 tokens, so you need to break up the input into blocks of at most 1024 tokens ... so if you had long input it'd be better to get keywords for each section and then combine them at the end ...
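A sketch of that split-then-combine step, under a few assumptions: `tokens` is whatever the model's tokenizer returns for your text, `reserved` is a hypothetical budget left free in each block for the "Keywords: " cue plus the generated keywords, and the per-block keyword lists come from your own generation code. Splitting on real section boundaries is preferable when you have them; fixed-size blocks are the fallback shown here.

```python
def split_into_blocks(tokens: list, max_len: int = 1024, reserved: int = 64) -> list[list]:
    """Cut a token sequence into consecutive blocks of at most max_len - reserved tokens."""
    block_size = max_len - reserved
    if block_size <= 0:
        raise ValueError("reserved must be smaller than max_len")
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def merge_keywords(per_block_keywords: list[list[str]]) -> list[str]:
    """Combine keyword lists from all blocks, dropping case-insensitive duplicates."""
    seen, merged = set(), []
    for keywords in per_block_keywords:
        for keyword in keywords:
            key = keyword.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(keyword.strip())
    return merged
```

Each block is decoded back to text, prompted separately, and the resulting lists are passed to `merge_keywords` in document order so that earlier sections keep priority.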

uyaseen commented 1 year ago

Hi @J38,

Many thanks for your reply, it was extremely helpful. You are right about the out-of-the-box zero-shot performance: the initial results don't look very promising. I still have to search for more prompt patterns, so maybe the results will improve. Unfortunately, we only have (very) limited data, which is why we were hoping a zero-shot setup would work out, but in the worst case we will try to get additional data and have it annotated. Thankfully, we have access to compute, but thank you for offering your support, really appreciate it!