igematberkeley / NLPChemExtractor


Training data creation from BRENDA data using weak model #19

Open mrunalimanj opened 3 years ago

mrunalimanj commented 3 years ago

The number of samples in ChemProt that include substrates or products is comparatively low: only about 150 abstracts mention them, and roughly six times that many individual sentences contain a substrate or product according to the annotations.

Furthermore, there's a concern about whether the training data, which comes entirely from abstracts, is an ideal representation of the input data, which is largely drawn from the bodies of papers (which, unlike abstracts, tend not to be concise).

Therefore, we're thinking of using a subset of the BRENDA data, together with the papers that report those reactions, to generate more substrate/enzyme training data straight from the literature. How to do this is still a bit unclear, so it'd be great to hash it out with someone!

sghandian commented 3 years ago

The approach for this is distant supervision: label each word in a sentence (from the BRENDA data) as substrate, enzyme, or product. The heuristic is to look up the set of substrates, products, and enzymes listed in BRENDA for a particular DOI, then label each chemical name pre-detected by chemTagger in the sentence according to which of those entity sets it appears in (sketched below).
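
A minimal sketch of that heuristic, assuming the labeling function, entity-set structure, and token representation shown here are hypothetical placeholders rather than the repo's actual code:

```python
# Sketch of the distant-supervision heuristic described above.
# label_sentence and the brenda_entities layout are assumptions,
# not NLPChemExtractor's actual API.

def label_sentence(tokens, detected_chems, brenda_entities):
    """Assign a label to each token via distant supervision.

    tokens: list of words in the sentence.
    detected_chems: set of chemical names pre-detected by chemTagger
        in this sentence (lowercased).
    brenda_entities: dict mapping label -> set of entity names listed
        in BRENDA for this sentence's DOI.
    """
    labels = []
    for token in tokens:
        word = token.lower()
        label = "O"  # default: not a tracked entity
        # Only consider tokens chemTagger already flagged as chemicals.
        if word in detected_chems:
            for entity_type, names in brenda_entities.items():
                if word in names:
                    label = entity_type
                    break
        labels.append(label)
    return labels


# Toy example:
tokens = ["Hexokinase", "phosphorylates", "glucose", "."]
detected = {"hexokinase", "glucose"}
brenda = {
    "SUBSTRATE": {"glucose"},
    "PRODUCT": {"glucose-6-phosphate"},
    "ENZYME": {"hexokinase"},
}
print(label_sentence(tokens, detected, brenda))
# -> ['ENZYME', 'O', 'SUBSTRATE', 'O']
```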

Right now, a rough tagger has been created to label substrates (and it can be generalized to label the other entity types), but it isn't very efficient yet. I'm working on vectorizing it to cut the runtime.
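
One possible way to vectorize the per-token loop, assuming tokens live in a pandas DataFrame; the column names and entity sets here are made up for illustration, not the tagger's actual structure:

```python
import pandas as pd

# Hypothetical token table for one sentence.
df = pd.DataFrame({"token": ["Hexokinase", "phosphorylates", "glucose", "."]})

substrates = {"glucose"}
products = {"glucose-6-phosphate"}
enzymes = {"hexokinase"}

# Vectorized set-membership masks replace the Python-level loop.
lower = df["token"].str.lower()
df["label"] = "O"
df.loc[lower.isin(substrates), "label"] = "SUBSTRATE"
df.loc[lower.isin(products), "label"] = "PRODUCT"
df.loc[lower.isin(enzymes), "label"] = "ENZYME"

print(df)
```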