igematberkeley / NLPChemExtractor


Training data creation from BRENDA data using weak model #19

Open mrunalimanj opened 3 years ago

mrunalimanj commented 3 years ago

The number of samples in ChemProt that include substrates or products is comparatively low: only about 150 abstracts mention them, and roughly six times that many individual sentences contain a substrate or product according to the annotations.

Furthermore, there's a concern about whether the training data, which comes entirely from abstracts, is an ideal representation of the input data, which is largely drawn from the bodies of papers (which, unlike abstracts, tend not to be concise).

Therefore, we're thinking of using a subset of the BRENDA data, together with the papers that report those reactions, to generate more substrate/enzyme training data straight from the literature. How to do this is still a bit unclear, so it'd be great to hash it out with someone!

sghandian commented 3 years ago

The approach for this is distant supervision: label each word in a sentence (from the BRENDA data) as substrate, enzyme, or product. The heuristic is to look up the set of substrates, products, and enzymes listed in BRENDA for a particular DOI, then label each chemical name pre-detected by chemTagger in the sentence according to which of those entity sets it appears in (sketched below).
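
A minimal sketch of that heuristic, assuming the labeling function, entity-set structure, and token representation shown here are hypothetical placeholders rather than the repo's actual code:

```python
# Sketch of the distant-supervision heuristic described above.
# label_sentence and the brenda_entities layout are assumptions,
# not NLPChemExtractor's actual API.

def label_sentence(tokens, detected_chems, brenda_entities):
    """Assign a label to each token via distant supervision.

    tokens: list of words in the sentence.
    detected_chems: set of chemical names pre-detected by chemTagger
        in this sentence (lowercased).
    brenda_entities: dict mapping label -> set of entity names listed
        in BRENDA for this sentence's DOI.
    """
    labels = []
    for token in tokens:
        word = token.lower()
        label = "O"  # default: not a tracked entity
        # Only consider tokens chemTagger already flagged as chemicals.
        if word in detected_chems:
            for entity_type, names in brenda_entities.items():
                if word in names:
                    label = entity_type
                    break
        labels.append(label)
    return labels


# Toy example:
tokens = ["Hexokinase", "phosphorylates", "glucose", "."]
detected = {"hexokinase", "glucose"}
brenda = {
    "SUBSTRATE": {"glucose"},
    "PRODUCT": {"glucose-6-phosphate"},
    "ENZYME": {"hexokinase"},
}
print(label_sentence(tokens, detected, brenda))
# -> ['ENZYME', 'O', 'SUBSTRATE', 'O']
```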

Right now, a rough tagger has been created to label substrates (and it can be generalized to label the other entity types), but it isn't very efficient yet. I'm working on vectorizing it to cut the runtime.
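
One possible way to vectorize the per-token loop, assuming tokens live in a pandas DataFrame; the column names and entity sets here are made up for illustration, not the tagger's actual structure:

```python
import pandas as pd

# Hypothetical token table for one sentence.
df = pd.DataFrame({"token": ["Hexokinase", "phosphorylates", "glucose", "."]})

substrates = {"glucose"}
products = {"glucose-6-phosphate"}
enzymes = {"hexokinase"}

# Vectorized set-membership masks replace the Python-level loop.
lower = df["token"].str.lower()
df["label"] = "O"
df.loc[lower.isin(substrates), "label"] = "SUBSTRATE"
df.loc[lower.isin(products), "label"] = "PRODUCT"
df.loc[lower.isin(enzymes), "label"] = "ENZYME"

print(df)
```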