Closed olivrrrrr closed 3 years ago
I managed to work this out by altering the source code in chemdataextractor/nlp/tokenize.py so that the word tokenizer no longer splits tokens on colons. I changed
`if char in {':', ';'}:
    if not (before and after and after[0].isdigit() and before.rstrip('′\'')[-1:].isdigit() and '-' in after) and not (self.NO_SPLIT_CHEM.search(before) and self.NO_SPLIT_CHEM.search(after)):
        return self._split_span(span, i, 1)`
to
`if char in {';'}:
    if not (before and after and after[0].isdigit() and before.rstrip('′\'')[-1:].isdigit() and '-' in after) and not (self.NO_SPLIT_CHEM.search(before) and self.NO_SPLIT_CHEM.search(after)):
        return self._split_span(span, i, 1)`
Hopefully, somebody may find this helpful.
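For anyone wondering why this change matters, here is a rough sketch in plain Python. It assumes that the parser's regex is tested against one token at a time, so a ratio such as 2:3 can only match if the tokenizer keeps it as a single token. The `split_tokens` and `ratio_tokens` helpers are hypothetical stand-ins written for this illustration; they are not ChemDataExtractor's tokenizer or parser.

```python
import re

# Same pattern as in the question, written as a raw string.
RATIO = re.compile(r"\w+:+\w+")


def split_tokens(text, split_chars):
    """Hypothetical stand-in tokenizer: split on whitespace, then split each
    piece around any character in split_chars (keeping that character as its
    own token)."""
    pattern = "([" + re.escape("".join(sorted(split_chars))) + "])"
    tokens = []
    for piece in text.split():
        tokens.extend(part for part in re.split(pattern, piece) if part)
    return tokens


def ratio_tokens(tokens):
    """Return the tokens that match the ratio pattern in full."""
    return [tok for tok in tokens if RATIO.fullmatch(tok)]


text = "ratio of 2:3 and C6:C2 by GC-MS"

# Splitting on both ':' and ';' breaks "2:3" into "2", ":", "3".
tokens_before = split_tokens(text, {":", ";"})
# Splitting on ';' only leaves "2:3" and "C6:C2" intact.
tokens_after = split_tokens(text, {";"})

print(ratio_tokens(tokens_before))  # []
print(ratio_tokens(tokens_after))   # ['2:3', 'C6:C2']
```

With the colon in the split set, no single token contains a colon between word characters, so a per-token regex has nothing to match. Note that editing the library source changes tokenization for every document, not only for ratios.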
I am attempting to write a custom parser to extract ratios from this text:
d = Document( Heading(u'1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C2) '), Paragraph(u'Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed with areaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and C6:C2 by GC-MS, see below)in a combined yield of 66% as a white solid.'))
I am using the following regex in order to do this (it has proven to work independently of the custom parser code):
value = R(r"\w+:+\w+")('value')
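As a minimal standalone check of the pattern, using only Python's `re` module and none of the parser code, the sketch below runs it over the raw paragraph text. The variable names are illustrative, and the text is copied unchanged from the paragraph above.

```python
import re

# Paragraph text copied verbatim from the Document above.
paragraph_text = (
    "Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed "
    "with areaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and "
    "C6:C2 by GC-MS, see below)in a combined yield of 66% as a white solid."
)

# One or more word characters, one or more colons, one or more word characters.
ratio_pattern = re.compile(r"\w+:+\w+")

matches = ratio_pattern.findall(paragraph_text)
print(matches)  # ['2:3', 'C6:C2']
```

This only shows that the pattern matches the untokenized string; it says nothing about how the text is tokenized before the parser sees it.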
Any help would be appreciated.
Full code is here: