mcs07 / ChemDataExtractor

Automatically extract chemical information from scientific documents
http://chemdataextractor.org
MIT License

Regex expression to extract colon not working #36

Closed olivrrrrr closed 3 years ago

olivrrrrr commented 3 years ago

I am attempting to make a custom parser to extract ratios from this text:

d = Document( Heading(u'1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C2) '), Paragraph(u'Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed with a reaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and C6:C2 by GC-MS, see below)in a combined yield of 66% as a white solid.'))

I am using the following regex to do this (it works on its own, outside the custom parser code):

value = (R(r"\w+:+\w+"))('value')
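As a quick check of the claim that the pattern works on its own, it can be run with Python's `re` module directly against the sentence. This is a minimal sketch using only the standard library. It does not involve ChemDataExtractor's tokenizer, so it only shows that the pattern is sound on raw text, not that it will match inside a parser.

```python
import re

# The same pattern as in the parser, applied to the raw sentence
# rather than to individual tokens.
RATIO = re.compile(r"\w+:+\w+")

text = "ratio of 2:3 and C6:C2 by GC-MS, see below"

matches = RATIO.findall(text)
print(matches)  # ['2:3', 'C6:C2']
```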

Any help would be appreciated.

Full code is here:

```python
from chemdataextractor import Document
from chemdataextractor.doc import Heading, Paragraph
from chemdataextractor.model import BaseModel, Compound, StringType, ListType, ModelType
from chemdataextractor.parse import R, I
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first


# Custom model holding the extracted ratio, attached to Compound
class RatioOf(BaseModel):
    value = StringType()
    prefix = StringType()

Compound.ratio_of = ListType(ModelType(RatioOf))

# Grammar: the word "ratio" or "of", followed by a single token such as "2:3"
prefix = (I('ratio') | I('of')).hide()
value = R(r"\w+:+\w+")('value')
ro = (prefix + value)(u'ro')


class RoParser(BaseParser):
    root = ro

    def interpret(self, result, start, end):
        compound = Compound(
            ratio_of=[
                RatioOf(
                    prefix=first(result.xpath('./prefix/text()')),
                    value=first(result.xpath('./value/text()'))
                )
            ]
        )
        yield compound

Paragraph.parsers = [RoParser()]

d = Document(
    Heading(u'1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C2) and 1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C6)'),
    Paragraph(u'Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed with a reaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and C6:C2 by GC-MS, see below)in a combined yield of 66% as a white solid.'))

d.records.serialize()
```

olivrrrrr commented 3 years ago

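I managed to work this out. The word tokenizer splits tokens around colons, so a value such as 2:3 never reaches the parser as a single token for the regex to match. I fixed it by altering the source code in chemdataextractor/nlp/tokenize.py, changing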

```python
if char in {':', ';'}:
    # Split around colon unless it looks like we're in a chemical name
    if not (before and after and after[0].isdigit() and before.rstrip('′\'')[-1:].isdigit() and '-' in after) and not (self.NO_SPLIT_CHEM.search(before) and self.NO_SPLIT_CHEM.search(after)):
        return self._split_span(span, i, 1)
```

to

```python
if char in {';'}:
    # Split around colon unless it looks like we're in a chemical name
    if not (before and after and after[0].isdigit() and before.rstrip('′\'')[-1:].isdigit() and '-' in after) and not (self.NO_SPLIT_CHEM.search(before) and self.NO_SPLIT_CHEM.search(after)):
        return self._split_span(span, i, 1)
```

Hopefully, somebody may find this helpful.
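
The effect of the tokenizer change can be illustrated without the library. The sketch below uses a deliberately simplified, hypothetical tokenizer (`toy_tokenize` is not part of ChemDataExtractor, and the real tokenizer applies many more rules) and assumes the ratio pattern is tested against one token at a time. It shows why a single-token pattern finds nothing while colons are split off, and why it matches once only semicolons are split.

```python
import re

# Single-token pattern from the question (assumed to be tested per token).
RATIO = re.compile(r"\w+:+\w+")


def toy_tokenize(text, split_chars):
    """Split on whitespace, then split each word around any of split_chars.

    A simplified, hypothetical stand-in for the library's word tokenizer.
    """
    splitter = re.compile("([" + re.escape(split_chars) + "])")
    tokens = []
    for word in text.split():
        # The capturing group keeps the separator as its own token.
        tokens.extend(piece for piece in splitter.split(word) if piece)
    return tokens


def ratio_tokens(tokens):
    """Return the tokens that match the ratio pattern in full."""
    return [tok for tok in tokens if RATIO.fullmatch(tok)]


text = "ratio of 2:3 and C6:C2 by GC-MS"

before = toy_tokenize(text, ":;")  # colon and semicolon both split
after = toy_tokenize(text, ";")    # only semicolon splits

print(before)                # '2', ':' and '3' arrive as separate tokens
print(ratio_tokens(before))  # []
print(ratio_tokens(after))   # ['2:3', 'C6:C2']
```

Note that editing the tokenizer changes tokenization for every document processed, so other parsers that rely on colons being separate tokens may behave differently afterwards.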