mcs07 / ChemDataExtractor

Automatically extract chemical information from scientific documents
http://chemdataextractor.org
MIT License
287 stars 112 forks

Extracting entities inside an entity #32

Open gihanpanapitiya opened 3 years ago

gihanpanapitiya commented 3 years ago

Does anyone know how to write a custom parser that extracts a named entity nested inside another entity?

For example, from the following sentence I want to extract 'boiling' as a name entity nested inside the prefix entity.

d = Sentence('Synthesis of 2,4,6-trinitrotoluene (3a).The procedure was followed to yield a pale yellow solid (boiling point 240 °C)')

This is my attempt to write the parser:

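import re

from chemdataextractor.doc import Sentence
from chemdataextractor.model import BaseModel, Compound, StringType, ListType, ModelType
from chemdataextractor.parse import R, I, W, Optional
from chemdataextractor.parse.actions import join, merge
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    prefix = StringType()
    name = StringType()

Compound.boiling_points = ListType(ModelType(BoilingPoint))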

prefix = (R(u'^b\.?p\.?$', re.I) | I(u'boiling')(u'name') + I(u'point')).add_action(join)(u'prefix')
units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()')),
                    prefix=first(result.xpath('./prefix/text()')),
                    name=first(result.xpath('./name/text()')),
                )
            ]
        )
        yield compound

Sentence.parsers = [BpParser()]

However, d.records.serialize() produces this, with no name field:

[{'boiling_points': [{'value': '240', 'units': '°C', 'prefix': 'boiling point'}]}]

maddenfederico commented 3 years ago

All you have to do is tweak the XPath you use to access the name element. Parse results are returned as a tree: whatever you assign to root becomes the root node, the elements that make up root become its child nodes, and so on.

So you would write name = first(result.xpath('./prefix/name/text()')), since name is a child of prefix.
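To make the path rule concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for the element tree a parser hands to interpret(). The tree is built by hand, and the "word" tag is a hypothetical placeholder, so this shows only how a nested lookup differs from a direct one, not what BpParser actually returns.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for a parse result: <bp> is the root,
# and <name> sits inside <prefix> rather than directly under <bp>.
bp = ET.Element("bp")
prefix = ET.SubElement(bp, "prefix")
ET.SubElement(prefix, "name").text = "boiling"
ET.SubElement(prefix, "word").text = "point"  # hypothetical tag for the unnamed token
ET.SubElement(bp, "value").text = "240"
ET.SubElement(bp, "units").text = "°C"

# Looking for <name> directly under the root finds nothing.
direct = bp.findtext("./name")

# Walking through <prefix> first reaches the nested element.
nested = bp.findtext("./prefix/name")

print(direct, nested)
```

The same reasoning applies to result.xpath(...): the path has to name every level between the root and the element you want.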

gihanpanapitiya commented 3 years ago

I tried that, but I am still getting the same output as before.

maddenfederico commented 3 years ago

It might be the .add_action(join), then. It seems to merge all of the tokens and put them in a single node, which would drop the nested name element. It may not be the best solution, but the first thing that comes to mind is to capture boiling and point as separate elements and then join them within interpret(). I'm actually curious, so I'm about to run my own tests.
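The suspected effect can be sketched without the library. The snippet below is a model only: flatten_join is a hypothetical helper that imitates a join-style action (one node, same tag, child text joined by spaces), it is not ChemDataExtractor's implementation, and the "word" tag is again a placeholder. It shows why a nested name lookup would fail after flattening, and how keeping the children and joining the text at read time keeps both values available.

```python
import xml.etree.ElementTree as ET

def flatten_join(element):
    """Hypothetical model of a join-style action: keep the tag,
    join all descendant text with spaces, and drop the children."""
    flat = ET.Element(element.tag)
    flat.text = " ".join(t for t in element.itertext() if t.strip())
    return flat

# <prefix> with two children, as it might look before any action runs.
prefix = ET.Element("prefix")
ET.SubElement(prefix, "name").text = "boiling"
ET.SubElement(prefix, "word").text = "point"  # hypothetical tag

# After flattening, the text survives but the <name> child is gone.
flattened = flatten_join(prefix)
name_after_join = flattened.findtext("./name")

# Alternative: leave the children in place and join only when reading,
# which is the kind of step that could live inside interpret().
name_kept = prefix.findtext("./name")
prefix_text = " ".join(child.text for child in prefix)

print(flattened.text, name_after_join, name_kept, prefix_text)
```

If the real join action behaves like this model, dropping .add_action(join) from prefix and building the 'boiling point' string inside interpret() would let both prefix and name be filled in.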

gihanpanapitiya commented 3 years ago

Thanks for the suggestion! I haven't worked with interpret(). I am going to start experimenting with it.