lfoppiano / grobid-superconductors

Grobid module for superconductor material and properties extraction
Apache License 2.0

Certain expressions might be wrongly parsed by text2chem #56

Closed lfoppiano closed 1 year ago

lfoppiano commented 1 year ago

If we try to parse the material "optimally doped Tl2212 single crystal", text2chem returns "Tl2212Tl2212" as the formula.

lfoppiano commented 1 year ago

In our material parser we tag this expression as "name". However, I'm wondering whether we should recognise these "names" and treat them separately from an ordinary material name (e.g., hydrogen).

@kensei-te I'm sure I've asked you already.

Does the expression Tl2212 have a specific transformation into a normal formula that does not require much effort (e.g., context or prior knowledge)?

kensei-te commented 1 year ago

Tl2212 is Tl2Ba2CaCu2O8+delta. Here, delta stands for some value between 0 and 1 (depending on the material and the solubility of oxygen, the delta range can be narrower in practice).
Tc varies drastically (0-90 K) with the value of delta. Determining delta experimentally is possible but not easy, and it takes a lot of effort. The other way is to estimate delta from some empirical value, but this does not work when a special case exists.

From a machine learning point of view, neither "Tl2212" nor "Tl2Ba2CaCu2O8+delta" is suitable as training data UNLESS delta is clearly specified. The same can be true for any other abbreviated name.

lfoppiano commented 1 year ago

Thank you @kensei-te!

I'm trying to recognise this particular notation with a regex, so that such expressions can be excluded from material names.
I'm thinking of the following sequence:
a) one or two letters, b) an optional space or dash, c) 3 or 4 digits

As a regex, this would be something like [A-Za-z]{1,2}[ -]?[0-9]{3,4}.

Such a regex would match the following:

Tl-2212 Tl2212 Tl 2212 Y-123 Y123 Y 123
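
A quick way to check the proposed pattern could look like the sketch below. It uses Python's standard `re` module; the word-boundary anchors (`\b`) and the helper name `find_abbreviations` are assumptions added for illustration and are not part of grobid-superconductors or text2chem.

```python
import re

# Pattern proposed in this thread: 1-2 letters, optional space or dash, 3-4 digits.
# The \b anchors are an assumption added here to avoid matching inside longer tokens.
ABBREVIATION = re.compile(r"\b[A-Za-z]{1,2}[ -]?[0-9]{3,4}\b")


def find_abbreviations(text):
    """Return every substring of `text` that matches the abbreviation pattern."""
    return ABBREVIATION.findall(text)


# Forms discussed in this thread.
samples = ["Tl-2212", "Tl2212", "Tl 2212", "Y-123", "Y123", "Y 123"]
for sample in samples:
    # fullmatch() requires the whole string to match the pattern.
    print(sample, ABBREVIATION.fullmatch(sample) is not None)

# Searching inside a longer material expression.
print(find_abbreviations("optimally doped Tl2212 single crystal"))
```

One caveat with this sketch: because the space is optional, ordinary text such as "in 2012" also satisfies the pattern, so the space-separated form may need extra context before it is used to exclude a name.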

My question: do you think there are additional forms that could escape this pattern?