lfoppiano / text2chem

RegEx-based text parser that converts chemical terms and material entities into a chemical data structure.
MIT License

Invalid parsing of Oxygen doping? #2

Open lfoppiano opened 9 months ago

lfoppiano commented 9 months ago

The formula: Bi2Sr2CaCu2O 8+δ is incorrectly parsed by material_parser.parse() as:

composition = {dict: 1} {'composition': OrderedDict([('Bi', '2'), ('Sr', '2'), ('Ca', '1'), ('Cu', '2'), ('O', '8')])}
 'composition' = {OrderedDict: 5} OrderedDict([('Bi', '2'), ('Sr', '2'), ('Ca', '1'), ('Cu', '2'), ('O', '8')])
  'Bi' = {str} '2'
  'Sr' = {str} '2'
  'Ca' = {str} '1'
  'Cu' = {str} '2'
  'O' = {str} '8'
  __len__ = {int} 5
 __len__ = {int} 1

Oxygen should be 8+δ.

It seems to be a problem only with the last element and its amount.

GGNoWayBack commented 9 months ago

This step happens mainly in text2chem.regex_parser.separate_oxygen_deficiency. Its purpose is to separate the deficient or excess oxygen atoms (commonly written as ±δ in chemistry) from the formula. The final result records this separation in the oxygen_deficiency field:

{'material_string': 'Bi2Sr2CaCu2O8-δ', 
'material_name': '', 
'material_formula': 'Bi2Sr2CaCu2O8', 
'additives': [], 'phase': '', 
'oxygen_deficiency': '-', 
'amounts_x': {}, 
'elements_x': {}, 
'composition': [{'formula': 'Bi2Sr2CaCu2O8', 'amount': '1', 'elements': OrderedDict([('Bi', '2'), ('Sr', '2'), ('Ca', '1'), ('Cu', '2'), ('O', '8')]), 'species': OrderedDict([('Bi', '2'), ('Sr', '2'), ('Ca', '1'), ('Cu', '2'), ('O', '8')])}]}

{'material_string': 'Bi2Sr2CaCu2O8+δ', 
'material_name': '', 
'material_formula': 'Bi2Sr2CaCu2O8', 
'additives': [], 'phase': '', 
'oxygen_deficiency': '+', 
'amounts_x': {}, 
'elements_x': {}, 
'composition': [{'formula': 'Bi2Sr2CaCu2O8', 'amount': '1', 'elements': OrderedDict([('Bi', '2'), ('Sr', '2'), ('Ca', '1'), ('Cu', '2'), ('O', '8')]), 'species': OrderedDict([('Bi', '2'), ('Sr', '2'), ('Ca', '1'), ('Cu', '2'), ('O', '8')])}]}

Specifically, if oxygen_deficiency is '-', the oxygen amount should be read as 8-δ; likewise, '+' means 8+δ.
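
If the caller needs the oxygen amount with δ re-attached, the two fields can be recombined by hand. The following is only a minimal sketch, not part of text2chem: the helper name oxygen_amount_with_delta is hypothetical, it assumes the result dictionary has the shape shown above, and it reads only the first entry of the composition list.

```python
from collections import OrderedDict


def oxygen_amount_with_delta(parsed: dict, symbol: str = "δ") -> str:
    """Hypothetical helper: re-attach the +/- δ term to the oxygen amount.

    Assumes `parsed` looks like the dictionaries shown above and that the
    first entry of parsed['composition'] is the one of interest.
    """
    elements = parsed["composition"][0]["elements"]
    if "O" not in elements:
        raise KeyError("no oxygen found in the composition")
    amount = elements["O"]
    sign = parsed.get("oxygen_deficiency", "")
    # Only '+' and '-' appear in the examples above; any other value
    # (including an empty string) leaves the amount unchanged.
    if sign in ("+", "-"):
        return f"{amount}{sign}{symbol}"
    return amount


# Sample input copied from the parser output shown above (trimmed to the
# fields this helper actually reads).
elements = OrderedDict(
    [("Bi", "2"), ("Sr", "2"), ("Ca", "1"), ("Cu", "2"), ("O", "8")]
)
parsed = {
    "material_string": "Bi2Sr2CaCu2O8+δ",
    "material_formula": "Bi2Sr2CaCu2O8",
    "oxygen_deficiency": "+",
    "composition": [
        {"formula": "Bi2Sr2CaCu2O8", "amount": "1", "elements": elements}
    ],
}

print(oxygen_amount_with_delta(parsed))  # 8+δ
```

Keeping δ out of the composition means the amounts stay plain numeric strings, and the sign is available separately for anyone who needs it.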

lfoppiano commented 9 months ago

Am I understanding well that it's correct that δ is not included in the composition?

GGNoWayBack commented 9 months ago

> Am I understanding well that it's correct that δ is not included in the composition?

Based on the code owner's design and my own understanding, yes, that appears to be the intended behavior.