lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Lark can't match µ character even though it is defined in the input #1478

Open adi12345-crypto opened 6 days ago

adi12345-crypto commented 6 days ago

I am trying to write a Lark parser that extracts parts of text of the form "number unit", e.g. 10 grams or 10gm. When I try to parse the input "10 μl", I get the error:

`10 μl ^ Expected one of: None`

My token VOLUME_MCL is defined as:

`VOLUME_MCL: "mikroliter" | "microl" | "mcl" | "µl" | "ul"`
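One thing that may be worth checking: the μ in the input and the µ in the grammar look alike, but as rendered in this post they appear to be two different code points (Greek small letter mu, U+03BC, in the input and the micro sign, U+00B5, in the grammar). The sketch below assumes that is the case; the escapes are my reading of the post, and `normalize_units` is a hypothetical helper, not part of Lark. It prints the code points and shows that NFKC normalization maps the micro sign onto the Greek letter.

```python
import unicodedata

# Assumption: these escapes mirror the characters as they appear in the post.
input_text = "10 \u03bcl"     # the input "10 μl" (GREEK SMALL LETTER MU)
grammar_literal = "\u00b5l"   # the "µl" alternative of VOLUME_MCL (MICRO SIGN)

# Print the code point and Unicode name of each mu-like character.
for ch in (input_text[3], grammar_literal[0]):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")


def normalize_units(text: str) -> str:
    """Hypothetical helper: fold compatibility characters with NFKC.

    NFKC maps MICRO SIGN (U+00B5) to GREEK SMALL LETTER MU (U+03BC).
    It also rewrites other compatibility characters (for example
    superscript digits), so it may be too broad for some inputs.
    """
    return unicodedata.normalize("NFKC", text)


# After normalization both spellings use the same code point.
print(normalize_units(grammar_literal) == normalize_units(input_text)[3:])
```

If the code points really do differ, either normalizing the input and the grammar literals the same way before parsing, or listing both spellings in `VOLUME_MCL`, should let the terminal match. Running `ord()` on the actual input character is a quick way to confirm.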

adi12345-crypto commented 6 days ago

For reference here is the grammar:

// define rules for strength and strength ranges when they appear alone and with additional text
start: strength_only | strength_range_only | strength_additional_text | strength_range_additional_text | concatenated_strengths_only | words_only

// define structure for rules with no additional text
strength_only.2: number unit | number unit? separator number? unit
concatenated_strengths_only.2: strength_only concatenator strength_only
strength_range_only.2: strength_only range_separator strength_only | number range_separator number unit
// define structure for rules with additional text
strength_additional_text: (skippable strength_only skippable)+
strength_range_additional_text: (skippable strength_range_only skippable)+
// define structure for words only
words_only: extended_word+

// define non-terminals units as all possible unit rules then for each unit rule specify units
unit: base_unit | base_unit separator base_unit
base_unit.2: weight_unit | limit_of_flocculation_unit | volume_unit | length_unit | mole_unit | iu_unit | percent_unit | time_unit | percent_weight_volume | dosis_unit | colony_forming_unit | kallikrein_inactivator_unit | plaque_forming_unit | becquerel_unit | cell_unit | area_unit | curie_unit | tissue_culture_infectious_dose_unit | hahnemannian_unit | parts_per_million_unit | vector_genome_unit | anti_xa_unit | cell_culture_infectious_dose_unit | equivalents_unit | d_antegen_unit | elisa_unit | allergan_unit | effective_response_unit | relative_potency_unit | tcp_unit | haemagglutination_inhibition_unit | oocyst_unit | high_activation_unit | antigenic_unit | tuberculin_unit | antibody_micro_agglutination_lytic_reaction_unit | speywood_unit | galactosidase_unit | pulsation_unit | katal_unit | germs_unit | kallidinogenase_inactivator_unit | usp_unit | homeopathic_potency_unit | bioequivalent_allergy_unit | fluorescent_focus_unit | protein_nitrogen_unit

// define all weight units
weight_unit: WEIGHT_TON | WEIGHT_KGS | WEIGHT_LBS | WEIGHT_GM | WEIGHT_GM_BASE | WEIGHT_MG | WEIGHT_MG_BASE | WEIGHT_MCG | WEIGHT_MCG_BASE | WEIGHT_NG | WEIGHT_PG
WEIGHT_TON: "tonelada" | "tons" | "ton" | "tne" | "mts"
WEIGHT_KGS: "kilograms" | "kilogramm" | "Kilogramm" | "kilogram" | "kilo" | "kgs" | "kgm" | "kga" | "kg"
WEIGHT_LBS: "lbs"
WEIGHT_GM: "gramm" | "grams" | "gramo" | "gram" | "gms" | "gm" | "g." | "g"
WEIGHT_GM_BASE: "g base"
WEIGHT_MG: "milligramm" | "mili.gram" | "miligramo" | "milig" | "mgs" | "mg"
WEIGHT_MG_BASE: "mg base"
WEIGHT_MCG: "micrpgrammes" | "microgrammes" | "microgramme" | "mikrogramm" | "microgramo" | "mikrograma" | "mikrogramów" | "microgram" | "microg" | "mcg" | "µg" | "ug"
WEIGHT_MCG_BASE: "mcg base"
WEIGHT_NG: "nanogramm" | "ng"
WEIGHT_PG: "picogramm"

// define all volume units
volume_unit: VOLUME_GAL | VOLUME_L | VOLUME_DROP | VOLUME_ML | VOLUME_MCL
VOLUME_GAL: "gal"
VOLUME_L: "liter" | "l"
VOLUME_DROP: "drop"
VOLUME_ML: "millilitre" | "milliliter" | "ml"
VOLUME_MCL: "mikroliter" | "microl" | "mcl" | "µl" | "ul"

// define all length units
length_unit: LENGTH_MM | LENGTH_CM
LENGTH_MM: "millimeter" | "mm"
LENGTH_CM: "zentimeter" | "cm"

// define all mole units
mole_unit: MOLE_NMOL | MOLE_MMOL | MOLE_MCMOL | MOLE_MOL
MOLE_NMOL: "nanomol" | "nmol"
MOLE_MMOL: "millimol" | "mmole" | "mmol"
MOLE_MCMOL: "micromol" | "mcmol" | "µmol"
MOLE_MOL: "mole" | "mol"

// define all iu units
iu_unit: IU_IU | IU_KIU | IU_MIU
IU_IU: "internationale einheit(en)" | "internationale einheit" | "pressor units" | "unités" | "u.i." | "i.u." | "i.e." | "unit" | "u." | "iu" | "[iu]" | "j.m." | "ie" | "ui" | "u"
IU_KIU: "kilo UI" | "kiu"
IU_MIU: "million international units (iu)" | "millones ui" | "millions ui" | "million i.e." | "mill. ui" | "million u" | "m.ui" | "miu" | "mu"

// define all percent units
percent_unit: PERCENT
PERCENT: "porcentaje" | "porciento" | "%"

// define all time units
time_unit: TIME_MIN | TIME_HOUR | TIME_DAY
TIME_MIN: "min"
TIME_HOUR: "hour" | "h"
TIME_DAY: "day"

// define all percent weight volume units
percent_weight_volume: PERCENT_WEIGHT_PER_WEIGHT | PERCENT_WEIGHT_PER_VOLUME | PERCENT_VOLUME_PER_VOLUME | PERCENT_VOLUME_PER_WEIGHT
PERCENT_WEIGHT_PER_WEIGHT: "% (w/w)" | "% w/w" | "% / w/w" | "%w/w" | "porcentaje peso/peso"
PERCENT_WEIGHT_PER_VOLUME: "% / w/v" | "% w/v" | "% (w/v)"
PERCENT_VOLUME_PER_VOLUME: "prozentgehalt volumen in volumen" | "% / v/v" | "% v/v" | "% (v/v)"
PERCENT_VOLUME_PER_WEIGHT: "Prozentgehalt Volumen in Masse"

// define all dosis units
dosis_unit: DOSIS_DOS | DOSIS_VIAL | DOSIS_BAG | DOSIS_BOTTLE | DOSIS_SACHET | DOSIS_SYRINGE | DOSIS_CONTAINER | DOSIS_SRT | DOSIS_ACT | DOSIS_GUM | DOSIS_BLISTER | DOSIS_STRIP | DOSIS_CAPSULE | DOSIS_KIT | DOSIS_TABLET | DOSIS_PACK | DOSIS_LOZENGE | DOSIS_CARTRIDGE | DOSIS_PIECE | DOSIS_GENERATOR | DOSIS_CYLINDER
DOSIS_DOS: "dawkę odmierzoną" | "dos(es)" | "dose(s)" | "dosis" | "dose" | "dos"
DOSIS_VIAL: "flacon" | "fiolkę" | "vial"
DOSIS_BAG: "bag" | "zak"
DOSIS_BOTTLE: "bottle"
DOSIS_SACHET: "sachet"
DOSIS_SYRINGE: "syr"
DOSIS_CONTAINER: "container"
DOSIS_SRT: "srt"
DOSIS_ACT: "act"
DOSIS_GUM: "gum"
DOSIS_BLISTER: "blister"
DOSIS_STRIP: "strip" | "pasek"
DOSIS_CAPSULE: "capsules" | "capsule" | "cap"
DOSIS_KIT: "kit"
DOSIS_TABLET: "tablet" | "tab"
DOSIS_PACK: "pck"
DOSIS_LOZENGE: "loz"
DOSIS_CARTRIDGE: "cartridge"
DOSIS_PIECE: "stück" | "piece" | "stuk"
DOSIS_GENERATOR: "gen"
DOSIS_CYLINDER: "cylr"

// list all colony forming units
colony_forming_unit: COLONY_FORMING_UNIT | LOG_COLONY_FORMING_UNIT | MILLION_COLONY_FORMING_UNIT | BILLION_COLONY_FORMING_UNIT
COLONY_FORMING_UNIT: "koloniebildende einheit(en)" | "cfu" | "[cfu]"
LOG_COLONY_FORMING_UNIT: "log10 cfu"
MILLION_COLONY_FORMING_UNIT: "millionkeime" | "million cfu"
BILLION_COLONY_FORMING_UNIT: "b"

// list all kallikrein inactivator units
kallikrein_inactivator_unit: KALLIKREIN_INACTIVATOR_UNIT
KALLIKREIN_INACTIVATOR_UNIT: "kallikrein-inhibitor-einheit" | "kui"

// list all plaque forming units
plaque_forming_unit: PLAQUE_FORMING_UNIT | LOG_PLAQUE_FORMING_UNIT
PLAQUE_FORMING_UNIT: "unidades formadoras de placa (ufp)" | "pfu" | "[pfu]"
LOG_PLAQUE_FORMING_UNIT: "log10 pfu"

// list all becquerel unit
becquerel_unit: BECQUEREL | KILO_BECQUEREL | MEGA_BECQUEREL | GIGA_BECQUEREL
BECQUEREL: "becquerel"
KILO_BECQUEREL: "kilobequerelio" | "kbq"
MEGA_BECQUEREL: "megabecquerelio" | "megabecquerel" | "mbq"
GIGA_BECQUEREL: "gigabecquerelios" | "gigabecquerel" | "gbq"

// list all cell unit
cell_unit: CELL | MILLION_CELL
CELL: "celulas" | "komorek" | "cellen" | "cells"
MILLION_CELL: "millions de cellules" | "million cells" | "mln komórek"

// list all area unit
area_unit: AREA_CM_SQ
AREA_CM_SQ: "quadratzentimeter" | "sq cm" | "cm2"

// list all curie unit
curie_unit: CURIE | MILICURIE | MICROCURIE
CURIE: "ci"
MILICURIE: "mci" | "millicurie"
MICROCURIE: "mcci" | "mikrocurie"

// list all tissue culture infectious dose unit
tissue_culture_infectious_dose_unit: TISSUE_CULTURE_INFECTIOUS_DOSE | LOG_TISSUE_CULTURE_INFECTIOUS_DOSE
TISSUE_CULTURE_INFECTIOUS_DOSE: "gewebekultur-infektiöse-dosis 50%" | "tcid50" | "[tcid_50]"
LOG_TISSUE_CULTURE_INFECTIOUS_DOSE: "log10 tcid50"

// list all hahnemannian units
hahnemannian_unit: DECIMAL_HAHNEMANNIAN
DECIMAL_HAHNEMANNIAN: "dh"

// list all ppm units
parts_per_million_unit: PARTS_PER_MILLION_UNIT
PARTS_PER_MILLION_UNIT: "ppm" | "[ppm]"

// list all vector_genome units
vector_genome_unit: VECTOR_GENOME_UNIT
VECTOR_GENOME_UNIT: "vg"

// list all anti xa unit
anti_xa_unit: ANTI_XA_UNIT
ANTI_XA_UNIT: "anti-blutgerinnungsfaktor xa aktivität" | "anti xa units" | "u.i. antixa" | "ui anti-xa" | "ul anti-xa" | "anti-xa iu" | "anti-xa ui" | "ui antixa" | "anti-xa" | "j.m. a.xa" | "unidades antigenicas"

// list all cell_culture_infectious_dose unit
cell_culture_infectious_dose_unit: CELL_CULTURE_INFECTIOUS_DOSE | LOG_CELL_CULTURE_INFECTIOUS_DOSE
CELL_CULTURE_INFECTIOUS_DOSE: "ccid50" | "cid50" | "[ccid_50]"
LOG_CELL_CULTURE_INFECTIOUS_DOSE: "log10 ccid50" | "log10 cid50"

// list all equivalents unit
equivalents_unit: EQUIVALENTS_UNIT | MILI_EQUIVALENTS_UNIT
EQUIVALENTS_UNIT: "eq"
MILI_EQUIVALENTS_UNIT: "meq"

// list all d antegen units
d_antegen_unit: D_ANTEGEN_UNIT
D_ANTEGEN_UNIT: "unités antigènes d" | "D-UNITS" | "[d'ag'u]" | "d-au" | "du"

// list all elisa unit
elisa_unit: ELISA_UNIT | LOG_ELISA_UNIT
ELISA_UNIT: "elisa unit" | "elisa u" | "u elisa"
LOG_ELISA_UNIT: "log10 elisa u"

// list all allergan unit
allergan_unit: ALLERGAN_UNIT
ALLERGAN_UNIT: "allergan-einheit" | "allergen-einheit" | "alergen units" | "allergan unit" | "Allergan units" | "[au]" | "au"

// list all effective response
effective_response_unit: EFFECTIVE_RESPONSE_25 | EFFECTIVE_RESPONSE_50 | EFFECTIVE_RESPONSE_60 | EFFECTIVE_RESPONSE_70 | EFFECTIVE_RESPONSE_120
EFFECTIVE_RESPONSE_25: "% er25"
EFFECTIVE_RESPONSE_50: "% er50"
EFFECTIVE_RESPONSE_60: "% er60"
EFFECTIVE_RESPONSE_70: "% er70"
EFFECTIVE_RESPONSE_120: "% er120"

// list all relative potency unit
relative_potency_unit: RELATIVE_POTENCY_UNIT
RELATIVE_POTENCY_UNIT: "rp"

// list all tcp unit
tcp_unit: TCP_UNIT
TCP_UNIT: "tcp units"

// list all haemagglutination_inhibition_unit
haemagglutination_inhibition_unit: HAEMAGGLUTINATION_INHIBITION_UNIT | LOG_HAEMAGGLUTINATION_INHIBITION_UNIT
HAEMAGGLUTINATION_INHIBITION_UNIT: "hai" | "hiu"
LOG_HAEMAGGLUTINATION_INHIBITION_UNIT: "log10 hai" | "log10 hi" | "log10 hiu"

// list all oocysts unit
oocyst_unit: OOCYST_UNIT | SPORULATED_OOCYST_UNIT
OOCYST_UNIT: "oocysts"
SPORULATED_OOCYST_UNIT: "sporulated oocysts"

// list all high activation unit
high_activation_unit: HIGH_ACTIVATION_UNIT
HIGH_ACTIVATION_UNIT: "hau"

// list all antigenic unit
antigenic_unit: ANTIGENIC_UNIT
ANTIGENIC_UNIT: "antigenic units" | "unidades antigénicas"

// list all tuberculin_units
tuberculin_unit: TUBERCULIN_UNIT
TUBERCULIN_UNIT: "tu"

// list all antibody_micro_agglutination_lytic_reaction unit
antibody_micro_agglutination_lytic_reaction_unit: ANTIBODY_MICRO_AGGLUTINATION_LYTIC_REACTION_UNIT
ANTIBODY_MICRO_AGGLUTINATION_LYTIC_REACTION_UNIT: "alr"

// list all speywood unit
speywood_unit: SPEYWOOD_UNIT
SPEYWOOD_UNIT: "unités speywood" | "speywood units"

// list all galactosidase_units
galactosidase_unit: GALACTOSIDASE_UNIT
GALACTOSIDASE_UNIT: "galu"

// list all pulsation units
pulsation_unit: PULSATION_UNIT
PULSATION_UNIT: "pulsación"

// list all katal unit
katal_unit: KATAL | MICROKATAL
KATAL: "katal" | "katals"
MICROKATAL: "microkatals" | "mikrokatal"

// list all limit_of_flocculation_unit
limit_of_flocculation_unit: LIMIT_OF_FLOCCULATION_UNIT
LIMIT_OF_FLOCCULATION_UNIT: "lf"

// list all germs unit
germs_unit: GERMS_UNIT
GERMS_UNIT: "keime"

// list all kallidinogenase_inactivator_unit
kallidinogenase_inactivator_unit: KALLIDINOGENASE_INACTIVATOR_UNIT
KALLIDINOGENASE_INACTIVATOR_UNIT: "kallidinogenase-inaktivator-einheit"

// list all usp unit
usp_unit: USP_UNIT
USP_UNIT: "usp-einheiten" | "[usp'u]" | "[usp]"

homeopathic_potency_unit: HOMEOPATHIC_POTENCY_UNIT | HOMEOPATHIC_POTENCY_X_UNIT | HOMEOPATHIC_POTENCY_C_UNIT | HOMEOPATHIC_POTENCY_M_UNIT | HOMEOPATHIC_POTENCY_Q_UNIT
HOMEOPATHIC_POTENCY_UNIT: "hp"
HOMEOPATHIC_POTENCY_X_UNIT: "[hp_x]"
HOMEOPATHIC_POTENCY_C_UNIT: "[hp_c]"
HOMEOPATHIC_POTENCY_M_UNIT: "[hp_m]"
HOMEOPATHIC_POTENCY_Q_UNIT: "[hp_q]"

// list all bioequivalent_allergy_units
bioequivalent_allergy_unit: BIOEQUIVALENT_ALLERGY_UNIT
BIOEQUIVALENT_ALLERGY_UNIT: "[bau]"

// list all fluorescent_focus_units
fluorescent_focus_unit: FLUORESCENT_FOCUS_UNIT
FLUORESCENT_FOCUS_UNIT: "[ffu]" | "ffu"

// list all protein_nitrogen_units
protein_nitrogen_unit: PROTEIN_NITROGEN_UNIT
PROTEIN_NITROGEN_UNIT: "[pnu]" | "pnu"

// list all helper terminals used to define strength rules
not_important: extended_word | ignorable_special_chars
skippable: not_important+
spaces: SPACES
extended_word: spaces EXTENDED_WORD spaces
ignorable_special_chars: spaces IGNORABLE_SPECIAL_CHARS spaces
number.2: spaces NUMBER spaces
separator.2: spaces SEPARATOR spaces
concatenator.2: spaces CONCATENATOR spaces
range_separator.2: spaces RANGE_SEPARATOR spaces*

// define general terminals
EXTENDED_WORD: /[a-zA-ZöäüéÖÄÜß]+/
IGNORABLE_SPECIAL_CHARS: /[,.;:\-\[\]()+&%*]/
SEPARATOR: "in" | "per" | "\\" | "/" | "per dose of"
CONCATENATOR: "+" | ","
RANGE_SEPARATOR: "-" | "–" | "to"
NUMBER: /\d+([ \s.,]\d)|[.,]?\d+([eE^xX][-]?\d+)?/
SPACES: /\s+/

Any tips on improving the grammar are also appreciated :)
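On the request for tips, one low-effort check is to exercise individual terminal patterns outside the grammar. The sketch below runs the NUMBER pattern, copied as posted, through Python's `re` module against a few sample strings chosen here for illustration. It only tests the pattern in isolation: how Lark actually tokenizes the input depends on the parser and lexer in use, and if the posted pattern lost characters when the issue was rendered, the results will differ for the real grammar.

```python
import re

# NUMBER pattern copied from the grammar as posted.
NUMBER = re.compile(r"\d+([ \s.,]\d)|[.,]?\d+([eE^xX][-]?\d+)?")

# Illustrative samples (not taken from the post's data).
samples = ["10", "1,5", "1,50", "1 000", ".5", "1e-3"]

for s in samples:
    # fullmatch asks whether the whole string is a single NUMBER.
    verdict = "full match" if NUMBER.fullmatch(s) else "no full match"
    print(f"{s!r}: {verdict}")
```

With the pattern as written, the first alternative accepts only a single digit after the separator, so strings such as "1,50" or "1 000" are not matched as one number. If multi-digit fractions or grouped thousands are expected in the data, that alternative may need a repetition.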