phHartl / eu-judgement-analyse

Quantitative analysis of judgments of the European Court of Justice
MIT License

Find a suitable regex query to eliminate paragraph numbers #26

Closed phHartl closed 4 years ago

phHartl commented 4 years ago

Best bet atm:

(?i)((?<=[.’']\s)|(?<=\bgrounds\b\s)|(?<=\blaw\b\s)|(?<=\bjudgment\b\s)|(?<=\bpreliminary ruling\b\s)|(?<=\bcosts\b\s))(\d{1,3}\.*\s)(?=[A-Z])

https://regex101.com/r/2AgRRW/1
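
A minimal sketch of how this pattern could be applied in Python, assuming the `re` module is used for the clean-up. The sample sentence and the function name `strip_numbers` are invented for illustration and do not come from the project. One thing to keep in mind: with `(?i)` active, Python's `re` lets `[A-Z]` match lowercase letters as well, so the lookahead no longer restricts matches to capitalised sentence starts.

```python
import re

# Pattern copied from the comment above. Each lookbehind branch has a fixed
# width, which Python's re module requires.
PARAGRAPH_NUMBER = re.compile(
    r"(?i)((?<=[.’']\s)|(?<=\bgrounds\b\s)|(?<=\blaw\b\s)|(?<=\bjudgment\b\s)"
    r"|(?<=\bpreliminary ruling\b\s)|(?<=\bcosts\b\s))(\d{1,3}\.*\s)(?=[A-Z])"
)


def strip_numbers(text: str) -> str:
    """Remove paragraph numbers matched by the pattern (hypothetical helper)."""
    # Lookarounds are zero-width, so only the digits, dots and the trailing
    # whitespace character are replaced.
    return PARAGRAPH_NUMBER.sub("", text)


# Invented sample text, not taken from a real judgment.
sample = (
    "Grounds 1 By order of 3 May 2001, the court referred a question. "
    "2 The question was raised in proceedings. "
    "Costs 15 The costs incurred are not recoverable."
)
cleaned = strip_numbers(sample)
print(cleaned)
```

The date in "3 May 2001" is left alone because neither a section keyword nor a full stop precedes it.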

thomfischer commented 4 years ago

Since the given example does not cover every special case of paragraph numbers not being preceded by a full stop, a generic regular expression should be used. It improves the consistency of the data while also increasing performance, at the cost of a negligible increase in paragraph numbers remaining in the text.

Proposal: (?<=[.’']\s)(\d+\.*\s)(?=[A-Z])

Translation: Match one or more digits, followed by optional dots and a whitespace character. The match must be followed by a capital letter and preceded by a full stop or a single quote (straight or typographic) followed by a whitespace character.
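
As a rough illustration of the trade-off described above, the proposal could be applied like this in Python. The sample text and the name `strip_paragraph_numbers` are assumptions made for the example, not part of the project.

```python
import re

# Generic proposal from the comment above. No (?i) flag is set here, so the
# lookahead [A-Z] only accepts capital letters.
GENERIC_PARAGRAPH_NUMBER = re.compile(r"(?<=[.’']\s)(\d+\.*\s)(?=[A-Z])")


def strip_paragraph_numbers(text: str) -> str:
    """Remove numbers that follow a full stop or single quote (hypothetical helper)."""
    return GENERIC_PARAGRAPH_NUMBER.sub("", text)


# Invented sample text, not taken from a real judgment.
sample = (
    "The appeal was dismissed. 12 The applicant then brought an action. "
    "13. In support, it relied on Article 5. Costs 14 The costs are reserved."
)
result = strip_paragraph_numbers(sample)
print(result)
```

In this sample "12" and "13." are removed, while "14" stays because it follows the heading "Costs" rather than a full stop. That is the kind of remaining paragraph number the generic expression accepts in exchange for a simpler pattern.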