hexh250786313 / blog

hexh's blog
https://github.com/hexh250786313/blog/issues

Perl regex: lookbehind assertion too long when using greedy quantifiers #25

Open hexh250786313 opened 2 years ago

hexh250786313 commented 2 years ago
![pic](https://dev.azure.com/hexuhua/f6126346-6e87-4d62-aa80-ff9b88293af0/_apis/git/repositories/ebd79495-5cbb-4565-8573-fa73ee451b5e/items?path=/github.com/hexh250786313/blog/25/2022-09-23_12-17.png&versionDescriptor%5BversionOptions%5D=0&versionDescriptor%5BversionType%5D=0&versionDescriptor%5Bversion%5D=main&resolveLfs=true&%24format=octetStream&api-version=5.0)
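Recently, while writing some text-processing scripts, I ran into the Perl error "Lookbehind longer than 255 not implemented in regex". It is not a big problem and the answer can be found on StackOverflow, but I could not find anything about it on the Chinese-language web, so here is a short note.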

Related

Background

My script calls perl to do text processing, and one of its regexes uses a lookbehind assertion:

perl -0777 -i -pe "s/(?<!(.*\\S.*))name:.*/name:\ 'hexh'/gi" ./test.txt

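The goal is to change the value of every name key that has no non-whitespace character in front of it on its line to hexh, for example: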

// source:
  name: 'hexuhua'
  fullname: 'hexuhua'

// expected:
  name: 'hexh'
  fullname: 'hexuhua'

But it fails with an error:

Lookbehind longer than 255 not implemented in regex m/(?<!(.*\S.*))name:.*/ at -e line 1.

Solution

According to this blog post: http://blogs.perl.org/users/tom_wyant/2019/03/native-variable-length-lookbehind.html

Now, there is at least one restriction. No lookbehind assertion can be more than 255 characters long. This limit has been around, as nearly as I can tell, ever since lookaround assertions were introduced in 5.005. But it has been lightly documented until now. This restriction means you can not use quantifiers * or +. But bracketed quantifiers are OK, as is ?.

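In other words: no lookbehind assertion may match more than 255 characters, and this limit has existed ever since lookaround assertions were introduced in 5.005. It rules out unbounded quantifiers such as .* or .+ inside a lookbehind. Note that the issue is the missing upper bound, not greediness, so the non-greedy .*? is rejected as well. Brace quantifiers with an upper bound within 255 characters, and the ? quantifier (as in .?), are fine.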

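So for the regex above, s/(?<!(.*\\S.*))name:.*/name:\ 'hexh'/gi, the problem lies in the lookbehind (?<!(.*\\S.*)): Perl requires that a lookbehind contain no unbounded quantifiers and that it match at most 255 characters in total.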

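Based on that, the lookbehind can be rewritten along these lines: (?<!(.{0,127}\\S.{0,127})), which keeps its maximum length at 127 + 1 + 127 = 255 characters.

The final command: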

perl -0777 -i -pe "s/(?<!(.{0,127}\\S.{0,127}))name:.*/name:\ 'hexh'/gi" ./test.txt

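Worth mentioning

It is worth mentioning that variable-length lookbehind containing a capturing group is still an experimental feature in Perl. Every time such a lookbehind is used, perl prints a warning like this: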

Variable length negative lookbehind with capturing is experimental in regex;