onetrueawk / awk

One true awk

Add RT to complement RS being a regex. #224

Closed caffe3 closed 6 months ago

caffe3 commented 6 months ago

Since RS can be defined as a regular expression, it is difficult to use RS that way and still keep the input verbatim, because the characters that match RS are effectively deleted from the input.

This pull request adds RT, a common extension present in busybox, gawk and goawk, which provides the text that matched RS for the current input record.
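
To make the problem concrete, here is a minimal C sketch of splitting input on a regular-expression separator while remembering what the separator matched. It is an illustration only: `next_record` is a hypothetical helper, not a function from this repository, and it uses POSIX `regex.h` for brevity rather than awk's own matcher. It also assumes the pattern was compiled without `REG_NOSUB` and treats an empty match as "no separator", which is a simplification.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not onetrueawk code): find the next record in `in`.
 * On return the record is in[0 .. *rec_len) and the separator text, i.e.
 * what RT would hold, is in[*rec_len .. *rec_len + *rt_len).
 * Returns the number of bytes consumed (record plus separator).
 */
size_t next_record(const regex_t *rs, const char *in,
                   size_t *rec_len, size_t *rt_len)
{
    regmatch_t m;

    assert(rs != NULL && in != NULL && rec_len != NULL && rt_len != NULL);

    if (regexec(rs, in, 1, &m, 0) == 0 && m.rm_eo > m.rm_so) {
        /* Separator found: the record ends where the match starts. */
        *rec_len = (size_t)m.rm_so;
        *rt_len = (size_t)(m.rm_eo - m.rm_so);
    } else {
        /* No (non-empty) separator left: the rest is the last record. */
        *rec_len = strlen(in);
        *rt_len = 0;
    }
    return *rec_len + *rt_len;
}
```

Without the `rt_len` output, a caller only ever sees the record itself; the bytes between `rec_len` and the returned offset are skipped, which is why the original input cannot be reproduced when RS is a regex.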

caffe3 commented 6 months ago

While doing some additional testing, I found that I had overlooked the case where RS="".

Also I noticed some adjbuf use-after-free problems. I'll do a separate PR for that before revisiting this.

plan9 commented 6 months ago

I don't recall this feature ever being discussed. It is not a part of OTA, and it is not in the second edition. I'm not sure I'm happy with a sudden feature addition, made without consulting first, just because some other implementations happen to have it.

plan9 commented 6 months ago

let me be more explicit. this will not make it to awk at this time.

caffe3 commented 6 months ago

My apologies - I intended this PR as an initial discussion (hence the draft) rather than something that I wanted to be included at all cost. I'll communicate that properly next time.

arnoldrobbins commented 6 months ago

Hi. I won't comment here about whether or not RT should be added to OTA. Speaking from experience, this patch won't quite do the trick. In particular you need to handle the default case where RS = "\n" and also RS = "". For the former, you should not reset RT each time a record is read. That will drive your performance into the ground. For the latter, RT will be a string of two or more newlines. You should check the length of the current record separator against the previous one and only reset RT if the new one is longer. If it's shorter, you can just truncate the string in the old one. I hope this helps.
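
The update strategy described in the comment above can be sketched in C roughly as follows. This is a hedged illustration, not code from gawk or onetrueawk: `struct rt_cache` and `rt_update` are hypothetical names, and the sketch assumes RT is kept as a plain NUL-terminated buffer. When the new separator is a prefix of the stored one (the unchanged default "\n", or a shorter run of newlines in paragraph mode), it only truncates; it copies, and reallocates only if the buffer is too small, in every other case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for RT's string value (not a onetrueawk structure). */
struct rt_cache {
    char  *buf;  /* NUL-terminated text of RT, or NULL before first use */
    size_t len;  /* current length of RT */
    size_t cap;  /* allocated size of buf, including room for the NUL */
};

/*
 * Set RT to the n bytes at `sep`.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
int rt_update(struct rt_cache *rt, const char *sep, size_t n)
{
    assert(rt != NULL && sep != NULL);

    /*
     * Cheap path: the new separator is a prefix of the stored one.
     * Covers RS = "\n" (nothing changes) and a shorter newline run
     * when RS = "" (just cut the old string down).
     */
    if (rt->buf != NULL && n <= rt->len && memcmp(rt->buf, sep, n) == 0) {
        rt->buf[n] = '\0';
        rt->len = n;
        return 0;
    }

    /* Grow the buffer only when the new separator does not fit. */
    if (n + 1 > rt->cap) {
        char *p = realloc(rt->buf, n + 1);
        if (p == NULL)
            return -1;
        rt->buf = p;
        rt->cap = n + 1;
    }

    memcpy(rt->buf, sep, n);
    rt->buf[n] = '\0';
    rt->len = n;
    return 0;
}
```

A caller would start from a zero-initialized `struct rt_cache`, call `rt_update` once per record with the matched separator, and `free` the buffer at exit. With the default RS the per-record cost is a one-byte comparison, with no allocation or copy.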