joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License
94 stars 27 forks

add safeguard when link_text is empty #47

Closed joans closed 1 year ago

joshy commented 1 year ago

Hey, thanks. I can merge the changes after my vacation.

joshy commented 1 year ago

Could you maybe provide a test case as well?

stevengj commented 1 year ago

Why can't you simply end the hyperlink regex with \}{2,3} (instead of \}{2}) in order to match }} or }}}? Then you can remove the hack completely.

joshy commented 1 year ago

I tried but it never worked.

On 2 Jul 2023, at 22:26, Steven G. Johnson wrote: Why can't you simply end the hyperlink regex with \}{2,3} in order to match }} or }}}?


joans commented 1 year ago

Quoting joshy: "I tried but it never worked." (in reply to "Why can't you simply end the hyperlink regex with \}{2,3} in order to match }} or }}}?")

Hey! I found out what the error was and submitted a proper fix for it. The regex would also match a line break character, so the third group of the match would contain only that line break character, which then ended up as an empty string. I have now modified the regex so that it explicitly does not match the line break character.
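As a rough illustration of the kind of change described here (not the actual patterns from this PR, which are not shown in the thread), Python's `\s` matches a line break, while a negated class such as `[^\S\n]` matches whitespace other than `\n`. The names `ANY_WHITESPACE` and `NO_LINE_BREAK` are made up for this sketch.

```python
import re

# Hypothetical illustration only; these are not the patterns used in striprtf.
ANY_WHITESPACE = re.compile(r"\s")      # space, tab, newline, ...
NO_LINE_BREAK = re.compile(r"[^\S\n]")  # whitespace, but never "\n"

# Does each pattern accept a lone line break / a lone space?
newline_hits = (
    bool(ANY_WHITESPACE.fullmatch("\n")),
    bool(NO_LINE_BREAK.fullmatch("\n")),
)
space_hits = (
    bool(ANY_WHITESPACE.fullmatch(" ")),
    bool(NO_LINE_BREAK.fullmatch(" ")),
)

print(newline_hits)
print(space_hits)
```

Only the first pattern accepts the line break, while both accept an ordinary space, so swapping one for the other keeps a line break out of whatever the surrounding pattern captures.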

stevengj commented 1 year ago

I tried [ending the hyperlink regex with \}{2,3}] but it never worked.

In the Julia port my regex is:

const HYPERLINKS = r"(\{\\field\{\s*\\\*\\fldinst\{.*?HYPERLINK\s(\".*?\")\}{2}\s*\{.*?\s+(.*?)\}{2,3})"i

and the {2,3} at the end works fine to match }} or }}} — it passes all of your tests, including the new one in this PR. So I omitted the special _is_hyperlink "ugly hack" entirely. I didn't need the \n filtering by @joans either.
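For comparison, here is a minimal sketch of the same pattern ported to Python's `re` module. It is an assumption that this port is faithful, and it is not necessarily the pattern striprtf itself uses. The two RTF fragments are made up for illustration: one closes the field with `}}`, the other with `}}}`.

```python
import re

# Python port of the Julia pattern quoted above (an assumption, not
# necessarily the exact pattern in striprtf). Julia's trailing `i` flag
# corresponds to re.IGNORECASE.
HYPERLINKS = re.compile(
    r'(\{\\field\{\s*\\\*\\fldinst\{.*?HYPERLINK\s(".*?")\}{2}\s*\{.*?\s+(.*?)\}{2,3})',
    re.IGNORECASE,
)

# Made-up sample fragments, not taken from the striprtf test suite.
two_braces = r'{\field{\*\fldinst{HYPERLINK "http://example.com"}}{\fldrslt example}}'
three_braces = r'{\field{\*\fldinst{HYPERLINK "http://example.com"}}{\fldrslt{\ul example}}}'

results = []
for sample in (two_braces, three_braces):
    match = HYPERLINKS.search(sample)
    # Record whether the whole field was consumed, plus the URL and link text.
    results.append((match.group(0) == sample, match.group(2), match.group(3)))

for row in results:
    print(row)
```

On these two samples the trailing `\}{2,3}` consumes the whole field in both cases, and the third group holds the link text `example`. This suggests Python's `re` treats the quantifier the same way PCRE does here, at least for inputs of this shape.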

Am I missing something? Or do Julia's regular expression semantics (from the PCRE library) differ from Python's somehow?

joshy commented 1 year ago

Hi Steven, thanks for the notice. I just made the change to use your regex to get rid of the "ugly hack".