Fix C++/Bluehawk markup collisions

dacharyc commented 1 year ago

When using Bluehawk with C++, Bluehawk can interpret C++ syntax as the beginning of a Bluehawk markup tag.

Example:

user::state

Bluehawk interprets this as the beginning of the :state: tag.

It would be great if Bluehawk could ignore :: syntax so it doesn't interpret C++ as the beginning of a markup tag.

dacharyc commented 10 months ago

Just ran into this again. Here is another example, which I had to remove from examples/cpp/sync/sync-session.cpp in order to generate the example:

auto connectionState = syncSession->state();
CHECK(connectionState ==
      realm::internal::bridge::sync_session::state::paused);

masukomi commented 10 months ago

The problem ultimately lies in how this regexp in token.ts (line 87) is applied:

const TAG_PATTERN /*      */ = /:([A-z0-9-]+):[^\S\r\n]*/;

This is not limited to the word "state". It will happen with ALL tag keywords that have a -start and -end form, including "snippet" and "block-tag".

Potential Solutions

IF you have a constant list of the tags that use -start, then you could generate a regexp on the fly from that list of strings. It would match : + letters/numbers/hyphens + : UNLESS the captured characters were the name of one of the tags that also use -start.

for example

# this would NOT match because "state" is a known non-line mode keyword (`foo-start` things)
:state:
# this would match because it's NOT a known non-line mode keyword
:remove:
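
A minimal sketch of that idea, assuming a centralized list exists. `BLOCK_TAG_NAMES` and `buildLineTagPattern` are hypothetical names for illustration, not part of Bluehawk; the character class is copied unchanged from the pattern quoted above.

```typescript
// Hypothetical list of tag names that also have -start/-end forms.
const BLOCK_TAG_NAMES: string[] = ["state", "snippet", "block-tag"];

// Escape regexp metacharacters so tag names are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a pattern that matches :name: unless name is in blockTagNames.
function buildLineTagPattern(blockTagNames: string[]): RegExp {
  const tail = "([A-z0-9-]+):[^\\S\\r\\n]*";
  if (blockTagNames.length === 0) {
    return new RegExp(":" + tail);
  }
  const excluded = blockTagNames.map(escapeRegExp).join("|");
  // The negative lookahead rejects a candidate whose whole name is excluded.
  return new RegExp(`:(?!(?:${excluded}):)` + tail);
}

const lineTagPattern = buildLineTagPattern(BLOCK_TAG_NAMES);
```

With this list, `lineTagPattern` rejects `:state:` but still accepts `:remove:` and longer names such as `:stateful:`. It does nothing about `::` namespace separators.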

Without a centralized list like that you could hardcode the regexp to exclude those, but then you'd have the maintenance task to remember to update that whenever you added a new foo-start tag. Someone would inevitably forget so....

Alternatively, you could handle it at the other end of the process and modify validator.ts line 64 to ONLY blow up if tagNode.tagName is in the list of tags that support line mode, and then do... something to undo the assumption that it's dealing with a tag.

Unfortunately, you've still got the problem of either needing a centralized list that can be checked (that may exist) or needing to manually hardcode all the names that should be ignored.

HOWEVER...

The things above only address PART of the problem. If we apply any of those solutions, we still have the problem that many languages use :: as a namespace separator AND that some people put Slack-style emoji in comments (e.g. marketing made me do this :facepalm:).

The regexp at the beginning needs to be modified:

It needs a negative lookbehind so that it never matches ::foo at the start of a potential tag string but does match :foo.
It needs additional restrictions so that valid code containing text that is also a valid tag, like :remove:, isn't interpreted as a tag. I should be able to write a line of code with "foo bar :remove: baz" in it that DOESN'T trigger Bluehawk.
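
The lookbehind part could look roughly like the sketch below. It is the pattern quoted earlier in this thread with `(?<!:)` added in front and nothing else changed; `TAG_PATTERN_WITH_LOOKBEHIND` is a hypothetical name, and it assumes a JavaScript engine with lookbehind support (ES2018 or later).

```typescript
// The pattern as quoted earlier in this thread.
const ORIGINAL_TAG_PATTERN = /:([A-z0-9-]+):[^\S\r\n]*/;

// Same pattern, but the opening colon must not be preceded by another colon.
const TAG_PATTERN_WITH_LOOKBEHIND = /(?<!:):([A-z0-9-]+):[^\S\r\n]*/;

const cppLine = "realm::internal::bridge::sync_session::state::paused";

// The original pattern finds ":internal:" inside the C++ namespace path.
const before = ORIGINAL_TAG_PATTERN.exec(cppLine);

// With the lookbehind, no colon in the namespace path can open a tag.
const after = TAG_PATTERN_WITH_LOOKBEHIND.exec(cppLine);

// A tag written on its own after a comment marker still matches.
const commentTag = TAG_PATTERN_WITH_LOOKBEHIND.exec("// :remove:");
```

This only covers the `::` case. A line such as "foo bar :remove: baz" still matches, so the second restriction is still needed.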

I don't know enough about the potential use cases to know what's "right" here but off the top of my head I'm thinking that the system should be modified to ONLY consider :remove: (and other non-line mode tags) a tag IF it is in a comment or IF it is at the start of a line or IF it is preceded by nothing but whitespace and followed by a newline.
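
A rough sketch of the "nothing but whitespace around it" check, which needs no language knowledge. `findStandaloneTag` is a hypothetical helper, and it assumes the caller has already split the input into lines with the line terminators removed. It does not cover the in-comment case.

```typescript
// Optional horizontal whitespace, one :name: candidate, optional whitespace.
const STANDALONE_TAG_LINE = /^[^\S\r\n]*:([A-z0-9-]+):[^\S\r\n]*$/;

// Return the tag name if the line holds nothing but a tag candidate,
// otherwise null.
function findStandaloneTag(line: string): string | null {
  const match = STANDALONE_TAG_LINE.exec(line);
  return match ? match[1] : null;
}
```

Under this rule "foo bar :remove: baz" and "user::state" are left alone, while a line containing only ":remove:" (with or without indentation) is treated as a tag.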

Testing that it's in a comment complicates things, because then you need knowledge of the comment format(s) of all supported languages and need to know which language is in play when the line is parsed. Unless that's already in here somewhere, it's going to require a fair amount of code to implement and test.
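
To give a sense of the extra machinery, here is a deliberately naive sketch. The `LINE_COMMENT_PREFIXES` table, its language keys, and `isTagInLineComment` are all hypothetical and not taken from Bluehawk. It handles only single-line comment markers and ignores block comments and comment markers that appear inside string literals.

```typescript
// Hypothetical per-language table of single-line comment markers.
const LINE_COMMENT_PREFIXES: Record<string, string[]> = {
  cpp: ["//"],
  ts: ["//"],
  py: ["#"],
};

// True if a known line-comment marker appears before tagIndex on this line.
// Unknown languages are treated as having no comment markers.
function isTagInLineComment(
  line: string,
  languageId: string,
  tagIndex: number
): boolean {
  const prefixes = LINE_COMMENT_PREFIXES[languageId] ?? [];
  return prefixes.some((prefix) => {
    const commentStart = line.indexOf(prefix);
    return commentStart !== -1 && commentStart < tagIndex;
  });
}
```

Even this toy version needs the language passed in from somewhere, which is the knowledge the parser would have to carry.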

mongodb-university / Bluehawk

Fix C++/Bluehawk markup collisions #145
