toumorokoshi / tome

convert any directory of scripts into your own fully-featured command.
MIT License

support other comment types when scanning scripts #22

Open toumorokoshi opened 2 years ago

toumorokoshi commented 2 years ago

See https://github.com/toumorokoshi/tome/pull/5/files#r606878773 for context.

We may want to support other common comment prefixes used in scripting languages. Currently tome only supports hash (`#`) comments, which covers bash, python, ruby, and perl.

toumorokoshi commented 2 years ago

I'll remove this from the 1.0 milestone: it can be added later if someone wants to try it.

toumorokoshi commented 2 years ago

The work should likely be done at:

`src/script.rs:66`: `} else if line.starts_with("# COMPLETION") {`

perhaps replacing these with regex matches. Note the edge case of two-character comment markers like Lua's `--`.
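As a rough illustration of that direction, a sketch that accepts any of a small set of known comment prefixes before a directive keyword (names here are illustrative, not tome's actual code):

```rust
// Hypothetical sketch: instead of hard-coding "# COMPLETION", accept
// any of a small set of known single-line comment prefixes before the
// directive keyword. This also covers two-character markers like "--".
const COMMENT_PREFIXES: &[&str] = &["#", "//", "--", ";"];

fn is_directive(line: &str, directive: &str) -> bool {
    let trimmed = line.trim_start();
    COMMENT_PREFIXES.iter().any(|prefix| {
        trimmed
            .strip_prefix(prefix)
            .map(|rest| rest.trim_start().starts_with(directive))
            .unwrap_or(false)
    })
}

fn main() {
    assert!(is_directive("# COMPLETION", "COMPLETION")); // bash/python/ruby
    assert!(is_directive("-- COMPLETION", "COMPLETION")); // lua
    assert!(!is_directive("echo COMPLETION", "COMPLETION"));
}
```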

zph commented 6 months ago

Reason for wanting this:

I've been thinking about this implementation and think it could catch the majority of cases with a small lookup table that selects the commenting mechanism based on file type.

i.e., we could predictably look up the comment character(s) based on file extension, combined with a fingerprinting of the file content.

A naive approach would be to select known filename extensions as a lookup:

# extension -> CommentMode(single_line_comment_chars, start_comment_chars, end_comment_chars)
.py | .rb -> CommentMode("#", nil, nil)
.ts | .js -> CommentMode("//", "/*", "*/")
.sh | .bash -> CommentMode("#", nil, nil)

That will cover many cases and can be extended to cover known common types.
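A minimal Rust sketch of the table above; `CommentMode` and `comment_mode` are hypothetical names, not tome's API:

```rust
// Hypothetical lookup from file extension to comment delimiters.
#[derive(Debug, PartialEq)]
struct CommentMode {
    line: &'static str,
    block_start: Option<&'static str>,
    block_end: Option<&'static str>,
}

fn comment_mode(extension: &str) -> Option<CommentMode> {
    match extension {
        "py" | "rb" | "sh" | "bash" => Some(CommentMode {
            line: "#",
            block_start: None,
            block_end: None,
        }),
        "ts" | "js" => Some(CommentMode {
            line: "//",
            block_start: Some("/*"),
            block_end: Some("*/"),
        }),
        // Unknown extensions fall through to the shebang/heuristic fallbacks.
        _ => None,
    }
}

fn main() {
    assert_eq!(comment_mode("py").unwrap().line, "#");
    assert_eq!(comment_mode("ts").unwrap().block_end, Some("*/"));
    assert!(comment_mode("xyz").is_none());
}
```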

As a fallback, we could write a very simple parser that grabs the first line of the file and, if it's a hashbang line, parses out the interpreter.

Have a second mapping table indexed on the file's interpreter.
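The shebang fallback could look something like this sketch (the function name is illustrative); it extracts the interpreter basename and unwraps the common `/usr/bin/env <prog>` form:

```rust
// Hypothetical shebang parser: if the first line is a hashbang,
// return the interpreter name, handling "/usr/bin/env <prog>".
fn interpreter_from_shebang(first_line: &str) -> Option<String> {
    let rest = first_line.strip_prefix("#!")?;
    let mut parts = rest.split_whitespace();
    let program = parts.next()?;
    // Take the basename of the interpreter path.
    let name = program.rsplit('/').next()?;
    if name == "env" {
        // "#!/usr/bin/env python3" -> interpreter is the next word.
        parts.next().map(|s| s.to_string())
    } else {
        Some(name.to_string())
    }
}

fn main() {
    assert_eq!(interpreter_from_shebang("#!/bin/bash"), Some("bash".to_string()));
    assert_eq!(
        interpreter_from_shebang("#!/usr/bin/env python3"),
        Some("python3".to_string())
    );
    assert_eq!(interpreter_from_shebang("print('hi')"), None);
}
```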

In case it can't be determined through either of those means, fall back to a default, or provide a best guess by parsing a few initial lines for common line prefixes. As in:

Pseudo code

file.readlines().slice(1, 10).map(line => line.slice(0, 2)).frequency().max()
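The same heuristic written out in Rust as an illustrative sketch: skip the shebang line, look at up to the next ten lines, and pick the most frequent short punctuation-like prefix as the best-guess comment marker.

```rust
use std::collections::HashMap;

// Illustrative frequency heuristic, not tome's actual code: count
// two-character line prefixes and return the most common one that
// looks like punctuation rather than code.
fn guess_comment_prefix(contents: &str) -> Option<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for line in contents.lines().skip(1).take(10) {
        let prefix: String = line.trim_start().chars().take(2).collect();
        let prefix = prefix.trim_end().to_string();
        // Only count prefixes that start with punctuation (e.g. "#", "--").
        if !prefix.is_empty() && !prefix.chars().next().unwrap().is_alphanumeric() {
            *counts.entry(prefix).or_insert(0) += 1;
        }
    }
    counts.into_iter().max_by_key(|(_, n)| *n).map(|(p, _)| p)
}

fn main() {
    let lua = "#!/usr/bin/env lua\n-- doc line\n-- another\nprint(1)\n";
    assert_eq!(guess_comment_prefix(lua), Some("--".to_string()));
}
```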

Would you be interested in this if I code it out?

zph commented 6 months ago

If there's an appetite for using a 3rd party library, we could use https://docs.rs/syntect/latest/syntect/parsing/struct.SyntaxSet.html#method.find_syntax_for_file

(Potentially, but un-researched) Then pull the definition out of the sublime-syntax files (ref?) to determine the `comment`, `comment_start`, and `comment_end` values.

That would be more reliable than rolling our own, at the cost of a dependency.

@toumorokoshi If you're interested, do you have a preference/thoughts on the approaches I outlined?

zph commented 5 months ago

I prototyped it using a combination of filename extensions or falling back to parsing the shebang: https://github.com/zph/tome/blob/54329a3d298af75fd48279b1cd550330b44db22c/src/script.rs#L68-L88

If you're interested I can pull it out and contribute upstream 🖖

toumorokoshi commented 5 months ago

Thanks! I took a look at the code and the approach looks good to me. The syntect route seems slightly more comprehensive and should make adding support for new languages easier (although the one-liners you have are pretty easy as-is).

It's a little heavy-handed to have to maintain a mapping of every possible script type, but I can't think of a better solution - it's just knowledge that has to be built in.

The tests should be pretty straightforward too - just add files for the various extensions, plus a few tests that verify we can pull some information out of them (maybe check for strings in `tome help`?).

If you add the code and have trouble with the tests, I can add them. Thanks for driving this!