standize / taste

Programming language detection made easy

WieeRd file formats #2

Open SkuldNorniern opened 12 months ago

SkuldNorniern commented 12 months ago

Here's the collection of WieeRd file name schemes/formats that are quite odd to handle. The list will be updated regularly.

WieeRd commented 12 months ago

We need to settle 2 matters before dealing with these wacky woohoo formats.

What should and should not be included as a supported language?

We've narrowed this down to "text-based code, document, data, and DSL" (albeit some ambiguity remains).

To what extent of complexity are we willing to implement detection methods?

My suggested policy for limiting the scope of detection methods is no regex, no heuristics.

WieeRd commented 12 months ago

More on this decision later, but for now, I did come up with a way to detect the above formats without involving regex. I present to you the "trigger-determiner" method:

Examples

  1. Check whether the given path has one of the filenames or extensions registered as a trigger
  2. Attempt to fs::canonicalize() the path, or fall back to the original path
  3. Check the parent directory / iterate over ancestor directories until we find a determiner
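
For illustration, the three steps above could be sketched in Rust roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `Rule`, `trigger_table`, `detect`, and the two sample entries (`config` under `.ssh`, `yml` under `.github`) are hypothetical names and data chosen only for the example.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::Path;

/// Hypothetical rule: once a trigger has matched, `determiner` is the
/// ancestor directory name that confirms `language`.
struct Rule {
    determiner: &'static str,
    language: &'static str,
}

/// Hypothetical trigger table, keyed by file name or extension.
/// The entries are illustrative assumptions, not a real registry.
fn trigger_table() -> HashMap<&'static str, Vec<Rule>> {
    let mut table = HashMap::new();
    table.insert(
        "config",
        vec![Rule { determiner: ".ssh", language: "SSH Config" }],
    );
    table.insert(
        "yml",
        vec![Rule { determiner: ".github", language: "GitHub Actions Workflow" }],
    );
    table
}

fn detect(path: &Path, triggers: &HashMap<&str, Vec<Rule>>) -> Option<&'static str> {
    // Step 1: hash lookup on the file name, then on the extension.
    // Paths that match no trigger return here without further work.
    let name = path.file_name()?.to_str()?;
    let rules = triggers.get(name).or_else(|| {
        let ext = path.extension()?.to_str()?;
        triggers.get(ext)
    })?;

    // Step 2: resolve symlinks and relative parts if possible,
    // otherwise keep the path exactly as given.
    let resolved = fs::canonicalize(path).unwrap_or_else(|_| path.to_path_buf());

    // Step 3: walk the ancestors (skipping the file itself), nearest
    // first, until a directory name matches a determiner.
    for dir in resolved.ancestors().skip(1) {
        if let Some(dir_name) = dir.file_name().and_then(|n| n.to_str()) {
            if let Some(rule) = rules.iter().find(|r| r.determiner == dir_name) {
                return Some(rule.language);
            }
        }
    }
    None
}
```

With this table, `detect(Path::new("repo/.github/workflows/ci.yml"), &trigger_table())` would match the `yml` trigger and then find `.github` two levels up, while a `config` file outside any `.ssh` directory yields `None`. File names and extensions share one table here for brevity; a real implementation would probably keep them separate.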

Compared to regex, which is run on every given path and tests each pattern one by one, this method is only triggered on certain filenames and extensions, and the trigger check is an O(1) lookup.

As you can see from above, this pattern translates nicely to bash glob syntax. We can use this as a notation for these detection methods.

@SkuldNorniern If you can translate all of the above regex patterns to glob syntax, that would be great. Keep collecting patterns like these from the resources listed in #1 and report back if you find something that cannot be handled with this method.

WieeRd commented 12 months ago

Also, if you have time to do so, please do look into the docs of each format and verify where the files are actually located rather than just staring at those strange regexes. Some of them seem to be malformed in my eyes.