rasendubi / uniorg

An accurate Org-mode parser for JavaScript/TypeScript
https://oleksii.shmalko.com/uniorg
GNU General Public License v3.0
256 stars 24 forks

Keep the original letter-casing of keywords during the parsing phase #72

Open Delapouite opened 1 year ago

Delapouite commented 1 year ago

Hi

The Org syntax allows both UPPERCASE and lowercase keywords. Example:

#+PROPERTIES
…
#+END

versus

#+properties
…
#+end

Currently, the parser forces UPPERCASE output:

https://github.com/rasendubi/uniorg/blob/a74a80bedb41cdc3190bec1feb09f5b19c3c63ed/packages/uniorg-parse/src/parser.ts#L1030

I understand that this step can be beneficial, since it homogenizes keywords for everything further down the processing pipeline.

But when lots of Org documents have been authored in the lowercase style, a pipeline that reads Org files → parses them → does stuff → stringifies → overwrites the files turns this cosmetic change into a lot of noise, especially in diffs if the Org files are versioned with git, for example.

Do you think we could keep the current behavior by default but add a new option to keep the case as authored in the original doc?

Thanks!

rasendubi commented 1 year ago

Do you necessarily need to preserve the original spelling?

Would having an option in uniorg-stringify to select uppercase/lowercase spelling work for you?

I'm just worried that allowing any case would complicate processing and plugins. Besides upper- and lowercase, any mix is allowed (#+Title, #+tiTLe), so all processors would have to take that into account.
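
To illustrate the concern, here is a minimal sketch of the comparison every plugin would need if keys kept their authored case. The `isKeyword` helper is hypothetical (it is not part of uniorg), and the nodes are hand-written objects that only assume the `type`, `key`, and `value` fields of a keyword node.

```javascript
// Hypothetical helper, not part of uniorg: the case-insensitive check a plugin
// would have to perform if keyword keys were preserved as authored.
function isKeyword(node, name) {
  return node.type === 'keyword' && node.key.toUpperCase() === name.toUpperCase();
}

// Hand-written nodes shaped like keyword nodes (assumed fields: type, key, value).
const authored = [
  { type: 'keyword', key: 'Title', value: 'Notes' },
  { type: 'keyword', key: 'tiTLe', value: 'Notes' },
  { type: 'keyword', key: 'TITLE', value: 'Notes' },
];

// An exact comparison only matches one of the three spellings.
const exactMatches = authored.filter((n) => n.key === 'TITLE').length;
// The case-insensitive helper matches all of them.
const looseMatches = authored.filter((n) => isKeyword(n, 'title')).length;

console.log(exactMatches, looseMatches); // prints "1 3"
```

With the parser normalizing keys to uppercase, the plain `key === 'TITLE'` check is enough.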

Delapouite commented 1 year ago

I was not aware of the mixed-case possibilities, so I think you're right. Focusing on a choice between upper- and lowercase should already be a good enough option. Thanks

rasendubi commented 1 year ago

Just checked, and the current behavior is also consistent with org-element (the reference parser in Emacs Lisp).

Given the following org document:

#+test: blah

it produces the following AST:

((section
  (:begin 1 :end 13 :mode first-section :granularity nil)
  (keyword
   (:key "TEST" :value "blah" :mode top-comment :granularity nil))))

The lower-casing can be implemented in two ways: as a unified plugin (that traverses all keywords and lower-cases keys) or as a configuration for uniorg-stringify.

The plugin could go like this:

unified()
  .use(uniorgParse)
  .use(otherPlugins)
  // Add this plugin immediately before uniorg-stringify
  // so that it does not interfere with other plugins.
  .use(() => (tree) => {
    // visit from unist-util-visit
    visit(tree, 'keyword', (keyword) => {
      keyword.key = keyword.key.toLowerCase();
    });
  })
  .use(uniorgStringify)
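
The same transform can be tried without installing anything. The following is only a sketch: `keywordCase` is a hypothetical plugin name, a small recursive walk stands in for `visit` from unist-util-visit, and the tree is hand-built to resemble what the parser might produce for `#+test: blah` (the node shapes are assumptions).

```javascript
// Hypothetical plugin factory; `keywordCase` is an illustrative name, not a uniorg API.
function keywordCase(style = 'lower') {
  const convert =
    style === 'upper' ? (s) => s.toUpperCase() : (s) => s.toLowerCase();

  // Minimal recursive walk standing in for unist-util-visit.
  const walk = (node) => {
    if (node.type === 'keyword' && typeof node.key === 'string') {
      node.key = convert(node.key);
    }
    (node.children || []).forEach(walk);
  };

  // The returned function is the transformer that receives the tree.
  return (tree) => {
    walk(tree);
  };
}

// Hand-built tree (assumed shape) for the document "#+test: blah".
const tree = {
  type: 'org-data',
  children: [
    {
      type: 'section',
      children: [{ type: 'keyword', key: 'TEST', value: 'blah' }],
    },
  ],
};

keywordCase('lower')(tree);
console.log(tree.children[0].children[0].key); // prints "test"
```

In a unified pipeline this would be attached as `.use(keywordCase, 'lower')`, again immediately before the stringify step.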

Adjusting uniorg-stringify is obviously more involved, especially because it currently lacks options handling. That said, implementing handlers as in uniorg-rehype would make it much more powerful, and I'm willing to accept a PR.
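
For discussion, the stringify-side option could look roughly like the sketch below. Everything here is hypothetical: `stringifyNode`, the `keywordCase` option, and the `handlers` option are illustrative names rather than existing uniorg-stringify APIs, and only the keyword node is covered.

```javascript
// Hypothetical sketch: neither `handlers` nor `keywordCase` exists in
// uniorg-stringify today; the names are for illustration only.
const defaultHandlers = {
  keyword: (node, options) => {
    const key =
      options.keywordCase === 'lower'
        ? node.key.toLowerCase()
        : node.key.toUpperCase();
    return `#+${key}: ${node.value}\n`;
  },
};

function stringifyNode(node, options = {}) {
  // User-supplied handlers override the defaults, keyed by node type.
  const handlers = { ...defaultHandlers, ...(options.handlers || {}) };
  const handler = handlers[node.type];
  if (!handler) {
    throw new Error(`no handler for node type "${node.type}"`);
  }
  return handler(node, options);
}

const kw = { type: 'keyword', key: 'TEST', value: 'blah' };

// Default keeps the current uppercase behavior.
const upper = stringifyNode(kw);
// Opting in to lowercase output.
const lower = stringifyNode(kw, { keywordCase: 'lower' });
// A custom handler can take over keyword output entirely.
const custom = stringifyNode(kw, {
  handlers: { keyword: (n) => `#+${n.key.toLowerCase()}: ${n.value}\n` },
});
```

Keeping uppercase as the default would preserve the current output, while the handler override leaves room for styles that a single case flag cannot express.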