orgapp / orgajs

parse org-mode content into AST
https://orga.js.org
MIT License

Preserve as much of the original structure as possible #47

Open gitonthescene opened 4 years ago

gitonthescene commented 4 years ago

Hello there,

Thanks again for such an awesome project. It would be great to have an orga-stringify utility to fit more completely into the unified ecosystem and open ourselves up to using more transform tools. Then we could parse org files to an AST, transform them and then re-render the org. Ideally minimal transformations would re-render something pretty close to the original. To do that, we'd need to preserve as much of the original structure as possible.

I propose something like these changes. I'm after the effect more than the approach so I'm happy to discuss/modify/whatever. If you'd like me to make this a pull request, please let me know.

My thinking is that the extra structure in the AST can always be stripped when not needed. For instance, you could filter out whitespace/keyword nodes as well as trim() inner text if desired. But having it in the AST allows us to (nearly) faithfully re-render the original org file.
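As a rough illustration of stripping the extra structure when it is not needed, here is a minimal sketch. The node shape (`type`, `value`, `children`) and the type names `whitespace` and `keyword` are assumptions made for the example, not necessarily what orga emits, and `stripNodes` is a hypothetical helper.

```javascript
// Hypothetical helper: returns a copy of the tree without the listed node
// types and with text values trimmed. The node shape { type, value, children }
// and the type names below are assumptions, not orga's actual schema.
function stripNodes(node, dropTypes = new Set(["whitespace", "keyword"])) {
  const copy = { ...node };
  if (typeof copy.value === "string") {
    copy.value = copy.value.trim();
  }
  if (Array.isArray(copy.children)) {
    copy.children = copy.children
      .filter((child) => !dropTypes.has(child.type))
      .map((child) => stripNodes(child, dropTypes));
  }
  return copy;
}

// A small hand-built "detailed" tree for demonstration.
const detailed = {
  type: "root",
  children: [
    { type: "whitespace", value: "\n" },
    {
      type: "paragraph",
      children: [
        { type: "text", value: "  hello  " },
        { type: "whitespace", value: " " },
      ],
    },
  ],
};

const lean = stripNodes(detailed);
console.log(JSON.stringify(lean, null, 2));
```

The function copies rather than mutates, so the detailed tree stays available for a faithful re-render.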

In there is a separate commit with the changes to the snapped files if you just want to see the effect on the AST. I think in a couple of cases it even renders a bit more accurately.

Again, more than happy to discuss.

Thanks again, -Doug

P.S. I have a prototype for orga-stringify as well which I'll add to my fork as soon as I figure out how lerna works.

gitonthescene commented 4 years ago

I've now incorporated orga-stringify into my fork. It's just pure JavaScript currently, but when you run the following code on this sample org file, the output differs from the original only by a single trailing newline.

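// Load unified, a vfile reader, and the orga parse/stringify plugins.
const unified = require("unified");
const vfile = require("to-vfile");
const parse = require("orga-unified");
const render = require("orga-stringify");

// toJSON: false asks for org text; the option can instead dump the tree as JSON.
const processor = unified().use(parse).use(render, { toJSON: false });

function main() {
  processor
    .process(
      vfile.readSync(
        "/sample/orgfile.txt"
      )
    )
    .then(
      (file) => {
        // Write the re-rendered org text without adding an extra newline.
        process.stdout.write(String(file));
      },
      (err) => {
        // Report failures on stderr so they do not mix with the rendered output.
        console.error(String(err));
      }
    );
}
main();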

It optionally just spits out the JSON version of the tree using your getCircularReplacer() function.
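For readers unfamiliar with the idea, a circular replacer commonly follows the `WeakSet` pattern below. This is a sketch of that general pattern under the assumption that the project's `getCircularReplacer()` works along these lines; the actual implementation may differ. The `parent` back-reference in the sample nodes is also an assumption for illustration.

```javascript
// Sketch of the common WeakSet-based replacer for JSON.stringify.
// Assumption: the project's getCircularReplacer() behaves similarly.
function getCircularReplacer() {
  const seen = new WeakSet();
  return (key, value) => {
    if (typeof value === "object" && value !== null) {
      // An object we have already visited is dropped instead of recursed into.
      if (seen.has(value)) return undefined;
      seen.add(value);
    }
    return value;
  };
}

// A child that points back at its parent creates a cycle.
const parent = { type: "headline", children: [] };
const child = { type: "text", value: "hi", parent };
parent.children.push(child);

const json = JSON.stringify(parent, getCircularReplacer());
console.log(json);
```

One caveat of this pattern: it also drops an object that is referenced twice without forming a cycle, since it only tracks whether an object has been seen before.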

gitonthescene commented 4 years ago

The head version of my fork now handles the trailing newline. Moreover, it completely reproduces all of the test examples but three. It renumbers two list examples where the numbers are out of order and it reformats a raggedly entered table into a more rectangular one.
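To show why a raggedly entered table cannot round-trip byte for byte once it is normalized, here is a small sketch of padding cells to a rectangle. It is not the fork's code; `formatTable` and the array-of-rows input are hypothetical and used only for illustration.

```javascript
// Hypothetical formatter: pads every cell to its column's widest entry and
// fills in missing trailing cells, producing a rectangular table.
function formatTable(rows) {
  const columnCount = Math.max(...rows.map((row) => row.length));
  const widths = Array.from({ length: columnCount }, (_, col) =>
    Math.max(...rows.map((row) => (row[col] || "").length))
  );
  return rows
    .map(
      (row) =>
        "| " +
        widths.map((width, col) => (row[col] || "").padEnd(width)).join(" | ") +
        " |"
    )
    .join("\n");
}

// The last row is missing a cell, as in a raggedly typed table.
const ragged = [["name", "qty"], ["apple", "3"], ["kiwi"]];
const table = formatTable(ragged);
console.log(table);
```

Because the original spacing is discarded in favor of computed widths, the output is tidier than the input but no longer identical to it.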

gitonthescene commented 4 years ago

Hey there,

Not that having these line up needs to be a goal, but for curiosity's sake I wrote the following tiny elisp function to have a look at what the Emacs internal syntax tree looks like for a given org buffer:

(defun grab-org-nodes (node)
  (list (if (listp node) (car node)) (-map 'grab-org-nodes (om-get-children node))))

You need to package-install both dash.el and om.el to run it. It's just a general outline of the tree. Non-node types show up as nil.

Regards, -Doug

gitonthescene commented 4 years ago

Also, to align with the unified structure maybe orga-unified should be called orga-parse sort of like remark-parse and there can be another package with a frozen parser like remark. Or maybe just make orga-unified have the processor.

gitonthescene commented 4 years ago

It would be great to get a reply here. The more full-featured the tools are, the more likely they are to be used.

boj commented 4 years ago

@gitonthescene I was perusing this project and wanted to say that this all seems to be on the right track. The ability to convert to<->from the source material without altering it would be a great use case for the toy I have in mind.

gitonthescene commented 4 years ago

Thanks. You're welcome to play with my fork. I'm happy to answer any questions you might have.

xiaoxinghu commented 4 years ago

@gitonthescene orga-stringify looks amazing. I was busy working on v2; part of the reason is that with a strongly typed codebase, it's much easier to collaborate and have a set of conventions. Can you have a look at the current master and see if you can adopt the new style? Also, with v2 we now have Position in nodes. It's extra information that might be useful for faithfully re-rendering the org-mode text. I'd like to help with any issues.

xiaoxinghu commented 4 years ago

I'd like your opinion here. We now have the ability to tokenize everything, including whitespace. Do you think it's a good idea to include all tokens in the AST? I was worried that it would be too verbose, which is why I currently skip all whitespace. We can easily change that now. We do have the newline token, but it's not included in the final syntax tree. What are your thoughts?

gitonthescene commented 4 years ago

Hey, thanks for getting back. I think it makes sense to put in all the tokens until they become a performance problem, and even then to make the level of detail optional. The reason I say this is that some people may want the full detail to "edit" the tree and then stringify it. That was my use case. The only potential problem I see from the extra detail is performance in processing, but as Knuth says, "premature optimization is the root of all evil". You can always transform a detailed tree into a less detailed tree, but you can't go the other way around. It might even be worth providing a transformer or two which strips whitespace or whatever, just to demonstrate. I'm happy to contribute code.
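The "can't go the other way around" point can be sketched with a toy tokenizer. This is not orga's tokenizer; the `text` and `whitespace` token types and the `tokenize` function are assumptions chosen to keep the example small.

```javascript
// Toy tokenizer: splits on runs of whitespace but keeps them as tokens.
// The capture group in the regex makes split() return the separators too.
function tokenize(text) {
  return text
    .split(/(\s+)/)
    .filter((piece) => piece !== "")
    .map((value) => ({
      type: /^\s+$/.test(value) ? "whitespace" : "text",
      value,
    }));
}

const original = "* TODO   write  docs\n";
const tokens = tokenize(original);

// With every token kept, joining the values reproduces the input exactly.
const roundTrip = tokens.map((token) => token.value).join("");

// After dropping whitespace tokens, the original spacing is unrecoverable.
const lossy = tokens
  .filter((token) => token.type !== "whitespace")
  .map((token) => token.value)
  .join(" ");

console.log(roundTrip === original); // true
console.log(lossy === original); // false
```

Going from the detailed token list to the lean one is a simple filter; going back would require guessing how many spaces and newlines were there.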

I'll have a look at the master and try to rework orga-stringify. As I said in one of these issues, I was more after the effect than insisting on an approach. I'm a big believer in programming "for effect" (i.e. to an API) since you can always revisit the code later. Plus shipping results helps keep users interested.

Thanks again, -Doug

P.S. Since most use cases for this are at build time, I'd bet most people aren't that performance-sensitive.

xiaoxinghu commented 4 years ago

> Also, to align with the unified structure maybe orga-unified should be called orga-parse sort of like remark-parse and there can be another package with a frozen parser like remark. Or maybe just make orga-unified have the processor.

My intention is for orga to be standalone. Even though it is heavily modelled after remark, the package orga itself is self-contained. As for the naming of the packages: remark is a unified processor, but orga is not; it's basically a function that parses a string into a syntax tree. So I am thinking of renaming orga-unified to orga-unified-parse, because we are going to add more plugins to the ecosystem, like orga-unified-toc etc. The name is just a hint that these packages should be used within the unifiedjs ecosystem. They are just wrappers around packages like oast-to-hast, which is standalone (the only "dependency" is the HAST definition, which is more of a standard convention than a dependency). orga-unified-toc should be a thin wrapper around oast-toc, just like remark-toc is to mdast-util-toc. What do you think?

Take a look at PR #62

gitonthescene commented 4 years ago

If you mean you want to keep unified wrappers separate from a core orga library, I think that that makes sense. One of the things I like about the unified setup is that it tends to be made up of a lot of small packages so that you only have to pull in what you need, sort of like the UNIX philosophy. If that's the plan, then having a consistent naming for the unified wrappers also makes sense and I think your suggestions sound good. (FWIW, I wasn't really sure what names to use when I made the suggestion above.) FWIW, @wooorm seems like a really helpful guy.

I do kind of like reorg- as a prefix, though. If nothing else, it's less typing.

tconfrey commented 3 years ago

@gitonthescene @xiaoxinghu did anything ever come of this?

For my application I'm only concerned about the header, paragraph text and link elements. I was originally dropping any other elements and handling writing out the header/para/links in an application-specific manner. Most recently I've updated to V2 and am now using the position attributes to save the original text and mirror it back out.
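A minimal sketch of that mirroring approach, assuming each node carries a unist-style `position` with `start.offset` and `end.offset` (whether orga v2 positions include offsets is an assumption here, and `sourceOf` is a hypothetical helper):

```javascript
// Hypothetical helper: returns the exact source text a node was parsed from.
// Assumes unist-style positions with character offsets into the source string.
function sourceOf(node, source) {
  const { start, end } = node.position;
  return source.slice(start.offset, end.offset);
}

const source = "* Heading\nSome [[https://orga.js.org][link]] text\n";

// Hand-built node for illustration; a real parser would supply the position.
const linkText = "[[https://orga.js.org][link]]";
const linkStart = source.indexOf(linkText);
const linkNode = {
  type: "link",
  position: {
    start: { offset: linkStart },
    end: { offset: linkStart + linkText.length },
  },
};

console.log(sourceOf(linkNode, source));
```

Slicing the original string this way sidesteps stringification for untouched nodes, so only edited nodes need an application-specific writer.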