emilypi commented 3 years ago

What is this issue?

A central place to discuss the issue of a Cabal exact-printer so that we can push forward the work all in one place. Currently, The discussion stretches back across many issues over the past 6 years.

What is it?

An exact-printer, as inspired by ghc-exactprint, is a byte-for-byte bidirectional parser and pretty printer, which gaurantees the following constructs exist and are principled:

A .cabal file source tree representation, which is in bijection with the package description for a .cabal file for a given spec version. This representation should be parametrized with all offset and encoding information along with file data.
A method for introducing deltas to .cabal file entries. I should able to update and transform individual components of data deep in the representation without modifying the broader structure.
A pretty printer which adds offset annotations to the specification fragment so that the output should have a byte-for-byte round trip on no-op parse/print and print/parse tests.
Tests which confirm this in cabal-testsuite and cabal install's test suite for the individual commands.

Why do we need it?

This work is tied very closely with the format, init, gen-bounds and other work as shown in historical discussions:

This would free us up for several important quality of life improvements to the cabal install ecosystem, as well as general consumers of the Cabal library API. We would like to have this in by Cabal-3.8.0.0. The important bits this enables include:

cabal gen-bounds - will be able to update build-depends stanzas in a principled manner, so that bounds management will be automatic in the future if the user so desires.
cabal format - we hope that a .cabal format can have a canonical shape, so we can limit the amount of deviations from the norm, or ad-hoc formatting choices present in the ecosystem.
cabal init becomes much easier to manage, and we no longer have to dedicate redundant types, formatting, and pretty printing to this effort.
Transforming cabal files, with something like module discovery, becomes that much simpler.

Who's in charge of this?

I am currently overseeing @ptkato who has been tasked with taking this on. Please consider him and myself a point of contact for this work, and join us in libera.chat#hackage to discuss.

@ptkato @gbaz @davean

ulysses4ever commented 1 year ago

The concern would be that it might not keep pace with the spec

The spec doesn't change so very often, I thought.

BurningWitness commented 1 year ago

I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc.

While it is a lot of work due to the sheer number of field formats and caveats, I don't see either efficiency or backwards compatibility as issues:

Exact parser should be able to stream its results one node at a time. One node in this case is (section line* | field line* | comment | newline), where line* in sections can be parsed into node*. Some minor inefficiencies will arise due to streaming, e.g. parsing some of the next line just to know if it belongs to this node or not, but I wouldn't sweat it.
Every format version can have a separate representation (data Cabal = CabalV1_0 ... | CabalV2_0 ... | ...), each with its own exact -> representation parser. Due to the aforementioned lack of changes between versions, the only version that truly matters is the latest one, all previous ones are simply clones with some fields added or removed.

On top of it all of this should be completely pure: downloading all of Hackage, and then checking exactness of the exact parser and errors emitted by the representation one, should be good enough tests for both of them.

andreabedini commented 1 year ago

Similarly to what @gbaz says, I welcome and support anybody to work on an independent parsing library.

BUT I also doubt of the long term benefits of such efforts, since Cabal-syntax would remain the reference implementation of the syntax[1] and any separate implementation could not guarantee to stay compatible in the future. Indeed, Hackage has already a handful of different parsers, do they work?

Please, do dive in the code, try your ideas! IMHO the only way out of the tarpit is through iterative improvements. I know the parsing part could be daunting (from memory I count 4 "levels", the lexer, the parser from tokens to [Field Position], the conversion from [Field Position] to Fields (don't even ask), the parser from Fields to GenericPackageDescription (using the FieldGrammar infrastructure)).

Try to take them a part, try to replace one, why do they exist? what do the do? can we evolve it? which bit first? and how?

Many would prefer wqorking on a green field project but Cabal has many years of history and many constraints to satisfy. Here lies the challenge.

[1] which is not formally specification elsewhere, perhaps one could start from this first. But we would need a way to determine if such specification matches the current implementation. Any idea is welcome.

liamzee commented 1 year ago

@andreabedini

But there's short-term benefits, no? Projects like HLS are waiting for a good exact parser of any kind, and while long-term, it might be nice to retrofit the parser in Cabal-syntax, in the short-term, this scratches an itch and allows tooling that requires Cabal exact parsing to move forward.

andreabedini commented 1 year ago

Yes, there are short term benefits. I cannot disagree with that. Exploring the design space is also something that can become a long term benift; but there is also the usual trap of getting 80% of the design right and never finish the remaining 20%.

gbaz commented 1 year ago

The spec doesn't change so very often, I thought.

It changes slightly (in terms of additions of new fields) with almost every Cabal major release:

https://cabal.readthedocs.io/en/stable/file-format-changelog.html

tonyday567 commented 1 year ago

the only way out of the tarpit is through iterative improvements. I know the parsing part could be daunting (from memory I count 4 "levels", the lexer, the parser from tokens to [Field Position], the conversion from [Field Position] to Fields (don't even ask), the parser from Fields to GenericPackageDescription (using the FieldGrammar infrastructure)).

Try to take them a part, try to replace one, why do they exist? what do the do? can we evolve it? which bit first? and how?

I would love to collaborate on a project with this type of specification. Something that could start small - I like the idea of just learning a bit about the answers to these questions. An initial aim could be just to get a "rough guide" to cabal syntax, and the library, that could maybe help the next generation of intrepid explorers.

I don't know too much about the code base, but I have parsed cabal files using the lbrary and some (flatparse) logic.

jappeace commented 1 year ago

@gbaz

That said, I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc. At this point what we can do is only evolve it.

I get the impression this is somewhat demotivating to potential new contributors, so I wish to suggest a path to replacing cabal parser:

initial parser/printer is developed as a seperate library,
it provides a mapping (or mappings) to the current cabal ast (if you can call it that), eg ExactAst -> [Field Position] etc.
- This can be tagged by version as well, it doesn't have to be a single mapping!
- This would also give compatibility with all existing cabal code.
we use tests to assert correctness, all golden tests that pass in the current cabal parser should have the same output as the exact print parser and it's mapping:
- we can contribute more tests to the existing cabal library if the maintainers feel they're lacking. This can be upstreamed immediatly!
- the mapping to legacy AST can be used to see if both parsers do the same.
- the exact printer also get's tests described originally.
  - eg: have a byte-for-byte round trip on no-op parse/print and print/parse tests.
we setup benchmarking to assert it's fast enough.

Once the maintainers feel confident the exact print parser is good enough in terms of correctness and performance, we can swap out the legacy legacy parser for the exact print one with help of the mapping.

BurningWitness commented 1 year ago

Having thought about it for a bit more, curly bracket syntax forces parsing to be depth-first, as any deeper level may introduce a breaking offset change:

foo
    bar {
         baz
              this {
that
}
              end }

As such streaming is limited to one top-level element at a time, and to prove the element ends at a correct point its entire skeleton must be parsed.

For any future format-inventors:

I think the sane approach would be to instead use a special literal at the start of the line, only interpreting it as an offset break if its offset is smaller than of the element its inside of:

foo
    bar
>        baz
>              this
>> that
>              end

Same rule would make sense for comments inside fields:

    foo: -- bar    < not a comment
--       baz       < comment
      -- this      < not a comment

andreabedini commented 1 year ago

@jappeace

I understand how @gbaz's comment might have come across as demotivating. But "evolve it" and "replace it" are not necessarily in opposition.

Just like Theseus's ship, we can evolve it by replacing one part at the time (or should I say replace it by evolving one part at the time? :thinking:).

I wish to suggest a path to replacing cabal parser:

I appreciate you have been spending time one this and I will comment on some technical aspects of it; but what I want to understand first is: are you willing to do the work?

initial parser/printer is developed as a seperate library,

This is done, here is a freshly baked new parser written by @VeryMilkyJoe. I also notice that @liamzee has already contributed to it.

I might disagree with some of the technical choices but let's just roll with the CabalAST as it is implemented there.

To have a chance of being able to use it Cabal-syntax, it has to depend only on boot packages so you have to replace megaparsec with parsec.

* it provides a mapping (or mappings) to the current cabal ast (if you can call it that), eg `ExactAst -> [Field Position]` etc.

You are welcome to do this.

  * This can be tagged by version as well, it doesn't have to be a single mapping!
Which version you are talking about here? The Field syntax does not depend on any version. I am not sure whether CabalAST makes more assumptions than Field.
  * This would also give compatibility with all existing cabal code.

At the cost of impoving anything. If Cabal operated on Field Position, we would have lost all the annotations that CabalAST provides.

Also, you will realise that CabalAST does not mention build-depends or anything like that. Why is that? It's because Field (which I believe CabalAST was modelled from) is more basic syntax structure.

The actual fields definition is in the various FieldGrammars, a static definition of the fields that is used for both parsing and pretty-printing. It is general enough it is also used to generate the field syntax reference.

GenericPackageDescription is parsed from Field using this infrastructure. I assume you could parse it from CabalAST instead; but how are you going to keep the annotations?

* we use tests to assert correctness,  all golden tests that pass in the current cabal parser should have the same output as the exact print parser and it's mapping:

Define "output": Field Position, Fields, GenericPackageDescription?

  * we can contribute more tests to the existing cabal library if the maintainers feel they're lacking. This can be upstreamed immediatly!

(while considering myself only a contributor) Yes please!

You can also use the test-suite hackage-tests in Cabal-tests.

  * the mapping to legacy AST can be used to see if both parsers do the same.

Again which legacy AST you are referring to here?

* we setup benchmarking to assert it's fast enough.

In terms of benchmarks I believe we have only cabal-benchmarks/bench/CabalBenchmarks.hs.

As far as I understand:

in terms of speed, timing hackage-tests might be enough
memory usage is also important, the main constraint comes from hackage-server which use Cabal-syntax. Having a benchmark to measure memory consumption in a use-case like hackage-server would be very valuable. You are welcome to contribute one.

Once the maintainers feel confident the exact print parser is good enough in terms of correctness and performance, we can swap out the legacy legacy parser for the exact print one with help of the mapping.

Once the problems above have been resolved, I believe nobody will stand in the way.

I apologise if I have come across a bit rude. By looking at the codebase, I think these are the problems we have to face. I am eager to see them discussed or better still solved.

Of course, I would be also happy to be called out if someone thinks that I am too entrenched in the current architecture and everything can be easily simplied. Show me!

jappeace commented 1 year ago

I appreciate you have been spending time one this and I will comment on some technical aspects of it; but what I want to understand first is: are you willing to do the work?

No the intention was to hire @LemonjamesD to do this, but she has to know what to do. I feel your comment makes it a lot more clear what the actual issue is. Thanks for that :)

At the cost of impoving anything. If Cabal operated on Field Position, we would have lost all the annotations that CabalAST provides.

Yes so I get the impression the best approach is to modify types such as Fields and GenericPackageDescription in place to add exact printing support (1). And deal with the obnoxious amount of compile errors.

Define "output": Field Position, Fields, GenericPackageDescription?

Any of those would work depending on what you're mapping into, if you'd do the seperate AST approach.

Again which legacy AST you are referring to here?

I perhaps mistakenly assumed the current way of parsing needed replacing.

I apologise if I have come across a bit rude. By looking at the codebase, I think these are the problems we have to face. I am eager to see them discussed or better still solved.

It's fine you provided a goldmine of information, at least I can now make a sorta scoped contract rather then "oh go do this thing".

New approach based on (1):

add some sort of pretty function for generic package descriptions:
```
exactPrint :: GenericPackageDescription -> Text
```
(don't know if something such as the pretty package could be used)
setup a extremly basic test for byte for byte roundtrippin.
start adding Position to various fields in generic package description, let the test pass.
add more tests that are more realistic
repeat untill all of hackage can be parsed.

This approach makes the work easier to split as well because you can just make test by test pass. Where a test is a cabal file being roundtripped.

BurningWitness commented 12 months ago

Seems like the last idiosyncrasy on my list is section name parsing, allowing

> foo
>   { bar "b -- {a}:z" }
[Section "foo" [] [Section "bar" [SecArgStr "b -- {a}:z"] []]]

No manifests on Hackage utilize this as a format escape, only ds-kanren/metric patches do. foo:bar I assume clashes directly with the existing console command format of pkg:test:name, while foo bar always has to be quoted. As such I don't see a good reason to maintain any of this.

haskell / cabal

Meta: Exact-printer Mega-issue #7544

What is this issue?

What is it?

Why do we need it?

Who's in charge of this?