Open emilypi opened 3 years ago
The concern would be that it might not keep pace with the spec
The spec doesn't change so very often, I thought.
I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc.
While it is a lot of work due to the sheer number of field formats and caveats, I don't see either efficiency or backwards compatibility as issues:
Exact parser should be able to stream its results one node at a time. One node in this case is (section line* | field line* | comment | newline)
, where line*
in sections can be parsed into node*
. Some minor inefficiencies will arise due to streaming, e.g. parsing some of the next line just to know if it belongs to this node or not, but I wouldn't sweat it.
Every format version can have a separate representation (data Cabal = CabalV1_0 ... | CabalV2_0 ... | ...
), each with its own exact -> representation
parser. Due to the aforementioned lack of changes between versions, the only version that truly matters is the latest one, all previous ones are simply clones with some fields added or removed.
On top of it all of this should be completely pure: downloading all of Hackage, and then checking exactness of the exact parser and errors emitted by the representation one, should be good enough tests for both of them.
Similarly to what @gbaz says, I welcome and support anybody to work on an independent parsing library.
BUT I also doubt of the long term benefits of such efforts, since Cabal-syntax
would remain the reference implementation of the syntax[1] and any separate implementation could not guarantee to stay compatible in the future. Indeed, Hackage has already a handful of different parsers, do they work?
Please, do dive in the code, try your ideas! IMHO the only way out of the tarpit is through iterative improvements. I know the parsing part could be daunting (from memory I count 4 "levels", the lexer, the parser from tokens to [Field Position]
, the conversion from [Field Position]
to Fields
(don't even ask), the parser from Fields
to GenericPackageDescription
(using the FieldGrammar
infrastructure)).
Try to take them a part, try to replace one, why do they exist? what do the do? can we evolve it? which bit first? and how?
Many would prefer wqorking on a green field project but Cabal has many years of history and many constraints to satisfy. Here lies the challenge.
[1] which is not formally specification elsewhere, perhaps one could start from this first. But we would need a way to determine if such specification matches the current implementation. Any idea is welcome.
@andreabedini
But there's short-term benefits, no? Projects like HLS are waiting for a good exact parser of any kind, and while long-term, it might be nice to retrofit the parser in Cabal-syntax
, in the short-term, this scratches an itch and allows tooling that requires Cabal exact parsing to move forward.
Yes, there are short term benefits. I cannot disagree with that. Exploring the design space is also something that can become a long term benift; but there is also the usual trap of getting 80% of the design right and never finish the remaining 20%.
The spec doesn't change so very often, I thought.
It changes slightly (in terms of additions of new fields) with almost every Cabal major release:
https://cabal.readthedocs.io/en/stable/file-format-changelog.html
the only way out of the tarpit is through iterative improvements. I know the parsing part could be daunting (from memory I count 4 "levels", the lexer, the parser from tokens to
[Field Position]
, the conversion from[Field Position]
toFields
(don't even ask), the parser fromFields
toGenericPackageDescription
(using theFieldGrammar
infrastructure)).Try to take them a part, try to replace one, why do they exist? what do the do? can we evolve it? which bit first? and how?
I would love to collaborate on a project with this type of specification. Something that could start small - I like the idea of just learning a bit about the answers to these questions. An initial aim could be just to get a "rough guide" to cabal syntax, and the library, that could maybe help the next generation of intrepid explorers.
I don't know too much about the code base, but I have parsed cabal files using the lbrary and some (flatparse) logic.
@gbaz
That said, I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc. At this point what we can do is only evolve it.
I get the impression this is somewhat demotivating to potential new contributors, so I wish to suggest a path to replacing cabal parser:
ExactAst -> [Field Position]
etc.
Once the maintainers feel confident the exact print parser is good enough in terms of correctness and performance, we can swap out the legacy legacy parser for the exact print one with help of the mapping.
Having thought about it for a bit more, curly bracket syntax forces parsing to be depth-first, as any deeper level may introduce a breaking offset change:
foo
bar {
baz
this {
that
}
end }
As such streaming is limited to one top-level element at a time, and to prove the element ends at a correct point its entire skeleton must be parsed.
For any future format-inventors:
I think the sane approach would be to instead use a special literal at the start of the line, only interpreting it as an offset break if its offset is smaller than of the element its inside of:
foo
bar
> baz
> this
>> that
> end
Same rule would make sense for comments inside fields:
foo: -- bar < not a comment
-- baz < comment
-- this < not a comment
@jappeace
I understand how @gbaz's comment might have come across as demotivating. But "evolve it" and "replace it" are not necessarily in opposition.
Just like Theseus's ship, we can evolve it by replacing one part at the time (or should I say replace it by evolving one part at the time? :thinking:).
I wish to suggest a path to replacing cabal parser:
I appreciate you have been spending time one this and I will comment on some technical aspects of it; but what I want to understand first is: are you willing to do the work?
- initial parser/printer is developed as a seperate library,
This is done, here is a freshly baked new parser written by @VeryMilkyJoe. I also notice that @liamzee has already contributed to it.
I might disagree with some of the technical choices but let's just roll with the CabalAST
as it is implemented there.
To have a chance of being able to use it Cabal-syntax, it has to depend only on boot packages so you have to replace megaparsec with parsec.
* it provides a mapping (or mappings) to the current cabal ast (if you can call it that), eg `ExactAst -> [Field Position]` etc.
You are welcome to do this.
* This can be tagged by version as well, it doesn't have to be a single mapping!
Which version you are talking about here? The
Field
syntax does not depend on any version. I am not sure whetherCabalAST
makes more assumptions thanField
.* This would also give compatibility with all existing cabal code.
At the cost of impoving anything. If Cabal operated on Field Position
, we would have lost all the annotations that CabalAST
provides.
Also, you will realise that CabalAST
does not mention build-depends
or anything like that. Why is that? It's because Field
(which I believe CabalAST
was modelled from) is more basic syntax structure.
The actual fields definition is in the various FieldGrammar
s, a static definition of the fields that is used for both parsing and pretty-printing. It is general enough it is also used to generate the field syntax reference.
GenericPackageDescription
is parsed from Field
using this infrastructure. I assume you could parse it from CabalAST
instead; but how are you going to keep the annotations?
* we use tests to assert correctness, all golden tests that pass in the current cabal parser should have the same output as the exact print parser and it's mapping:
Define "output": Field Position
, Fields
, GenericPackageDescription
?
* we can contribute more tests to the existing cabal library if the maintainers feel they're lacking. This can be upstreamed immediatly!
(while considering myself only a contributor) Yes please!
You can also use the test-suite hackage-tests
in Cabal-tests
.
* the mapping to legacy AST can be used to see if both parsers do the same.
Again which legacy AST you are referring to here?
* we setup benchmarking to assert it's fast enough.
In terms of benchmarks I believe we have only cabal-benchmarks/bench/CabalBenchmarks.hs
.
As far as I understand:
hackage-tests
might be enoughOnce the maintainers feel confident the exact print parser is good enough in terms of correctness and performance, we can swap out the legacy legacy parser for the exact print one with help of the mapping.
Once the problems above have been resolved, I believe nobody will stand in the way.
I apologise if I have come across a bit rude. By looking at the codebase, I think these are the problems we have to face. I am eager to see them discussed or better still solved.
Of course, I would be also happy to be called out if someone thinks that I am too entrenched in the current architecture and everything can be easily simplied. Show me!
I appreciate you have been spending time one this and I will comment on some technical aspects of it; but what I want to understand first is: are you willing to do the work?
No the intention was to hire @LemonjamesD to do this, but she has to know what to do. I feel your comment makes it a lot more clear what the actual issue is. Thanks for that :)
At the cost of impoving anything. If Cabal operated on Field Position, we would have lost all the annotations that CabalAST provides.
Yes so I get the impression the best approach is to modify types such as Fields
and GenericPackageDescription
in place to add exact printing support (1).
And deal with the obnoxious amount of compile errors.
Define "output": Field Position, Fields, GenericPackageDescription?
Any of those would work depending on what you're mapping into, if you'd do the seperate AST approach.
Again which legacy AST you are referring to here?
I perhaps mistakenly assumed the current way of parsing needed replacing.
I apologise if I have come across a bit rude. By looking at the codebase, I think these are the problems we have to face. I am eager to see them discussed or better still solved.
It's fine you provided a goldmine of information, at least I can now make a sorta scoped contract rather then "oh go do this thing".
New approach based on (1):
exactPrint :: GenericPackageDescription -> Text
(don't know if something such as the pretty
package could be used)
This approach makes the work easier to split as well because you can just make test by test pass. Where a test is a cabal file being roundtripped.
Seems like the last idiosyncrasy on my list is section name parsing, allowing
> foo
> { bar "b -- {a}:z" }
[Section "foo" [] [Section "bar" [SecArgStr "b -- {a}:z"] []]]
No manifests on Hackage utilize this as a format escape, only ds-kanren/metric patches do. foo:bar
I assume clashes directly with the existing console command format of pkg:test:name
, while foo bar
always has to be quoted. As such I don't see a good reason to maintain any of this.
What is this issue?
A central place to discuss the issue of a Cabal exact-printer so that we can push forward the work all in one place. Currently, The discussion stretches back across many issues over the past 6 years.
What is it?
An exact-printer, as inspired by
ghc-exactprint
, is a byte-for-byte bidirectional parser and pretty printer, which gaurantees the following constructs exist and are principled:.cabal
file source tree representation, which is in bijection with the package description for a.cabal
file for a given spec version. This representation should be parametrized with all offset and encoding information along with file data..cabal
file entries. I should able to update and transform individual components of data deep in the representation without modifying the broader structure.cabal-testsuite
andcabal install
's test suite for the individual commands.Why do we need it?
This work is tied very closely with the
format
,init
,gen-bounds
and other work as shown in historical discussions:This would free us up for several important quality of life improvements to the
cabal install
ecosystem, as well as general consumers of theCabal
library API. We would like to have this in byCabal-3.8.0.0
. The important bits this enables include:cabal gen-bounds
- will be able to updatebuild-depends
stanzas in a principled manner, so that bounds management will be automatic in the future if the user so desires.cabal format
- we hope that a.cabal
format can have a canonical shape, so we can limit the amount of deviations from the norm, or ad-hoc formatting choices present in the ecosystem.cabal init
becomes much easier to manage, and we no longer have to dedicate redundant types, formatting, and pretty printing to this effort.Who's in charge of this?
I am currently overseeing @ptkato who has been tasked with taking this on. Please consider him and myself a point of contact for this work, and join us in
libera.chat#hackage
to discuss.@ptkato @gbaz @davean