lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Improvements to the documentation: feedback from a new user #134

Closed nomorepanic closed 4 years ago

nomorepanic commented 6 years ago

Hi :) I have been building a parser with lark, and got it working by reading docs and examples. However it was not straightforward.

The GitHub wiki is harder to navigate than readthedocs or similar alternatives like mkdocs, and it's a painful experience if you don't know what you are looking for.

The tutorials are really good for getting an idea of what lark can do in simple situations, but bridging the gap to real-world usage takes a while because it is very hard to find good, descriptive EBNF docs. I worked by googling and gluing pieces together. You should consider adding links to more EBNF-related material or covering more of the EBNF syntax; for example, I did not find anything about and

The examples are good, but the best ones are in the examples folder linked from the page and nothing else is said about them. The indented tree example is very good, but a distracted reader might easily miss it or assume that the important ones are in the wiki.

The classes reference is missing the Indenter class and details on implementing a custom indenter class are absent from the docs. The Tree class documentation is particularly unhelpful:

erezsh commented 6 years ago

Hi, thanks for the feedback.

I'm thinking of writing a grammar cheatsheet. Would that help?

the best ones are in the examples folder linked from the page and nothing else is said about them

The examples page contains explanations of each example. Is it not enough? Or do you mean something else?

Docs were never my strong suit, but I'll work on improving them. If you have any suggestions for tools/practices I'll be happy to hear about them.

nomorepanic commented 6 years ago

That could really help.

I meant that having the list of examples in the docs too would be better.

brupelo commented 6 years ago

I'm thinking of writing a grammar cheatsheet. Would that help?

That'd be helpful indeed, some guys like me are just lazy when it comes to reading very meaty doc/spec pages, even more so if they've been generated automatically from the source code. I really love using lightweight docs like cheatsheets or learning by example... The good thing about that type of learning is that you know what's possible to achieve with a particular piece of tech and then... if you want to know more, you just dig into the source code instead of reading nonsense docs. Source code should be self-explanatory by itself.

Coming up with good documentation/examples/use-cases is one of the most important things for making a project more popular... at the end of the day, people are more interested in how to use a piece of tech than in the piece of tech itself.

Examples:

So... yeah, more examples & cheatsheets and less boring docs ;D

erezsh commented 6 years ago

Thanks for the input.

guys like me are just lazy

Sometimes I am too :)

Source code should be self-explanatory by itself

I hope most of the Lark code is reasonably easy to understand (although it certainly requires some background in parsing at times)

lightweight docs like cheatsheets

I want to create a cheatsheet, but I have no idea where to start. Do you know if there's a tool I can use, or do I have to dust off my photoshop skills?

brupelo commented 6 years ago

Do you know if there's a tool I can use, or do I have to dust off my photoshop skills?

Good question, it depends on which operating system you're using... if it were me, I'd say using latex or markdown is just fine, that way people can contribute easily on github. Take a look at this one, it has a bunch of ready-to-go templates and it also provides a WYSIWYG editor from which you can export to pdf/latex.

If you don't like that one, there are fancier ones... just google "online cheatsheet generator" and you'll get a bunch of online tools built specifically for creating cheatsheets, e.g. cheatography.

That said, if you're more of a desktop guy (I am)... don't ask me about it... it's a good question and I'd also like to know which alternatives we've got on the desktop :D . Although I think using tools like photoshop would definitely be quite time consuming, on the other hand, if you know what you're doing with this type of software you'd be able to create really cool stuff. I don't think that's the point though... we casual users want to learn how to use lark properly at the end of the day, so a basic plain cheatsheet explaining all the concepts with simple examples will do ;D

erezsh commented 6 years ago

@brupelo Added a cheatsheet, what do you think? https://github.com/lark-parser/lark/blob/master/docs/lark_cheatsheet.pdf

erezsh commented 6 years ago

@Vesuvium Would be happy to hear your opinion as well.

nomorepanic commented 6 years ago

I think it's great :1st_place_medal:

brupelo commented 6 years ago

@erezsh Cool stuff indeed. This nice job deserves a proper review, so, as a casual user myself, let me give it a shot:

Strong points

1) The whole library can be described in 1 page already, that's great, it means from the point of view of usability the library is quite sanitized. Public interface of libraries should be minimal & complete.
2) You've decided to make the cheatsheet by describing library/grammar concepts instead of summing up the functionality by using snippets, like this one. I do think that's the right approach with this type of library, which is quite conceptual.

Weak points

1) Cheatsheets shouldn't contain redundant/noise information, they should be as compact as possible. Said otherwise, if something is not adding new relevant information for the user, it shouldn't be there. For instance, if you say x=666 you don't need to also say x=666 because x=333+333 or x=111*6.
2) Header/Footer: They're just "stealing" space, which means you can't add more valuable information (if necessary). (Sorry about my poor English, do you use steal or waste here? :))

Cards

1) Lark options: If I read parser="earley" it is clear to me that's the type of parser that will be used by lark, so the comment Use the Earley parser becomes redundant. The default part is fine though, as it adds new relevant info. Same with parser=lalr and parser=cyk. Also, because of those redundant comments the rows have become 2 lines instead of 1 (wasted space for no reason). Same with lexer=standard, start=foo. I think the rest of the options from this card are already fine.
2) Token reference: token.type token.value token.line token.column token.end_line token.end_column are self-explanatory variable names, those comments are redundant.
3) Grammar definitions: This card is ok.
4) Grammar patterns: This card is ok.
5) Terminal atoms: This card is ok.
6) Tree shaping: Mmmm, I think this card is quite confusing. For instance, I see this one but still all those concepts are confusing to me (as a user)... the same happened to me when I read the section https://github.com/lark-parser/lark/blob/master/docs/reference.md#shaping-the-tree. Which means, either "I'm a little bit slow myself" or "the docs are not clear enough" ;)
7) Tree reference: The tree methods are self-explanatory already by name, which is good, those comments are redundant.

Relevant info

1) comments-are-a-code-smell
2) https://refactoring.guru/smells/comments
3) refactoring-comments-into-better-code

Finally, here's a little visual summary of the redundancies I see

All in all, if this has been your first cheatsheet... in general, it's not a bad first attempt at all ;)

Hope this little review will add some useful stuff to make the cheatsheet better.

B.

erezsh commented 6 years ago

@brupelo Thanks for the review. If I understand your criticism correctly, it can be summed up in two points:

  1. Be less verbose, don't explain the self-explanatory.
  2. Tree shaping is confusing, so make it clearer

Regarding the wasted space, I agree, but I'll worry about that when I have too little space left.

brupelo commented 6 years ago

@erezsh Yeah, pretty much... about wasted space, it's also a matter of being able to see the whole cheatsheet at 100% zoom on the screen at once without scrolling. I like printing cheatsheets but it's also nice if they can be viewed at once on the screen. Btw, it's funny... I talk a lot about verbosity/redundancy and almost 99% of my comments are quite verbose/redundant, that's a clear symptom that I need to become a more effective communicator and improve my poor English ;) .

blaiseli commented 6 years ago

Just to say that I'm a first-time user of this kind of tool (lark) and the cheatsheet has been useful for me (even though I still haven't reached my goal).

erezsh commented 6 years ago

To whomever it may concern:

I'm in the process of moving the documentation from the GitHub wiki to readthedocs: https://lark-parser.readthedocs.io

Any input will be appreciated.

nomorepanic commented 6 years ago

That's great. Readthedocs is easier to navigate because of the left-hand menu.

whitten commented 6 years ago

Looked over the cheat sheet. No idea what "standard" lexer means. My initial guess is that it covers certain patterns that are very common, like whether the lexer generates single tokens for keywords (as the SQL standard requires), i.e. some sequences of characters can only be used in rules in certain places. If the sequence is a keyword, you can't use it in any other place. SQL requires that if you use the word ORDER, it must never be used as a column name or a table name, as an example. The letters can be used inside a quoted string, but not as a name.

And of course, there is the opposite pattern, of saying the same string of letters may mean a variable name sometimes, and a type name sometimes, or sometimes a keyword. If you have a keyword tokenizer, it is a real pain to handle this pattern, as it complicates a lot of grammar rules.
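To make the two keyword patterns above concrete, here is a minimal sketch in plain Python using the standard `re` module. It is not Lark's lexer; the keyword set and the `classify_words` helper are hypothetical names chosen only for illustration.

```python
import re

# Hypothetical reserved words, chosen only for illustration.
KEYWORDS = {"SELECT", "FROM", "ORDER", "BY"}

WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def classify_words(text, reserved=True):
    """Return (type, value) pairs for every word found in text.

    reserved=True:  a word that matches a keyword is always a KEYWORD token,
                    so it can never be used as a name.
    reserved=False: every word is a NAME and a later stage (the grammar)
                    decides from context whether it acts as a keyword.
    """
    tokens = []
    for match in WORD.finditer(text):
        word = match.group()
        if reserved and word.upper() in KEYWORDS:
            tokens.append(("KEYWORD", word))
        else:
            tokens.append(("NAME", word))
    return tokens


print(classify_words("SELECT order FROM t"))
print(classify_words("SELECT order FROM t", reserved=False))
```

With reserved keywords, `order` can never be a column name; without them, the decision moves out of the lexer, which is the complication described above.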

Another common keyword-like pattern is when multiple punctuation characters in a particular order are considered to be a single token, like != or <> or ** or && etc. Getting the lexer to generate those tokens is really nice.
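One common way to get multi-character operator tokens is to try longer operators before shorter ones. The sketch below does this with Python's `re` module; the operator list and the `split_operators` name are assumptions for illustration and say nothing about how Lark handles this internally.

```python
import re

# Illustrative operator set. Sorting by length (longest first) ensures that
# "!=" is matched as one token instead of "!" followed by "=".
OPERATORS = ["!", "=", "<", ">", "*", "&", "!=", "<>", "**", "&&"]
OPERATOR_RE = re.compile(
    "|".join(re.escape(op) for op in sorted(OPERATORS, key=len, reverse=True))
)


def split_operators(text):
    """Return the operator tokens found in text, preferring longer matches."""
    return OPERATOR_RE.findall(text)


print(split_operators("a != b ** c & d"))
```

Python's regex alternation picks the first alternative that matches, so the ordering of the alternatives is what produces the longest-match behaviour here.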

Which brings up quoted strings. There are several characters used for quoted strings like (") or (') or (`) or (|) that begin and end particular types of strings. Systems seem to have only a few different ways to put the quoting character inside the string: either doubling the start/end quote, or having an "escape" character that says the next character loses its special quoting quality, like the backslash, i.e. \", or maybe just requiring two different kinds of quotes, and using the other one to mark strings that have the first quote character inside.
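The first two quoting conventions above (backslash escapes and doubled quotes) can be sketched with regular expressions. This is plain Python, not Lark; the pattern and helper names are hypothetical.

```python
import re

# Double-quoted string in which a backslash escapes the next character.
BACKSLASH_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

# Single-quoted string in which a doubled quote stands for one quote (SQL style).
DOUBLED_STRING = re.compile(r"'(?:[^']|'')*'")


def unquote_backslash(literal):
    """Strip the outer quotes and drop the backslash from each escape pair."""
    return re.sub(r"\\(.)", r"\1", literal[1:-1])


def unquote_doubled(literal):
    """Strip the outer quotes and collapse each doubled quote to one."""
    return literal[1:-1].replace("''", "'")


source = "x = 'it''s' + y"
literal = DOUBLED_STRING.search(source).group()
print(literal, "->", unquote_doubled(literal))
```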

You should be able to use certain quoting characters as stand alone tokens and not as begin and end quotes, as some languages use only some of them as quoting markers, and use the others as, for example, operators. This kind of recognizing can be stated in the rules, but the lexer shouldn't keep you from doing what you need to do.

Another common lexer option is to take a sequence of numeric characters and return it as a special token which has the number value of that sequence.
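As a rough illustration of a token that carries its numeric value, here is a small plain-Python sketch. The `Token` tuple and `number_tokens` function are hypothetical and are not Lark's `Token` class.

```python
import re
from typing import NamedTuple, Union


class Token(NamedTuple):
    """Hypothetical token: the matched text plus its converted value."""
    type: str
    text: str
    value: Union[int, float]


NUMBER = re.compile(r"\d+(?:\.\d+)?")


def number_tokens(text):
    """Find digit sequences and attach the number each one denotes."""
    tokens = []
    for match in NUMBER.finditer(text):
        raw = match.group()
        value = float(raw) if "." in raw else int(raw)
        tokens.append(Token("NUMBER", raw, value))
    return tokens


print(number_tokens("3 + 4.5"))
```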

Yet another issue is character pairs that must be balanced, like () or {} or <> or []. If you can find a way to recognize these kinds of patterns quickly, it will speed up grammar development (and error processing).
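A quick balance check is usually done with a stack. The sketch below is a generic illustration in plain Python, not something Lark provides under this name; treating < and > as brackets is an assumption that only makes sense in languages where they are not also comparison operators.

```python
# Closing bracket -> the opening bracket it must match.
PAIRS = {")": "(", "}": "{", "]": "[", ">": "<"}
OPENERS = set(PAIRS.values())


def first_unbalanced(text):
    """Return the index of an offending bracket, or None if text is balanced.

    For a closer without a matching opener, the closer's index is returned.
    For openers that are never closed, the innermost one's index is returned.
    """
    stack = []
    for index, char in enumerate(text):
        if char in OPENERS:
            stack.append((char, index))
        elif char in PAIRS:
            if not stack or stack[-1][0] != PAIRS[char]:
                return index
            stack.pop()
    return stack[-1][1] if stack else None


print(first_unbalanced("f(a[1], {b})"))
print(first_unbalanced("f(a[1)"))
```

Reporting a position rather than a plain yes/no is what helps with error processing: the caller can point at the bracket that broke the nesting.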

If you handle the most common kinds of patterns in the lexer, it makes life easier and quicker.

erezsh commented 4 years ago

I think this discussion has run its course. But feel free to continue the conversation or create a new issue.