miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License
841 stars 119 forks source link

Add __repr__ methods to all token classes #140

Closed doerwalter closed 2 years ago

doerwalter commented 2 years ago

This pull request adds a __repr__ method to all token classes. To simplify implementation it introduces a common base class token.Token for block_token.BlockToken and span_taken.SpanToken. For the following example program:

from mistletoe import block_token, span_token, utils
d = block_token.Document(open("test/samples/quotes.md", "r").read())
print([tr.node for tr in utils.traverse(d)])

The output looks like this without __repr__ methods:

[<mistletoe.block_token.Heading at 0x10fa5aec0>,
 <mistletoe.block_token.Quote at 0x10f99e2c0>,
 <mistletoe.block_token.Paragraph at 0x10fb95a50>,
 <mistletoe.block_token.Quote at 0x10fb94130>,
 <mistletoe.block_token.Paragraph at 0x10fb97970>,
 <mistletoe.block_token.Quote at 0x10fb94c10>,
 <mistletoe.block_token.Paragraph at 0x10fbc00d0>,
 <mistletoe.block_token.Quote at 0x10fbc0d30>,
 <mistletoe.block_token.Paragraph at 0x10fbc1270>,
 <mistletoe.span_token.RawText at 0x10fa5b4f0>,
 <mistletoe.block_token.Paragraph at 0x10f99d8a0>,
 <mistletoe.span_token.RawText at 0x10fb95a20>,
 <mistletoe.block_token.Paragraph at 0x10fb97910>,
 <mistletoe.block_token.Paragraph at 0x10fb97c70>,
 <mistletoe.span_token.RawText at 0x10fb97ca0>,
 <mistletoe.block_token.List at 0x10fbc0c40>,
 <mistletoe.span_token.RawText at 0x10fbc0580>,
 <mistletoe.block_token.Quote at 0x10fbc0850>,
 <mistletoe.block_token.Paragraph at 0x10fbc0fd0>,
 <mistletoe.block_token.Quote at 0x10fbc1090>,
 <mistletoe.span_token.RawText at 0x10fbc12a0>,
 <mistletoe.span_token.RawText at 0x10f99cd60>,
 <mistletoe.span_token.RawText at 0x10fb94160>,
 <mistletoe.span_token.RawText at 0x10fb948b0>,
 <mistletoe.block_token.ListItem at 0x10fbc0880>,
 <mistletoe.block_token.ListItem at 0x10fbc03d0>,
 <mistletoe.block_token.Paragraph at 0x10fbc0f10>,
 <mistletoe.span_token.RawText at 0x10fbc1000>,
 <mistletoe.block_token.Paragraph at 0x10fbc10f0>,
 <mistletoe.block_token.Paragraph at 0x10fbc11b0>,
 <mistletoe.block_token.Paragraph at 0x10fbc04f0>,
 <mistletoe.block_token.Paragraph at 0x10fbc0460>,
 <mistletoe.span_token.RawText at 0x10fbc0f70>,
 <mistletoe.span_token.RawText at 0x10fbc1120>,
 <mistletoe.span_token.RawText at 0x10fbc11e0>,
 <mistletoe.span_token.RawText at 0x10fbc0640>,
 <mistletoe.span_token.RawText at 0x10fbc0040>]

and like this with __repr__ methods:

[<mistletoe.block_token.Heading with 1 child content='Quotes' level=2 at 0x10f748c40>,
 <mistletoe.block_token.Quote with 1 child at 0x10f748250>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f776830>,
 <mistletoe.block_token.Quote with 2 children at 0x10f7748b0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f777310>,
 <mistletoe.block_token.Quote with 1 child at 0x10f774220>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f776440>,
 <mistletoe.block_token.Quote with 3 children at 0x10f7b9c90>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b82e0>,
 <mistletoe.span_token.RawText content='Quotes' at 0x10f748340>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f664a90>,
 <mistletoe.span_token.RawText content='A response to single quote.' at 0x10f776ec0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f774370>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f777490>,
 <mistletoe.span_token.RawText content='Quote with a list inside:' at 0x10f7776d0>,
 <mistletoe.block_token.List with 2 children loose=False start=1 at 0x10f7762f0>,
 <mistletoe.span_token.RawText content='Nested quotes:' at 0x10f7b9150>,
 <mistletoe.block_token.Quote with 1 child at 0x10f7ba410>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b8580>,
 <mistletoe.block_token.Quote with 2 children at 0x10f7b8b50>,
 <mistletoe.span_token.RawText content='Another paragraph.' at 0x10f7b8be0>,
 <mistletoe.span_token.RawText content='A single quote' at 0x10f665060>,
 <mistletoe.span_token.RawText content='A quote spreading...' at 0x10f777be0>,
 <mistletoe.span_token.RawText content='... multiple paragraphs' at 0x10f776bc0>,
 <mistletoe.block_token.ListItem with 1 child leader='1.' prepend=3 loose=False at 0x10f775d50>,
 <mistletoe.block_token.ListItem with 1 child leader='2.' prepend=3 loose=False at 0x10f776d40>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7ba3e0>,
 <mistletoe.span_token.RawText content='Quoted paragraph.' at 0x10f7b8460>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b9570>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b9ae0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7749a0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f776b30>,
 <mistletoe.span_token.RawText content='Nested line quote' at 0x10f7b8640>,
 <mistletoe.span_token.RawText content='Nested block quote.' at 0x10f7b86a0>,
 <mistletoe.span_token.RawText content='Jira does not seem to support '...+51 at 0x10f7b93f0>,
 <mistletoe.span_token.RawText content='first' at 0x10f776260>,
 <mistletoe.span_token.RawText content='second' at 0x10f7771c0>]
pbodnar commented 2 years ago

@doerwalter, thanks for this PR, it looks promising. One can already see parsed content with metadata via the AST renderer, but this is more light-weight & can be used anywhere.

Do you think you could also add a test or two to test this new enhancement? I know this seems like something that should not break easily, yet it is always good to have the code covered...

doerwalter commented 2 years ago

OK, I've added tests for most token classes. FootNote, HTMLBlock, SetextHeading, XWikiBlockMacroStart and XWikiBlockMacroEnd are missing since I didn't manage to come up with sourcecode that produces them.

doerwalter commented 2 years ago

[...] I've got just a few remarks written directly to the individual lines. Plus regarding the tests as a whole, if you would have time to check:

  1. It is always questionable whether to create the tested tokens by directly instantiating them, or by parsing them. But the parsing approach you chose is probably mostly OK in this case.

OK.

I'm also not sure whether most tokens even could be created by directly instantiating them from their constituent parts, as most __init__ methods require a match argument.

It would be great if it was possible to create tokens in this way, for example:

link = span_token.Link(title="mistletoe", target="https://pypi.org/project/mistletoe/")

but that's another project.

  1. To avoid repetition and to have the tests a little bit more "outside change resistant", here are some ideas for improvements: a) pass only what is different to the helper check_repr_prefix() method and check the presence of at 0x... in it explicitly,

I'm not exactly sure how the change would look like, can you give me an example?

b) with a few exceptions, for example span tokens are always searched inside parsed doc.children[0].children[0] -> introduce for example check_span_token_repr() or get_span_token(doc) to have this fact just at 1 place?

Unfortunately, the span token that the test extracts is not always doc.children[0]. Sometimes it's doc.children[1] and sometimes it's a child token of doc.children[0]. So I'm not sure, how we could accomplish moving the access to the span tokens out into a function.

But maybe we could add a __getitem__ to token.Token, so that doc.children[0].children[0] could be written as doc[0][0] but unfortunately that doesn't mesh with what __contains__ does. (I would expect a in b to be the same as any(c == a for c in b)).

BTW We should probably have all those helper (not-really-public) methods prefixed with _, but that is just a minor issue I guess...

I've renamed short_repr to _short_repr. Should I rename repr_attributes too? (Or maybe we should document this attribute, so that custom token classes could use it too.)

pbodnar commented 2 years ago

To avoid repetition and to have the tests a little bit more "outside change resistant", here are some ideas for improvements: a) pass only what is different to the helper check_repr_prefix() method and check the presence of at 0x... in it explicitly,

I'm not exactly sure how the change would look like, can you give me an example?

I was thinking about replacing for example this:

self.check_repr_prefix(doc, "<mistletoe.block_token.Document with 1 child at 0x")

... with this:

self._check_repr_matches(doc, "block_token.Document with 1 child")

... while the _check_repr_matches() would be the original method which would newly also check the common prefix and suffix is present.

pbodnar commented 2 years ago

b) with a few exceptions, for example span tokens are always searched inside parsed doc.children[0].children[0] -> introduce for example check_span_token_repr() or get_span_token(doc) to have this fact just at 1 place?

Unfortunately, the span token that the test extracts is not always doc.children[0]. Sometimes it's doc.children[1] and sometimes it's a child token of doc.children[0]. So I'm not sure, how we could accomplish moving the access to the span tokens out into a function.

But maybe we could add a __getitem__ to token.Token, so that doc.children[0].children[0] could be written as doc[0][0] but unfortunately that doesn't mesh with what __contains__ does. (I would expect a in b to be the same as any(c == a for c in b)).

Well, I'm not sure now really. :) I thought some of the testing inputs of span tokens could be changed, so that doc.children[0].children[0] is nearly always used. But I don't want to bother you too much with this, I think we can live with how it is coded for now and possibly do some refactoring later on if really needed...

pbodnar commented 2 years ago

BTW We should probably have all those helper (not-really-public) methods prefixed with _, but that is just a minor issue I guess...

I've renamed short_repr to _short_repr. Should I rename repr_attributes too? (Or maybe we should document this attribute, so that custom token classes could use it too.)

OK, good question. I agree that documenting the repr_attributes field would not be bad. I've done research on how to document class attributes in Python - it looks like putting a simple string literal beneath it is sufficient (see SO answer). BTW You could also move the Base token class. pydoc after the class Token line when at it, I believe that would be more correct.

doerwalter commented 2 years ago

I was thinking about replacing for example this:

self.check_repr_prefix(doc, "<mistletoe.block_token.Document with 1 child at 0x")

... with this:

self._check_repr_matches(doc, "block_token.Document with 1 child")

... while the _check_repr_matches() would be the original method which would newly also check the common prefix and suffix is present.

OK, done.

doerwalter commented 2 years ago

OK, good question. I agree that documenting the repr_attributes field would not be bad. I've done research on how to document class attributes in Python - it looks like putting a simple string literal beneath it is sufficient (see SO answer). BTW You could also move the Base token class. pydoc after the class Token line when at it, I believe that would be more correct.

OK, I've added a docstring to token.Token. I've included the documentation for Token.repr_attributes in the class docstring, since it's a little bit longer and the builtin help() function doesn't pick up the "attribute docstring", if I put it after the attribute. I.e. when I use the following Token class:

class Token:
    """
    Base token class.

    Token has two subclasses:

    * block_token.BlockToken, for all block level tokens. A block level token
      is text which occupies the entire horizontal width of the "page" and is
      offset for the surrounding sibling block with line breaks.

    * span_token.SpanToken. for all span-level (or inline-level) tokens.
      A span-level token appears inside the flow of the text lines without any
      surrounding line break.

    """

    repr_attributes = ()
    """
    Custom __repr__ methods in subclasses: The default __repr__ implementation
    outputs the number of child tokens (from the attribute children) if
    applicable, and the content attribute if applicable. If any additional
    attributes should be included in the __repr__ output, this can be specified
    by setting the class attribute repr_attributes to a tuple containing the
    attribute names to be output.
    """

the test script

from mistletoe import token
help(token.Token)

gives me:

Help on class Token in module mistletoe.token:

class Token(builtins.object)
 |  Base token class.
 |
 |  Token has two subclasses:
 |
 |  * block_token.BlockToken, for all block level tokens. A block level token
 |    is text which occupies the entire horizontal width of the "page" and is
 |    offset for the surrounding sibling block with line breaks.
 |
 |  * span_token.SpanToken. for all span-level (or inline-level) tokens.
 |    A span-level token appears inside the flow of the text lines without any
 |    surrounding line break.
 |
 |  Methods defined here:
 |
 |  __repr__(self)
 |      Return repr(self).
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  repr_attributes = ()

BTW, do we want to use ReST markup for the docstrings, or Markdown or do we want to keep the docstring in plain text?

doerwalter commented 2 years ago

Thank you, I think it is OK as you did it.

OK.

Yet, just a note, if we can believe what is written on the linked page, this (i. e. docstrings below attributes) is not implemented in Python "core" (like help()) for some reason,

The reason for that is that normal docstrings will be added as a __doc__ attribute to the documented object (module, class, function) by the compiler. So showing such a doc string can simply be done via introspection. But for an "attribute docstring" (which is simply a string constant after the documented attribute) the string will vanish from the compiled code (as there is no reliable way to attach the docstring to anything), so for tools to be able to show such a string, they would have to find the module source code, load it (using the correct text encoding), and correlate the source code with the compiled module content. This is several orders of magnitude more complicated than just doing print(thing.__doc__).

but it is de-facto standard supported by editors like VS Code and various documentation generators. So I wouldn't personally be afraid of using it in the future.

The linked article states that Sphinx supports it too, but I couldn't get it to work. I tried it in another project with the .. automodule:: directive, and the attribute doesn't show up. Even if I use both the .. automodule:: and the .. autoattribute:: directive it doesn't work.

BTW, do we want to use ReST markup for the docstrings, or Markdown or do we want to keep the docstring in plain text?

Again, I'm not an expert, but I think "Markdown with ReST extras" is just fine here, while keeping some consistency across the code. BTW, you can see a recent discussion on how to use single/ double backticks here.

OK, if you want, I can create another pull request that updates the docstrings to Markdown. But I believe that mixing Markdown and ReST might be problematic. Backticks work differently for example:

In Markdown single backticks are used for computer code, and you can surround your computer code with multiple backticks if your computer code includes backticks itself (which doesn't seem to work if the backticks are at the beginning and end of the code, and Github doesn't seem to support this, so I can't show you an example ;)).

In ReST generic computer code is marked up with double backticks. Single backticks are used for "interpreted text roles" which look like this:

:class:`token.Token`

(see https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html for interpreted text roles in Sphinx.)

If you omit the role name (class in the above example), ReST/docutils/Sphinx will use a default text role (which for Sphinx can be configured via default_role (see https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-default_role).

So IMHO it would be better to convert the docstring to ReST.

pbodnar commented 2 years ago

@doerwalter, thanks for your analysis, it is an interesting topic, although we got a little bit outside of the scope of this issue... ;)

... The reason for that is that normal docstrings will be added as a __doc__ attribute to the documented object (module, class, function) by the compiler. So showing such a doc string can simply be done via introspection. ... This is several orders of magnitude more complicated than just doing print(thing.__doc__).

I see, the reason is mentioned in the linked SO page, the reasons behind deciding not to add __doc__ to attributes as well are more blurry though, but I guess we can't do anything about it here. ;) In short, I use primarily what VS Code parses (possibly "going against Green ICT", but maybe they would do extra parsing anyway, I don't know) and hints me anyway, so no big deal of using docstring at attributes, yet I can imagine someone like you may need to see the complete docs elsewhere too. So OK, let's leave it in the class docstring for now.

... OK, if you want, I can create another pull request that updates the docstrings to Markdown. But I believe that mixing Markdown and ReST might be problematic. ...

So IMHO it would be better to convert the docstring to ReST.

Yes, as you and the linked comment state, when choosing ReST, simply use doubled backticks for any code and single backticks for code which we would like to have also linked to an existing code construct. So I would keep it this way, it is used in some other code already. That said, I'm not quite sure if also adding a "role" before single backticks really pays off, I'm rather against it in order to KISS, but if you persuade me it really has some great benefits, maybe I would change my mind... :)

doerwalter commented 2 years ago

OK, so should we merge this pull request, so I can open another one for adding proper markup to the docstrings, or should I update the pull request to add that? (But probably just to the new docstrings, not to the existing ones, which, I guess, is out of scope for this pull request.)

pbodnar commented 2 years ago

Hi, I would choose to update just the new docstrings here. And if you agree, I would select "Squash and merge" to merge this PR afterwards, I think it is good for situations like this one (I haven't tried it here on GitHub before).

doerwalter commented 2 years ago

OK, I've updated the docstrings in token.py.

pbodnar commented 2 years ago

OK, merged, thank you! :)