Closed doerwalter closed 2 years ago
@doerwalter, thanks for this PR, it looks promising. One can already see parsed content with metadata via the AST renderer, but this is more light-weight & can be used anywhere.
Do you think you could also add a test or two to test this new enhancement? I know this seems like something that should not break easily, yet it is always good to have the code covered...
OK, I've added tests for most token classes. `FootNote`, `HTMLBlock`, `SetextHeading`, `XWikiBlockMacroStart` and `XWikiBlockMacroEnd` are missing, since I didn't manage to come up with source code that produces them.
[...] I've got just a few remarks written directly to the individual lines. Plus regarding the tests as a whole, if you would have time to check:
- It is always questionable whether to create the tested tokens by directly instantiating them, or by parsing them. But the parsing approach you chose is probably mostly OK in this case.
OK.
I'm also not sure whether most tokens even could be created by directly instantiating them from their constituent parts, as most `__init__` methods require a `match` argument.
It would be great if it was possible to create tokens in this way, for example:

```python
link = span_token.Link(title="mistletoe", target="https://pypi.org/project/mistletoe/")
```

but that's another project.
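A hypothetical sketch of how such keyword-based construction could be layered on top of a match-based `__init__`. The classmethod `from_parts` and its signature are invented for illustration here; this is not mistletoe's actual API:

```python
# Hypothetical sketch, not part of mistletoe: an alternate constructor
# that builds a token without parsing any source text.
class Link:
    def __init__(self, match=None):
        # real tokens populate their attributes from a regex match object
        if match is not None:
            self.target = match.group(1)
            self.title = match.group(2)

    @classmethod
    def from_parts(cls, *, target, title=""):
        # skip the match-based __init__ and set the attributes directly
        token = cls.__new__(cls)
        token.target = target
        token.title = title
        return token

link = Link.from_parts(title="mistletoe", target="https://pypi.org/project/mistletoe/")
```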
- To avoid repetition and to have the tests a little bit more "outside change resistant", here are some ideas for improvements: a) pass only what is different to the helper `check_repr_prefix()` method and check the presence of `at 0x...` in it explicitly,
I'm not exactly sure what the change would look like; can you give me an example?
- b) with a few exceptions, span tokens are always searched inside the parsed `doc.children[0].children[0]` -> introduce for example `check_span_token_repr()` or `get_span_token(doc)` to have this fact in just one place?
Unfortunately, the span token that the test extracts is not always `doc.children[0]`. Sometimes it's `doc.children[1]`, and sometimes it's a child token of `doc.children[0]`. So I'm not sure how we could move the access to the span tokens out into a function.
But maybe we could add a `__getitem__` to `token.Token`, so that `doc.children[0].children[0]` could be written as `doc[0][0]`, but unfortunately that doesn't mesh with what `__contains__` does. (I would expect `a in b` to be the same as `any(c == a for c in b)`.)
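For what it's worth, Python's fallback behaviour can be sketched with a toy class (illustrative only, not mistletoe's `token.Token`): for a class that defines no other iteration or containment hooks, an integer-indexed `__getitem__` makes the `in` operator fall back to iterating `obj[0]`, `obj[1]`, ... and comparing, which does match the `any(c == a for c in b)` expectation for direct children:

```python
# Toy sketch: indexing a tree node into its children.
class Node:
    def __init__(self, *children):
        self.children = list(children)

    def __getitem__(self, index):
        # allows node[0][1] instead of node.children[0].children[1]
        return self.children[index]

doc = Node(Node("a", "b"), Node("c"))
assert doc[0][1] == "b"

# With __getitem__ defined (and no __iter__/__contains__), the `in`
# operator iterates doc[0], doc[1], ... until IndexError, so membership
# is tested among the direct children.
assert doc.children[0] in doc
```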
BTW, we should probably have all those helper (not-really-public) methods prefixed with `_`, but that is just a minor issue I guess...
I've renamed `short_repr` to `_short_repr`. Should I rename `repr_attributes` too? (Or maybe we should document this attribute, so that custom token classes could use it too.)
> To avoid repetition and to have the tests a little bit more "outside change resistant", here are some ideas for improvements: a) pass only what is different to the helper `check_repr_prefix()` method and check the presence of `at 0x...` in it explicitly,
>
> I'm not exactly sure what the change would look like; can you give me an example?
I was thinking about replacing for example this:

```python
self.check_repr_prefix(doc, "<mistletoe.block_token.Document with 1 child at 0x")
```

... with this:

```python
self._check_repr_matches(doc, "block_token.Document with 1 child")
```

... while `_check_repr_matches()` would be the original method, which would newly also check that the common prefix and suffix are present.
> b) with a few exceptions, span tokens are always searched inside the parsed `doc.children[0].children[0]` -> introduce for example `check_span_token_repr()` or `get_span_token(doc)` to have this fact in just one place?
>
> Unfortunately, the span token that the test extracts is not always `doc.children[0]`. Sometimes it's `doc.children[1]`, and sometimes it's a child token of `doc.children[0]`. So I'm not sure how we could move the access to the span tokens out into a function. But maybe we could add a `__getitem__` to `token.Token`, so that `doc.children[0].children[0]` could be written as `doc[0][0]`, but unfortunately that doesn't mesh with what `__contains__` does. (I would expect `a in b` to be the same as `any(c == a for c in b)`.)
Well, I'm not sure now, really. :) I thought some of the testing inputs of span tokens could be changed, so that `doc.children[0].children[0]` is nearly always used. But I don't want to bother you too much with this; I think we can live with how it is coded for now and possibly do some refactoring later on if really needed...
> BTW, we should probably have all those helper (not-really-public) methods prefixed with `_`, but that is just a minor issue I guess...
>
> I've renamed `short_repr` to `_short_repr`. Should I rename `repr_attributes` too? (Or maybe we should document this attribute, so that custom token classes could use it too.)
OK, good question. I agree that documenting the `repr_attributes` field would not be bad. I've done some research on how to document class attributes in Python - it looks like putting a simple string literal beneath the attribute is sufficient (see SO answer). BTW, you could also move the `Base token class.` pydoc after the `class Token` line while at it; I believe that would be more correct.
> I was thinking about replacing for example this:
>
> `self.check_repr_prefix(doc, "<mistletoe.block_token.Document with 1 child at 0x")`
>
> ... with this:
>
> `self._check_repr_matches(doc, "block_token.Document with 1 child")`
>
> ... while `_check_repr_matches()` would be the original method, which would newly also check that the common prefix and suffix are present.
OK, done.
> OK, good question. I agree that documenting the `repr_attributes` field would not be bad. I've done some research on how to document class attributes in Python - it looks like putting a simple string literal beneath the attribute is sufficient (see SO answer). BTW, you could also move the `Base token class.` pydoc after the `class Token` line while at it; I believe that would be more correct.
OK, I've added a docstring to `token.Token`. I've included the documentation for `Token.repr_attributes` in the class docstring, since it's a little bit longer and the builtin `help()` function doesn't pick up the "attribute docstring" if I put it after the attribute. I.e. when I use the following `Token` class:
```python
class Token:
    """
    Base token class.

    Token has two subclasses:

    * block_token.BlockToken, for all block-level tokens. A block-level token
      is text which occupies the entire horizontal width of the "page" and is
      offset from the surrounding sibling blocks with line breaks.

    * span_token.SpanToken, for all span-level (or inline-level) tokens.
      A span-level token appears inside the flow of the text lines without any
      surrounding line break.
    """

    repr_attributes = ()
    """
    Custom __repr__ methods in subclasses: The default __repr__ implementation
    outputs the number of child tokens (from the attribute children) if
    applicable, and the content attribute if applicable. If any additional
    attributes should be included in the __repr__ output, this can be specified
    by setting the class attribute repr_attributes to a tuple containing the
    attribute names to be output.
    """
```
the test script

```python
from mistletoe import token
help(token.Token)
```

gives me:
```
Help on class Token in module mistletoe.token:

class Token(builtins.object)
 |  Base token class.
 |
 |  Token has two subclasses:
 |
 |  * block_token.BlockToken, for all block-level tokens. A block-level token
 |    is text which occupies the entire horizontal width of the "page" and is
 |    offset from the surrounding sibling blocks with line breaks.
 |
 |  * span_token.SpanToken, for all span-level (or inline-level) tokens.
 |    A span-level token appears inside the flow of the text lines without any
 |    surrounding line break.
 |
 |  Methods defined here:
 |
 |  __repr__(self)
 |      Return repr(self).
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  repr_attributes = ()
```
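As an illustration of the mechanism the docstring describes, a minimal default `__repr__` driven by `repr_attributes` could look roughly like this. This is a sketch of the idea only, not mistletoe's actual implementation:

```python
# Sketch of a repr_attributes-driven default __repr__ (illustrative only).
class Token:
    repr_attributes = ()

    def __repr__(self):
        output = "<{}.{}".format(type(self).__module__, type(self).__qualname__)
        # number of children, if the token has any
        if hasattr(self, "children"):
            count = len(self.children)
            output += " with {} child{}".format(count, "" if count == 1 else "ren")
        # raw content, if the token has any
        if hasattr(self, "content"):
            output += " content={!r}".format(self.content)
        # any extra attributes a subclass opted into
        for attr in self.repr_attributes:
            output += " {}={!r}".format(attr, getattr(self, attr))
        return output + " at 0x{:x}>".format(id(self))

class Link(Token):
    repr_attributes = ("target", "title")

    def __init__(self, target, title):
        self.target = target
        self.title = title

link = Link("https://pypi.org/project/mistletoe/", "mistletoe")
```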
BTW, do we want to use ReST markup for the docstrings, or Markdown or do we want to keep the docstring in plain text?
Thank you, I think it is OK as you did it.
OK.
Yet, just a note: if we can believe what is written on the linked page, this (i.e. docstrings below attributes) is not implemented in Python "core" (like `help()`) for some reason,
The reason for that is that normal docstrings will be added as a `__doc__` attribute to the documented object (module, class, function) by the compiler. So showing such a docstring can simply be done via introspection. But for an "attribute docstring" (which is simply a string constant after the documented attribute), the string will vanish from the compiled code (as there is no reliable way to attach the docstring to anything), so for tools to be able to show such a string, they would have to find the module source code, load it (using the correct text encoding), and correlate the source code with the compiled module content. This is several orders of magnitude more complicated than just doing `print(thing.__doc__)`.
but it is de-facto standard supported by editors like VS Code and various documentation generators. So I wouldn't personally be afraid of using it in the future.
The linked article states that Sphinx supports it too, but I couldn't get it to work. I tried it in another project with the `.. automodule::` directive, and the attribute doesn't show up. Even if I use both the `.. automodule::` and the `.. autoattribute::` directives, it doesn't work.
> BTW, do we want to use ReST markup for the docstrings, or Markdown, or do we want to keep the docstrings in plain text?
Again, I'm not an expert, but I think "Markdown with ReST extras" is just fine here, while keeping some consistency across the code. BTW, you can see a recent discussion on how to use single/ double backticks here.
OK, if you want, I can create another pull request that updates the docstrings to Markdown. But I believe that mixing Markdown and ReST might be problematic. Backticks work differently for example:
In Markdown, single backticks are used for computer code, and you can surround your computer code with multiple backticks if the code includes backticks itself (which doesn't seem to work if the backticks are at the beginning and end of the code, and GitHub doesn't seem to support this, so I can't show you an example ;)).
In ReST, generic computer code is marked up with double backticks. Single backticks are used for "interpreted text roles", which look like this: `` :class:`token.Token` `` (see https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html for interpreted text roles in Sphinx). If you omit the role name (`class` in the above example), ReST/docutils/Sphinx will use a default text role (which for Sphinx can be configured via `default_role`; see https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-default_role).
So IMHO it would be better to convert the docstring to ReST.
@doerwalter, thanks for your analysis, it is an interesting topic, although we got a little bit outside of the scope of this issue... ;)
> ... The reason for that is that normal docstrings will be added as a `__doc__` attribute to the documented object (module, class, function) by the compiler. So showing such a docstring can simply be done via introspection. ... This is several orders of magnitude more complicated than just doing `print(thing.__doc__)`.
I see, the reason is mentioned on the linked SO page; the reasons behind deciding not to add `__doc__` to attributes as well are blurrier, though, but I guess we can't do anything about it here. ;) In short, I primarily use what VS Code parses (possibly "going against Green ICT", but maybe they would do extra parsing anyway, I don't know) and hints me anyway, so using docstrings at attributes is no big deal for me, yet I can imagine someone like you may need to see the complete docs elsewhere too. So OK, let's leave it in the class docstring for now.
> ... OK, if you want, I can create another pull request that updates the docstrings to Markdown. But I believe that mixing Markdown and ReST might be problematic. ...
>
> So IMHO it would be better to convert the docstring to ReST.
Yes, as you and the linked comment state, when choosing ReST, simply use double backticks for any code, and single backticks for code which we would also like to have linked to an existing code construct. So I would keep it this way; it is used in some other code already. That said, I'm not quite sure whether also adding a "role" before single backticks really pays off; I'm rather against it in order to KISS, but if you persuade me it really has some great benefits, maybe I would change my mind... :)
OK, so should we merge this pull request, so I can open another one for adding proper markup to the docstrings, or should I update the pull request to add that? (But probably just to the new docstrings, not to the existing ones, which, I guess, is out of scope for this pull request.)
Hi, I would choose to update just the new docstrings here. And if you agree, I would select "Squash and merge" to merge this PR afterwards, I think it is good for situations like this one (I haven't tried it here on GitHub before).
OK, I've updated the docstrings in `token.py`.
OK, merged, thank you! :)
This pull request adds a `__repr__` method to all token classes. To simplify the implementation, it introduces a common base class `token.Token` for `block_token.BlockToken` and `span_token.SpanToken`. For the following example program: The output looks like this without `__repr__` methods: and like this with `__repr__` methods:
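Since the example program and its two outputs are not reproduced above, here is a small self-contained stand-in (plain classes, not mistletoe tokens) showing the kind of difference the PR makes:

```python
class Plain:
    # Without a custom __repr__, instances show only class and address.
    def __init__(self, content):
        self.content = content

class WithRepr:
    def __init__(self, content):
        self.content = content

    def __repr__(self):
        # Mirrors the style of the new token reprs: class name, key
        # attributes, and the object id.
        return "<{} content={!r} at 0x{:x}>".format(
            type(self).__qualname__, self.content, id(self))

print(repr(Plain("Hello")))     # something like: <__main__.Plain object at 0x7f...>
print(repr(WithRepr("Hello")))  # something like: <WithRepr content='Hello' at 0x7f...>
```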