reStructuredText in docstrings not converted to Markdown

thomasjm commented 4 years ago

Some Python docstrings use reStructuredText, one notable example being pandas. Here's an example from that link:

def add(num1, num2):
  """
  Add up two integer numbers.

  This function simply wraps the `+` operator, and does not
  do anything interesting, except for illustrating what is
  the docstring of a very simple function.

  Parameters
  ----------
  num1 : int
      First number to add
  num2 : int
      Second number to add

  Returns
  -------
  int
      The sum of `num1` and `num2`

  See Also
  --------
  subtract : Subtract one integer from another

  Examples
  --------
  >>> add(2, 2)
  4
  >>> add(25, 0)
  25
  >>> add(10, -10)
  0
  """
  return num1 + num2

This language server feeds these docstrings back directly to the client without any transformation into Markdown as expected by the LSP spec, so they don't always render well when you do things like hovers. For example, the code block sections beginning with >>> don't look good.

The Microsoft language server has special parsing code in it to handle this; see https://github.com/microsoft/python-language-server/blob/master/src/LanguageServer/Impl/Documentation/DocstringConverter.cs.

Note: I copied this with from a very similar issue I filed on https://github.com/palantir/python-language-server/issues/760

pappasam commented 4 years ago

Thanks for opening this issue!

Note: I've chosen to "resolve" the issue by explicitly declaring the MarkupKind to be "plaintext". Since Python docstring convention varies widely across projects, I've decided that attempting to convert all docstrings into Markdown would be futile. See the relevant part of the LSP spec here: https://microsoft.github.io/language-server-protocol/specification#markupContent

My decision regarding the task's futility comes from some light prototyping on my end and my reading of this issue (along with several others related to markdown-conversion complexity): https://github.com/microsoft/python-language-server/issues/613

Since Python docstrings are generally plaintext, I think this is a reasonable solution for now. If you come up with any ideas that can be implemented simply, safely, and without any bloated 3rd party dependencies, I'd definitely consider them!

pappasam commented 4 years ago

After doing some manual experimentation with my language client, it looks like I can easily do the following: return a markdown result, rendering the function signature wrapped in a python language fence and the docstring itself wrapped in plain text.

I'll spend a bit of time making sure we stay compatible with language clients that may not prefer markdown.

I think this is the simplest compromise while still benefiting somewhat from markdown syntax highlighting capabilities

pappasam commented 4 years ago

I did implement the above, but my implementation didn't provide much eye candy, but it did introduce some additional complexity to the codebase. Given Python's lack of standard docstrings / everything being an object and all, the implementation mentioned here https://github.com/pappasam/jedi-language-server/issues/22#issuecomment-629852656 and already released is probably the least error-prone / most correct we can achieve. Happy to review PR's, and please open another issue if you can think of a good path forward involving markdown.

thomasjm commented 4 years ago

Hi @pappasam , thanks for jumping on this so quickly. It's definitely a tricky one due to good RST to Markdown converters being hard to find.

A few thoughts:

IMO the solution in https://github.com/pappasam/jedi-language-server/pull/23/commits/44292b95d275a56ba856bfe925ca375f211cb769 of removing all formatting is worse than the status quo, since RST and Markdown have a lot of similarities and some of it renders fine. Partial eye candy is better than no eye candy to me, even if some of it is wonky.
I'm not sure I agree that Python has a lack of a standard for docstrings. Nominally at least https://www.python.org/dev/peps/pep-0287/ seems to say that RST is the standard. The major projects like numpy and pandas all seem to use it from what I've seen.
One thing I'm thinking of trying is configuring jedi-language-server to accept a parameter which is a path to a pandoc binary, which can be used to shell out to do the conversion. It's a large and also GPL-ed dependency so I don't think you could bundle it directly, but at least users would be able to use it if they already have it.
Another thing I've put some work into is directly translating Microsoft's conversion class from C# to Python. I tried it a little bit and it seemed to work but it likely still needs some polish. You can see it here: https://gist.github.com/thomasjm/e509c39a02168384c91b470d92418d1c. Although looking at the issue you linked, their converter seems to have problems too.
One final thing I've considered but not tried is using Sphinx to do the conversion.

thomasjm commented 4 years ago

Here's a commit showing how I modified it to use pandoc when available; seems to work fine:

https://github.com/thomasjm/jedi-language-server/commit/b8f760912f4958e2641a036f8f879ddb4970a343

pappasam commented 4 years ago

@thomasjm thanks for the example! I'll meditate more on this. In general, to me, bad syntax highlighting is worse than none at all. I had previously disabled all syntax highlighting for hover windows in my client, but this issue helped me figure out how to special case just Python to plaintext (so thank you!).

That said, worst case, I'll make things configurable for you with an initializationOption so both of us will be happy. Best case, the pandoc solution or something like it will work!

thomasjm commented 4 years ago

Sounds good!

There's a little more progress on my branch which you can see here https://github.com/pappasam/jedi-language-server/compare/master...thomasjm:master

pappasam commented 4 years ago

@thomasjm update: version 0.15.0 supports markdown by default if your client supports it, but is configurable to use plaintext. Basically brings back the pre-0.14 behavior that you preferred. Should be a decent band-aid until we figure out the right markdown transformation solution.

thomasjm commented 4 years ago

Thanks! FYI I've been playing with Pandoc a bit more and found a few issues involving inconsistencies with how Pandoc handles the RST in different packages. I'm trying to tweak it to behave well in all cases. Interesting things to compare include

hovering "numpy" in import numpy
hovering "DataFrame" in df = pandas.DataFrame()
hovering the word "import" in any import

and seeing how things like RST doctest blocks render, as well as more complex literal blocks.

pappasam commented 4 years ago

Yes, I've noticed the same thing. Since it's quite challenging to identify whether a docstring is formatted in rst, markdown, or "other", and pandoc does not appear to have a "auto-language-detection" feature, using pandoc would require lots of special casing on a library-by-library basis. I think that would introduce lots of churn to this project. Whichever solution we end up on, it shouldn't involve knowing which 3rd party library we're providing a Hover response for.

This leaves the "direct translation of the Microsoft function" as our most viable candidate, right? IMHO, our goal here is to create one function that takes any string and converts aspects within it that are identified as RST that really muck up markdown formatting and returns a cleaned string ready for markdown rendering.

This function can live in its own module with tons of helper functions, but as long as the final interface is Callable[[str], str], I should be able to easily swap it into the language server.

pappasam commented 4 years ago

This library could be promising: https://github.com/dadadel/pyment

thomasjm commented 4 years ago

This leaves the "direct translation of the Microsoft function" as our most viable candidate, right?

I don't know, I have a feeling this will never be 100% correct--a casual scan of the issues on the MS language server showed difficulties parsing more complex RST constructs. It doesn't make any effort to auto-detect the docstring's actual format, it just tries to parse something RST-like. I kind of think of it as a buggy/incomplete version of what Pandoc does, albeit maybe better tuned for docstring issues. And potentially a maintenance nightmare to keep tuning.

(I remain confused about the discrepancies between mainstream libraries. Even between Numpy and Pandas, when Pandas says that its own format is "based on the Numpy Docstring Standard"...)

On the other hand, https://github.com/dadadel/pyment looks really good! Especially the part where it says it auto-detects different styles.

pappasam commented 4 years ago

@thomasjm I spent a good chunk of my day yesterday trying different tools / tricks to resolve this issue and have decided this is too challenging to solve at present.

Things I've tried:

https://github.com/PavloKapyshin/paka.cmark
https://github.com/dadadel/pyment
https://github.com/limodou/par
Converting rst to html with docutils then html to markdown with this tool: https://github.com/aaronsw/html2text
Simple escaping of certain annoying rst elements

Based on my efforts, I think the diversity of docstring in Python projects is too great to solve this in general. Going forward, I'll spend my energy resolving other issues, but I'm open to reviewing any PR's that you think might generally solve the problem!

thomasjm commented 4 years ago

@pappasam got it -- I was thinking of taking a stab at pyment, can you tell me what you tried? Thanks!

pappasam commented 4 years ago

For pyment, used this class: https://github.com/dadadel/pyment/blob/master/pyment/docstring.py#L1163

You'll need to pass the lines of code, using get_line_code method on the name returned from jedi_script.help, along with anything else you'd like to pass to the constructor. For example, you can experiment with the different output_style options, passing docs_raw (or not), etc. The new docstring can be generated / retrieved with the get_raw_docs method.

Note: none of the above got me close to solving this issue in a general sense, so valid solutions (if they exist, which is dubious IMHO) probably will require a different approach.

krassowski commented 3 years ago

Ok, I tried pyment and pypandoc too. They are not up to task. I hand-crafted https://github.com/krassowski/docstring-to-markdown for now. It's does the job just about right for >90% of pandas and numpy documentation entries as retrieved by jedi. It is conservative in the way that if if cannot confirm that it is in reStructuredText that it knows it raises an error. I plan to add Google docstrings handling too. Just wanted to keep you in the loop. I originally written it in typescript but reverted to Python later today.

pappasam commented 3 years ago

@krassowski amazing, I can't wait to play around with this!

krassowski commented 3 years ago

Please do let me know if you encounter any issues. Here is a work-in-progress for some documentation formatting (it will be rewritten to use the signatures list rather than signatures string which is troublesome, yet I initially went for it as it is being cached by jedi for numpy, pandas and friends).

pappasam / jedi-language-server

reStructuredText in docstrings not converted to Markdown #22