Unescaped markup breaks publishing

spyoungtech commented 7 years ago

I ran into an issue attempting to publish an rst document which, in relevant part, contains a list that has url's that contain placeholder text (IE <placeholder>) which resulted in a ConfluenceBadApiError exception.

Checking the response returned by Confluence:

{"statusCode":400,"data":{"authorized":false,"valid":true,"errors":[],"successful":false},"message":"Error parsing xhtml: Unexpected close tag </li>; expected </project-name>.\n at [row,col {unknown-source}]: [139,67]"}

Sample of the RST content:

Deployment
----------

We will discuss this topic in depth,
but here are some links for your further reference
when the class is complete:

* It should be possible to export the contents of ``_build/html``
  to any file-system-based web service and serve it as static content.

* You can package the documentation in a ZIP file
  and upload it using the “edit” page for your Python package,
  and it will appear at the URL:

  http://pythonhosted.org/<project-name>

  Detailed instructions for this procedure live at:

  http://pythonhosted.org/<project-name>

* The powerful and popular Read the Docs service
  lets you configure your GitHub repository
  so that every time you push a new version of your software,
  the documentation gets automatically rebuilt
  and made available at:

  https://readthedocs.org/projects/<project-name>/

  Read the Docs also supports custom host names
  if you want your documentation to appear beneath your own
  project sub-domain.
  More information is available at:

jdknight commented 7 years ago

@spyoungtech, thanks for the sample case. Definitely an issue handling the greater-than/less-than characters. Both REST and XML-RPC will fail.

I tried making a quick fix escaping some characters, however contentbody/convert/storage (and convertWikiToStorageFormat) appear to unescape the data again before a storage request is made. So, there may be a need to two-passes to escape some characters (from just a quick look at this). When this issue can be looked at, it would be good to check for additional characters which could cause this issue as well.

tonybaloney commented 7 years ago

thanks for filing the issue @spyoungtech I'll hopefully get around to reviewing this next week

tonybaloney commented 7 years ago

This is a tricky one, I'm reading the spec and <, > are reserved characters so it's not clear whether your document is even valid RST

http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup-recognition-rules

I'm reluctant to just put a blanket replace method in for those characters since they can be used for other signs and it'll just create new bugs.

Another option could be to use a replace method only within the Sphinx builder for paragraphs. I don't know if that helps you because it's a list item

spyoungtech commented 7 years ago

Thanks for the response @tonybaloney -- I hadn't considered that it may technically be invalid rst though I'm not entirely clear from reading the spec, myself. Sphinx does, however, process this material fine, for example, when building to html, like for readthedocs.

When built to html, the 'offending' < and > appear to change to < and > respectively.

<a class="reference external" href="http://pythonhosted.org">http://pythonhosted.org</a>/&lt;project-name&gt;</p>

Based on some related issues, it looks like there are other troublesome characters that confluence does not like.

I agree that a blanket replace is not the way to go. When I clear off some other things from my plate at work, I'll be able to use my paid time to read up on the source code, investigate, and hopefully contribute some code or ideas.

For your reference, I am trying to publish the material contained in the Brandon Rhodes sphinx-tutorial -- this content comes out of the 'handout' directory.

tonybaloney commented 7 years ago

I've written a fix that replaces certain characters inside a paragraph. I'll push a PR and if you would be able to test it or share

tonybaloney commented 7 years ago

this fix does not fix the reported issue :-(

I'm still not sure what Sphinx is supposed to do because you can't have triangular brackets in a URL. Looking at the Github renderer, it shows a URL and then a piece of text, so it separates the 2. @brandon-rhodes tutorial looks great and we should support it but I'm not sure what the behaviour should be

tonybaloney commented 7 years ago

this is what github renders the RST as

<p><a href="http://pythonhosted.org">http://pythonhosted.org</a>/&lt;project-name&gt;</p>

brandon-rhodes commented 7 years ago

Thanks for pointing out the problem! Since those are not valid URLs, they should not be bare text in the RST document. I have just committed a fix that turns them into code literals, since they are samples and cannot literally be clicked on and followed. Let me know if this resolves the problem!

spyoungtech commented 7 years ago

What Brandon says makes perfect sense. The URLs in this example should be literals, rather than bare text.

@tonybaloney this is the behavior I'm seeing now after pulling down the changes in #43

I see the text replacement in the produced .conf file (this example is after pulling Brandon's changes to the tutorial repo)

{{http://pythonhosted.org/&lt;project-name&gt;}}

But then after it goes through the confluence storage converter, it appears that, as @jdknight mentioned earlier, confluence unescapes this in the storage format to

<code>https://readthedocs.org/projects/<project-name>/</code>

This is the same result prior to #43 (IE happens in both cases)

Therefore, it still throws the bad api error because of the unexpected "tag". What's interesting to me is that the API is returning invalid storage format content. To me, this seems more of a bug with Atlassian's API. Wondering if there is an open issue on this in Atlassian's bug tracker.

This leads me to believe that a sanitization step is needed after the storage format conversion, unless it's possible to produce some format that won't produce this problem when run through the conversion API...

I'm going to look at this some more this week.

spyoungtech commented 7 years ago

Also worth mentioning that, if I take the product of the .conf file (as produced before #43) and manually insert this markup into confluence, it works as expected. However, I noticed that it uses a tinymce rest endpoint to do the conversion at https://<BASE_URL>/rest/tinymce/1/wikixhtmlconverter

This does the right thing, whereas the Confluence REST API does not. I'm running Confluence Server 6.2.2

screen shot 2017-08-21 at 12 33 00 pm

jdknight commented 7 years ago

This issue should now be resolved in db5b916e57898eb86244b4c33a801f5da5c3b984 (available on the master; see specifically 255043422d25f8e437baf3182eafa6464a1ab288). To take advantage of this change, one can pull from master. This change should be made available in an upcoming v0.7 release (#29).

I'll be marking this as closed as I believe the root issue has been resolved. If there's another issue I have missed, please feel free to bring it up in another issue.

sphinx-contrib / confluencebuilder

Unescaped markup breaks publishing #39