textile / python-textile

A Python port of Textile, A humane web text generator

Problems with ":" inside quotes #27

Closed rcarmo closed 8 years ago

rcarmo commented 8 years ago

I've recently upgraded from 2.2.2 to 2.3.1, and out of the 7000-odd documents on my site, a few hundred of them are breaking the new parser.

After fiddling about with it on Jupyter notebook, I isolated the following test case:

from textile import Textile
t = Textile(html_type="html5")
buffer="""
* Folders with ":" in their names are displayed with a forward slash "/" instead. (Filed as "#4581709":Radar:4581709, which was considered "normal behaviour" - quote: "Please note that Finder presents the 'Carbon filesystem' view, regardless of the underlying filesystem.")

"""
t.parse(buffer)

This blows up with the following traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-27ca3ebfabb8> in <module>()
----> 1 t.parse(buffer)

/Users/rcarmo/.pyenv/versions/2.7.10/lib/python2.7/site-packages/textile/core.pyc in parse(self, text, rel, sanitize)
    249                         'pre', 'h[1-6]',
    250                         'fn{0}+'.format(regex_snippets['digit']), '###']
--> 251                 text = self.block(text)
    252                 text = self.placeNoteLists(text)
    253         else:

/Users/rcarmo/.pyenv/versions/2.7.10/lib/python2.7/site-packages/textile/core.pyc in block(self, text)
    454                 whitespace = ' \t\n\r\f\v'
    455                 if ext or not line[0] in whitespace:
--> 456                     block = Block(self, tag, atts, ext, cite, line)
    457                     if block.tag == 'p' and not has_raw_text(block.content):
    458                         line = block.content

/Users/rcarmo/.pyenv/versions/2.7.10/lib/python2.7/site-packages/textile/objects/block.pyc in __init__(self, textile, tag, atts, ext, cite, content)
     27         self.inner_atts = OrderedDict()
     28         self.eat = False
---> 29         self.process()
     30 
     31     def process(self):

/Users/rcarmo/.pyenv/versions/2.7.10/lib/python2.7/site-packages/textile/objects/block.pyc in process(self)
    119 
    120         if not self.eat:
--> 121             self.content = self.textile.graf(self.content)
    122         else:
    123             self.content = ''

/Users/rcarmo/.pyenv/versions/2.7.10/lib/python2.7/site-packages/textile/core.pyc in graf(self, text)
    580 
    581         text = self.getRefs(text)
--> 582         text = self.links(text)
    583 
    584         if not self.noimage:

/Users/rcarmo/.pyenv/versions/2.7.10/lib/python2.7/site-packages/textile/core.pyc in links(self, text)
    601         does not match a trailing parenthesis.  It gets caught by tail, and
    602         we check later to see if it should be included as part of the url."""
--> 603         text = self.markStartOfLinks(text)
    604 
    605         return self.replaceLinks(text)

/Users/rcarmo/.pyenv/versions/2.7.10/lib/python2.7/site-packages/textile/core.pyc in markStartOfLinks(self, text)
    649                         if re.search(r'\S$', possibility, flags=re.U): # pragma: no branch
    650                             balanced = balanced + 1
--> 651                         possibility = possible_start_quotes.pop()
    652                     else:
    653                         # If quotes occur next to each other, we get zero

IndexError: pop from empty list

This is fairly common markup in my docs, too. I often quote non-alphanumeric stuff for emphasis. Also, what happened to the head_offset parameter to parse()? I see no mention of it anywhere. Is there a changelog someplace?
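
For anyone trying to find which of several thousand documents trip the parser after an upgrade, a batch check along these lines may help. This is only a sketch written for Python 3: `find_failures`, the `documents` mapping, and `fake_render` are hypothetical names introduced here, and `fake_render` merely imitates the failure above so the example runs without textile installed. In practice `render` could be the `t.parse` call from the test case.

```python
def find_failures(documents, render):
    """Return {name: "ExceptionType: message"} for documents that make render() raise.

    documents: mapping of document name -> markup text (hypothetical input shape).
    render: any callable that takes markup text and returns HTML.
    """
    failures = {}
    for name, text in documents.items():
        try:
            render(text)
        except Exception as exc:
            # Record a one-line summary so failures can be grouped by cause.
            failures[name] = "{0}: {1}".format(type(exc).__name__, exc)
    return failures


def fake_render(text):
    # Stand-in renderer (an assumption, not textile): fails on a quoted colon.
    if '":"' in text:
        return [].pop()  # raises IndexError, like the traceback above
    return "<p>" + text + "</p>"


docs = {
    "ok": "plain text",
    "bad": 'Folders with ":" in their names',
}
result = find_failures(docs, fake_render)
print(result)
```

Grouping the recorded messages by exception text should show quickly whether the few hundred failures share one cause or several.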

ikirudennis commented 8 years ago

Looking into your two issues now. And yeah, sorry about not putting the removal of head_offset in the changelog. The change was made in d4ac0f5. It's no longer a part of php-textile, and I think it's been gone from there for a while.

ikirudennis commented 8 years ago

I think I've got this working, I just need to confirm that I'm generating the correct expected output. Here's the inner content of that li:

Folders with &#8220;:&#8221; in their names are displayed with a forward slash &#8220;/&#8221; instead. (Filed as <a href="Radar%3A4581709">#4581709</a>, which was considered &#8220;normal behaviour&#8221; &#8211; quote: &#8220;Please note that Finder presents the &#8216;Carbon filesystem&#8217; view, regardless of the underlying filesystem.&#8221;)

rcarmo commented 8 years ago

Those are smartypants quotes, right? I think the href shouldn't be URL-encoded (I rely on custom URL schemas for post-processing).
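
To illustrate the difference being discussed: percent-encoding a link target with the standard library's default settings escapes the colon, while listing ":" as a safe character preserves it. This is a sketch of the general mechanism in Python 3, not a statement about how python-textile builds its href values.

```python
from urllib.parse import quote

target = "Radar:4581709"

# With the default safe="/" the colon is escaped as %3A.
encoded = quote(target)

# Adding ":" to the safe set leaves a custom scheme separator intact.
preserved = quote(target, safe="/:")

print(encoded)
print(preserved)
```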

2.2.2 gives me this (including the li), which is otherwise the same:

u'\t<ul>\n\t\t<li>Folders with &#8220;:&#8221; in their names are displayed with a forward slash &#8220;/&#8221; instead. (Filed as <a href="Radar:4581709">#4581709</a>, which was considered &#8220;normal behaviour&#8221; &#8211; quote: &#8220;Please note that Finder presents the &#8216;Carbon filesystem&#8217; view, regardless of the underlying filesystem.&#8221;)</li>\n\t</ul>'

ikirudennis commented 8 years ago

Testing the Radar link against txstyle.org, it seems php-textile doesn't handle it very well:

<p>Filed as &#8220;#4581709&#8221;:Radar:4581709, which was considered &#8220;normal behaviour&#8221;</p>

It's not even interpreted as a link. We're failing a little more gracefully in this case.

Does your post-processing function with the encoded Radar url? If not, I can help you work out a way around this. Open a new issue about it and we'll put our heads together about subclassing Textile to get you what you need.

Otherwise, should I close this?

rcarmo commented 8 years ago

Hmm. Yeah, but we need to sort out the URL stuff separately - I've been using textile since 2.0 and used PHP before rebuilding my site atop Python, and custom URL schemas were never a problem. In fact, just to give you an idea, this is the list of custom schemas I've been using for around ten years.

Right now, my current solution is completely markup-agnostic: I have Markdown, reST and other renderers working, and they allow for custom schemas (including Textile 2.2.x), so if 2.3.x is URL-encoding the whole URL, it's diverging from what I perceive as the norm...
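
Until the encoding question is settled, one possible stopgap is to restore the scheme separator in a post-processing pass over the rendered HTML. The sketch below is written for Python 3 and assumes the only change is that the first ":" after the scheme became "%3A", as in the output shown earlier in this thread. `restore_custom_schemes` and the `CUSTOM_SCHEMES` allow-list are hypothetical names, not part of textile.

```python
import re

# Hypothetical allow-list of custom schemes, compared in lower case.
CUSTOM_SCHEMES = {"radar"}

HREF_RE = re.compile(r'href="([^"]*)"')
# A scheme name followed by a percent-encoded colon, then the rest of the target.
ENCODED_SCHEME_RE = re.compile(r'^([A-Za-z][A-Za-z0-9+.-]*)%3[Aa](.*)$')


def restore_custom_schemes(html, schemes=CUSTOM_SCHEMES):
    """Turn href="Scheme%3Arest" back into href="Scheme:rest" for known schemes."""

    def fix(match):
        found = ENCODED_SCHEME_RE.match(match.group(1))
        if found and found.group(1).lower() in schemes:
            return 'href="{0}:{1}"'.format(found.group(1), found.group(2))
        # Leave every other href exactly as it was rendered.
        return match.group(0)

    return HREF_RE.sub(fix, html)


rendered = '(Filed as <a href="Radar%3A4581709">#4581709</a>)'
print(restore_custom_schemes(rendered))
```

Only the scheme separator is decoded, so any other percent-escapes in the link target are left untouched, and hrefs whose scheme is not on the allow-list pass through unchanged.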