textile / python-textile

A Python port of Textile, A humane web text generator
Other
69 stars 22 forks source link

AttributeError: 'NoneType' object has no attribute 'groups' #37

Closed barsch closed 7 years ago

barsch commented 7 years ago
>>> import textile
>>> textile.__version__
'2.3.6'

works as expected:

>>> textile.textile('# xxx\n# yyy\n *blah*')
u'\t<ol>\n\t\t<li>xxx</li>\n\t\t<li>yyy\n <strong>blah</strong></li>\n\t</ol>'

removing space between \n and *blah*, raises an error

>>> textile.textile('# xxx\n# yyy\n*blah*')
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\core.py", line 1360, in textile
    return Textile(html_type=html_type).parse(text)
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\core.py", line 250, in parse
    text = self.block(text)
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\core.py", line 466, in block
    block = Block(self, tag, atts, ext, cite, line)
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\objects\block.py", line 29, in __init__
    self.process()
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\objects\block.py", line 123, in process
    self.content = self.textile.graf(self.content)
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\core.py", line 602, in graf
    text = self.textileLists(text)
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\core.py", line 297, in textileLists
    return pattern.sub(self.fTextileList, text)
  File "D:\Python\virtualenv\egu\django19\lib\site-packages\textile\core.py", line 312, in fTextileList
    tl, start, atts, content = m.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
ikirudennis commented 7 years ago

This is an interesting situation. When I test it against php-textile, the output looks wrong:

<p><ol>
    <li>xxx</li>
    <li>yyy</li><br />
</ol><br />
<strong>blah</strong></p>

I'm assuming the output you're expecting would be the same except the second li is: yyy<br />\n<strong>blah</strong> Is that correct? For the record, you can achieve the same output by using the strong tags in the input: # xxx\n# yyy\n<strong>blah</strong> I hope that helps for the time being.

For the time being, I'm going to shoot for matching textile as it currently is, but I'll forward this bug along to them as well.

netcarver commented 7 years ago

Sounds similar to this php-textile issue which has been known for a while but has not been solved to our satisfaction yet.

barsch commented 7 years ago

@ikirudennis: I don't care to much about the output as long it doesn't raises an exception - one could argue for both cases but I'm no textile expert ...

ikirudennis commented 7 years ago

@barsch, if you need a quick solution, you can use the develop branch, and it won't raise an error for you.

The change I made works for now, but it's a little quick-and-dirty... more concerned with not raising an error than with producing correct output. I'm going to think on it, and look into the underlying issue in php-textile to see if I can reason out something better.

barsch commented 7 years ago

thanks for the quick fix

jarshwah commented 7 years ago

FYI - we're seeing the same issue with the following string on 2.3.7:

u'*Highlights*\n\n* UNITEK Y-3705A Type-C Universal DockingStation Pro\n* USB3.0/RJ45/EARPHONE/MICROPHONE/HDMI 6 PORT HUB 1.2m Data Cable 5V 4A Power Adaptor\n*\n* Dimensions: 25cm x 13cm x 9cm\n* Weight: 0.7kg'

Removing the extraneous \n*(After "Power Adapter") fixes this particular string.

ikirudennis commented 7 years ago

@jarshwah I ran your test through php-textile and got some strange output, so it looks like your example may just be incorrect textile formatting. If you're looking to have that blank <li> the following would be a good way to accomplish that:

*Highlights*

* UNITEK Y-3705A Type-C Universal DockingStation Pro
* USB3.0/RJ45/EARPHONE/MICROPHONE/HDMI 6 PORT HUB 1.2m Data Cable 5V 4A Power Adaptor
* &nbsp;
* Dimensions: 25cm x 13cm x 9cm
* Weight: 0.7kg

That produces the following in both php- and python-textile:

<p><strong>Highlights</strong></p>

<ul>
    <li><span class="caps">UNITEK</span> Y-3705A Type-C Universal DockingStation Pro</li>
    <li>USB3.0/RJ45/EARPHONE/MICROPHONE/HDMI 6 <span class="caps">PORT</span> <span class="caps">HUB</span> 1.2m Data Cable 5V 4A Power Adaptor</li>
    <li>&nbsp;</li>
    <li>Dimensions: 25cm x 13cm x 9cm</li>
    <li>Weight: 0.7kg</li>
</ul>

Is that more like what you're expecting?

jarshwah commented 7 years ago

@ikirudennis the text was wrong. The blank li always had a floating literal * before 2.3.0, but now crashes instead. The text being rendered comes from an API so it's not so straight forward to just change the text. I've resorted to wrapping the call to textile in a try/catch and rendering the text verbatim on failure.

I don't know what the solution should be for python-textile. On one hand, crashing on bad input isn't a bad idea. On the other, bad input bringing a production site down (because that's where we get our interesting text from) isn't fun either.

If this was my own project, and I wasn't aiming for compatibility with another project, I'd probably add a strict=True arg and default to crashing. But that's easy to suggest from the sidelines :)

ikirudennis commented 7 years ago

Firstly, thanks again for your response. I always feel bad when this thing causes users to create workarounds

I'm not disagreeing with anything you've said. A third-party text-processing package shouldn't crash a production system due to bad input. It should fail a little more gracefully. And your idea of a strict option is a good one. I'll see about adding that in. And I think it would be helpful to add a note in the README about running old textile through the strict mode in a testing environment before it rudely takes down production. </notetoself>

That being said, figuring out the appropriate way to gracefully fail is the real trick here. I've run your example through textile v2.2.2 and it produces the following:

    <p><strong>Highlights</strong></p>

    <ul>
        <li><span class="caps">UNITEK</span> Y-3705A Type-C Universal DockingStation Pro</li>
        <li><span class="caps">USB</span>3.0/RJ45/EARPHONE/MICROPHONE/HDMI 6 <span class="caps"
>PORT</span> <span class="caps">HUB</span> 1.2m Data Cable 5V 4A Power Adaptor</li>
    </ul>
*
    <ul>
        <li>Dimensions: 25cm x 13cm x 9cm</li>
        <li>Weight: 0.7kg</li>
    </ul>

Maybe not my cup of tea, but it is valid html nonetheless. txstyle.org adds a <br /> onto the asterisk. Do you have any preference? I'll see what I can cook up, given that I'm giving the list section of this a complete overhaul. And the plan is to push my overhaul up to php-textile.

I've also run it through a few different markdown parsers (since the list format is the same), and I get different results on each of them. Some treat the asterisk as text and part of the list item content of the item above it, others treat it as just an empty list item like my &nbsp; hack. This illustrates one of the things I dislike about markdown: everyone's got their own flavor of it. That's why I try to aim for matching what the php version does, even if I disagree with its output sometimes.

I'm going to apologize about version 2.3.8. While it doesn't crash, it ignores the part of the list after the lone asterisk. One step forward, two steps back. So, for you, I wouldn't recommend using it. I'll see what I can do, but I'm just going to say that I wish I had more time I could devote to this project. Please bear with me.

ikirudennis commented 7 years ago

@jarshwah, I stumbled upon an interesting development, which may or may not help you. When I test your example, as I mentioned before, it doesn't error out, but ignores everything after the lone asterisk. However, if I run it through the pytextile cli tool, it produces something much closer to what you may be looking for:

    <p><strong>Highlights</strong></p>

    <ul>
        <li><span class="caps">UNITEK</span> Y-3705A Type-C Universal DockingStation Pro</li>
        <li>USB3.0/RJ45/EARPHONE/MICROPHONE/HDMI 6 <span class="caps">PORT</span> <span class="caps">HUB</span> 1.2m Data Cable 5V 4A Power Adaptor<br />
*</li>
        <li>Dimensions: 25cm x 13cm x 9cm</li>
        <li>Weight: 0.7kg</li>
    </ul>

So, maybe you can change your catch to run it through the cli tool on error? Does that help? I was thinking that there's also a chance that the following may clear resolve the issue as well:

# assuming the test text is textile_input
if '\n*\n*' in textile_input:
    textile_input.replace('\n*\n*', '\n* &nbsp;\n*')

Please let me know either way if that helps or hurts things.

jarshwah commented 7 years ago

Thanks for the follow up @ikirudennis I appreciate the thought you're putting into this.

Regarding your workarounds, I'm probably unlikely to call out to the cli tool. The text replacement is more my cup of tea - and something I could definitely incorporate as pre-validation within the try/catch. I'm still going to wrap in a try/catch to guard against future issues, but that's probably good practise anyway.

Maybe not my cup of tea, but it is valid html nonetheless. txstyle.org adds a
onto the asterisk. Do you have any preference?

As you've pointed out, the output isn't going to be nice either way, because the input is fundamentally wrong. I think I'd slightly prefer a &nbsp; since that is likely the intended result, but as long as the text before and after the empty li is preserved as a list (or two lists) I think that's ok.

This illustrates one of the things I dislike about markdown: everyone's got their own flavor of it. That's why I try to aim for matching what the php version does, even if I disagree with its output sometimes.

Very reasonable. Especially if you're planning to (try to) upstream what you think are good changes.

I'll see what I can do, but I'm just going to say that I wish I had more time I could devote to this project. Please bear with me.

The unglamorous parts of OSS :) I'd just like to say that whatever pace you approach this with is fine by me. I'm in no hurry (we can stay on an older version or workaround with approaches given), so there's no pressure coming from my side. Happy to help out where I can.