Channel detection in posts is over-eager and breaks links that contain # character

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Post a link like: http://en.wikipedia.org/wiki/Free_software#Definition

What is the expected output? What do you see instead?
This should be treated as a link to a particular section of a Wikipedia page.
Instead, the #Definition part of the link is treated as a reference to a
channel called #Definition rather than as part of the link.

You can see an example here:
http://openku.appspot.com/user/openku/presence/a43026ffb79549d6b2d3b4fca5a64dc0


Original issue reported on code.google.com by adewale on 28 Mar 2009 at 7:37

GoogleCodeExporter commented 8 years ago

#Channel and @actor replacement is done after markdown conversion, and because
of that it needs more advanced regexps than just
r'#([a-zA-Z][a-zA-Z0-9]{%d,%d})'. It's possible to improve the regexps, but to
avoid these issues altogether it might be better to use DOM mode instead.
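
A rough sketch of what the DOM-style approach could look like, using only the standard library's html.parser. The class name, the /channel/ URL scheme and the 1..40 length bounds are assumptions for illustration, not Jaikuengine's actual code. The idea is to rewrite #channel only in text nodes that are not inside an existing <a> element:

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern; the real length bounds are filled in via %d,%d.
CHANNEL_RE = re.compile(r'#([a-zA-Z][a-zA-Z0-9]{1,40})')


class ChannelLinker(HTMLParser):
    """Re-emits the HTML, linking #channels only outside <a> elements."""

    def __init__(self):
        # Keep entity/char references untouched so they can be echoed back.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Only plain text outside links is eligible for channel linking.
        if self.anchor_depth == 0:
            data = CHANNEL_RE.sub(r'<a href="/channel/\1">#\1</a>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)


def link_channels_dom(html):
    parser = ChannelLinker()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

Because attributes and the contents of existing links are echoed back unchanged, a # inside an href or inside link text is never touched. The cost is a full parse on every call.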

Since this replacement is done every time a comment is displayed, going to DOM
could have performance implications.

What do you think: should I try to improve the regexp, or try to do something
that's failsafe?

Original comment by jonasnoc...@gmail.com on 28 Mar 2009 at 10:01

GoogleCodeExporter commented 8 years ago

A quick and dirty fix would be to check whether the '#' comes after a '\s' or at the start of a line.
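
A minimal sketch of that check, assuming a negative lookbehind is acceptable. The /channel/ URL and the 1..40 length bounds are placeholders, not the real values:

```python
import re

# '#' only starts a channel at the beginning of the text or right after
# whitespace: (?<!\S) means "not preceded by a non-whitespace character".
CHANNEL_RE = re.compile(r'(?<!\S)#([a-zA-Z][a-zA-Z0-9]{1,40})')


def link_channels(text):
    # Hypothetical link format, for illustration only.
    return CHANNEL_RE.sub(r'<a href="/channel/\1">#\1</a>', text)
```

This leaves http://en.wikipedia.org/wiki/Free_software#Definition alone. Since the replacement runs after markdown conversion, though, a channel that directly follows a tag such as <p> would be missed as well.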

Original comment by ikk...@gmail.com on 28 Mar 2009 at 2:07

GoogleCodeExporter commented 8 years ago

Take a look at this: 
http://adewale.jaiku.com/presence/5c2cf07037954357aa43c7cd5030b5a2

This shows that the real problem is that Jaiku formatting is being applied to
http links. The problem with the channels and the breaking of links that have
underscores in them are both instances of the same underlying bug.

We should be looking at some way to detect that something is an http or https
link and make sure that none of the Jaiku formatting rules are applied to it.
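
One way this could look, as a sketch only: split the raw text on anything that looks like an http or https URL and run the formatting rules on the remaining pieces. The URL pattern is deliberately crude and the function names are made up for illustration:

```python
import re

# Crude URL detector (an assumption): scheme followed by non-whitespace.
URL_RE = re.compile(r'(https?://\S+)')
CHANNEL_RE = re.compile(r'#([a-zA-Z][a-zA-Z0-9]{1,40})')


def link_channels(text):
    # Stand-in for any Jaiku formatting rule.
    return CHANNEL_RE.sub(r'<a href="/channel/\1">#\1</a>', text)


def format_outside_urls(text, formatter):
    # re.split with one capturing group returns the URLs at odd indices.
    parts = URL_RE.split(text)
    return ''.join(
        part if i % 2 else formatter(part)
        for i, part in enumerate(parts)
    )
```

For example, format_outside_urls(post, link_channels) would link #jaiku in the surrounding text but pass the Wikipedia URL through untouched. Note that \S+ also swallows trailing punctuation, so a real version would need a stricter URL pattern.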

Original comment by adewale on 29 Mar 2009 at 4:33

GoogleCodeExporter commented 8 years ago

I thought I could use python-markdown2's
<a href="http://code.google.com/p/python-markdown2/wiki/LinkPatterns">link-patterns</a>
to solve this, but that suffers from the same problem as #jaikuengine:

    def _do_link_patterns(self, text):
        """Caveat emptor: there isn't much guarding against link
        patterns being formed inside other standard Markdown links, e.g.
        inside a [link def][like this].

        Dev Notes: *Could* consider prefixing regexes with a negative
        lookbehind assertion to attempt to guard against this.
        """

Original comment by jonasnoc...@gmail.com on 29 Mar 2009 at 5:06

GoogleCodeExporter commented 8 years ago

@adewale: It's not that simple. Jaikuengine applies formatting according to
this scheme:

1. Markdown conversion: turns everything into HTML according to markdown's formatting rules.
2. Autolinking: makes links out of URLs that weren't converted during markdown conversion.
3. Actor linking: makes links out of #channels and @usernames.

* http://example.com/test_underscore_problem is supposed to be handled by step 2, but the underscores are converted to em in step 1.
* http://example.com/#test is supposed to be handled by step 2 but is messed up by step 3.
* [test](http://example.com/test_underscore_problem) works fine since conversion is handled fully in step 1.
* [test](http://example.com/#test) is supposed to be handled by step 1 but is messed up by step 3.
* [#test](http://example.com/) is supposed to be handled by step 1 but is messed up by step 3.
* [http://jaiku.com](http://test.com) is supposed to be handled by step 1 but is messed up by step 2.

etc.

Original comment by jonasnoc...@gmail.com on 29 Mar 2009 at 5:26

GoogleCodeExporter commented 8 years ago

By the way, I have a patch for the regexp that seems to work in many cases (but
not all). Should I upload that to rietku while we think about how to solve
this issue once and for all?

Original comment by jonasnoc...@gmail.com on 29 Mar 2009 at 9:05

GoogleCodeExporter commented 8 years ago

Please upload it. A partial solution with tests will move us closer to a full
solution that still passes those same tests.

Original comment by adewale on 29 Mar 2009 at 9:08

sastrabahu / jaikuengine

Channel detection in posts is over-eager and breaks links that contain # character #67