tsproisl / SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.
GNU General Public License v3.0
135 stars 21 forks source link

Markdown link splitting bug. #26

Closed PhilipMay closed 1 year ago

PhilipMay commented 1 year ago

Hi. I have this text: This is a Markdown link: [https://one_link.com](https://other_link.com).

And split it with SoMaJo:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")

paragraphs = ["This is a Markdown link: [https://one_link.com](https://other_link.com)."]

sentences = tokenizer.tokenize_text(paragraphs)
for sentence in sentences:
    for token in sentence:
        print("{}\t{}\t{}".format(token.text, token.token_class, token.extra_info))
    print()

Result is:

This    regular 
is  regular 
a   regular 
Markdown    regular 
link    regular SpaceAfter=No
:   symbol  
[   symbol  SpaceAfter=No
https://one_link.com](https://other_link.com).  URL 

IMHO this shows a bug with the split of the MD link.

The "." should not be part of the link. The brackets also not be part of a link. And it is not one link but two...

What do you think?

tsproisl commented 1 year ago

I think the aim should be to tokenize that as:

[
https://one_link.com
]
(
https://other_link.com
)
.

For the general use case of tokenizing markdown text, I would suggest to convert the markdown to HTML and to use tokenize_xml instead of tokenize_text. Here’s a quick example on the command line using pandoc:

echo "This is a Markdown link: [https://one_link.com](https://other_link.com)." | pandoc | somajo-tokenizer -tx --split-sentences -
<p>
This    regular
is      regular
a       regular
Markdown        regular
link    regular
:       symbol
<a href="https://other_link.com">
https://one_link.com    URL
</a>
.       symbol
</p>

Or, if you add --strip-tags (note that this removes the link to https://other_link.com):

This    regular
is      regular
a       regular
Markdown        regular
link    regular
:       symbol
https://one_link.com    URL
.       symbol
PhilipMay commented 1 year ago

Yes. @tsproisl thanks for the work around.... but: Do you think this is a bug in SoMaJo?

I am more interested in a fix than in a workaround to be honest. :-)

tsproisl commented 1 year ago

Yes, I said the aim should be to tokenize the markdown link as you suggested, i.e. I would consider it a bug ;o). The solution I opted for is to disallow square brackets in URLs (according to RFC 1738 they "must always be encoded" anyway).

PhilipMay commented 1 year ago

Many thanks. Could you do a new release now that this is fixed please?

tsproisl commented 1 year ago

I’ve just released v2.3.1 containing the fix! (I wanted to re-run the evaluations on the test corpora first.)

PhilipMay commented 1 year ago

I’ve just released v2.3.1 containing the fix! (I wanted to re-run the evaluations on the test corpora first.)

Many thanks! :-)