Closed PhilipMay closed 1 year ago
I think the aim should be to tokenize that as:
[
https://one_link.com
]
(
https://other_link.com
)
.
For the general use case of tokenizing Markdown text, I would suggest converting the Markdown to HTML and using tokenize_xml instead of tokenize_text. Here’s a quick example on the command line using pandoc:
echo "This is a Markdown link: [https://one_link.com](https://other_link.com)." | pandoc | somajo-tokenizer -tx --split-sentences -
<p>
This regular
is regular
a regular
Markdown regular
link regular
: symbol
<a href="https://other_link.com">
https://one_link.com URL
</a>
. symbol
</p>
Or, if you add --strip-tags (note that this removes the link to https://other_link.com):
This regular
is regular
a regular
Markdown regular
link regular
: symbol
https://one_link.com URL
. symbol
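For anyone who prefers the Python API over the command line, the same idea might look roughly like the sketch below. It assumes the HTML has already been produced (for example by pandoc, as above), that the default de_CMC model is wanted, and that `<p>` should count as a sentence boundary. The helper name `tokenize_html` is made up for illustration; `SoMaJo`, `tokenize_xml`, `strip_tags`, `token.text` and `token.token_class` are taken from the SoMaJo documentation as I understand it, so please check them against the version you have installed.

```python
def tokenize_html(html, strip_tags=False):
    """Tokenize an HTML/XML string with SoMaJo; return (text, token_class) pairs.

    Hypothetical helper, not part of SoMaJo itself.
    """
    # Imported inside the function so the module loads even without SoMaJo.
    from somajo import SoMaJo

    tokenizer = SoMaJo("de_CMC", split_sentences=True)
    # Assumption: <p> elements mark sentence boundaries in pandoc-style HTML.
    sentences = tokenizer.tokenize_xml(html, ["p"], strip_tags=strip_tags)
    # Flatten the sentences into one list of (token text, token class) pairs.
    return [(tok.text, tok.token_class) for sent in sentences for tok in sent]
```

Calling `tokenize_html(html, strip_tags=True)` on the pandoc output for the example sentence should give the second listing above, i.e. without the `<a href="https://other_link.com">` tag.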
Yes. @tsproisl, thanks for the workaround... but do you think this is a bug in SoMaJo?
I am more interested in a fix than in a workaround to be honest. :-)
Yes, I said the aim should be to tokenize the markdown link as you suggested, i.e. I would consider it a bug ;o). The solution I opted for is to disallow square brackets in URLs (according to RFC 1738 they "must always be encoded" anyway).
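To illustrate why excluding square brackets helps, here is a minimal standard-library sketch. It is not SoMaJo's actual URL pattern: the regex, the `tokenize` function, and the extra exclusion of parentheses and trailing sentence punctuation are simplifying assumptions made only for this example.

```python
import re

# Hypothetical URL pattern (not SoMaJo's): a scheme followed by characters that
# are not whitespace, square brackets or parentheses. The last character must
# also not be sentence punctuation, so a trailing "." stays a separate token.
URL = r"https?://[^\s\[\]()]*[^\s\[\]().,;:!?]"

# Try URLs first, then words, then any single non-space symbol.
TOKEN = re.compile(rf"{URL}|\w+|[^\w\s]")


def tokenize(text):
    """Return the tokens of text as a flat list of strings."""
    return TOKEN.findall(text)


print(tokenize("This is a Markdown link: [https://one_link.com](https://other_link.com)."))
```

With this pattern the brackets, both URLs and the final period come out as separate tokens, which is the tokenization suggested at the top of this thread.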
Many thanks. Could you do a new release now that this is fixed please?
I’ve just released v2.3.1 containing the fix! (I wanted to re-run the evaluations on the test corpora first.)
Many thanks! :-)
Hi. I have this text:
This is a Markdown link: [https://one_link.com](https://other_link.com).
And split it with SoMaJo:
Result is:
IMHO this shows a bug in how the Markdown link is split.
The "." should not be part of the link. The brackets should not be part of a link either. And it is not one link but two...
What do you think?