tsproisl / SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.
GNU General Public License v3.0

Other issue with Markdown style links. #28

Closed: PhilipMay closed this issue 4 months ago

PhilipMay commented 7 months ago

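Markdown links wrapped in emphasis markers, such as "*[Neubau](https://www.some-link.com)*", are not tokenized correctly: the closing parenthesis and the trailing asterisk are absorbed into the URL token.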

Code:

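from somajo import SoMaJo

# Assumption: German model; the original snippet does not show how `somajo` was created.
somajo = SoMaJo("de_CMC")

text = "*[Neubau](https://www.some-link.com)*"
sentences = somajo.tokenize_text([text])
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")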

Returns:

*   symbol  SpaceAfter=No
[   symbol  SpaceAfter=No
Neubau  regular SpaceAfter=No
]   symbol  SpaceAfter=No
(   symbol  SpaceAfter=No
https://www.some-link.com)* URL 

Should return something like this:

*   symbol  SpaceAfter=No
[   symbol  SpaceAfter=No
Neubau  regular SpaceAfter=No
]   symbol  SpaceAfter=No
(   symbol  SpaceAfter=No
https://www.some-link.com   URL SpaceAfter=No
)   symbol  SpaceAfter=No
*   symbol

Full code: https://colab.research.google.com/drive/16-CKdzp20Gin02emrLVeHfFFir2veK8M?usp=sharing

tsproisl commented 7 months ago

I’ve decided to add explicit support for Markdown links, so this should be fixed now. One caveat: it will fail if the link description contains square brackets or if the URL contains parentheses.
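
To illustrate the caveat, here is a minimal sketch of a regex-based Markdown link matcher. It is not SoMaJo's actual implementation; the pattern and the helper name `find_markdown_links` are assumptions made purely for illustration. The pattern excludes square brackets from the description and parentheses from the URL, which is one simple way to arrive at exactly the two limitations described above.

```python
import re

# Hypothetical pattern, not taken from SoMaJo's source code:
# the description may not contain [ or ], and the URL may not
# contain ( or ) or whitespace.
MARKDOWN_LINK = re.compile(r"\[([^\[\]]+)\]\(([^()\s]+)\)")


def find_markdown_links(text):
    """Return (description, url) pairs for simple Markdown links in text."""
    return [(m.group(1), m.group(2)) for m in MARKDOWN_LINK.finditer(text)]


# The link from this issue is matched; the emphasis markers stay outside.
print(find_markdown_links("*[Neubau](https://www.some-link.com)*"))

# Square brackets in the description: the link is not recognized.
print(find_markdown_links("[see [1]](https://www.some-link.com)"))

# Parentheses in the URL: the link is not recognized.
print(find_markdown_links("[Foo](https://en.wikipedia.org/wiki/Foo_(bar))"))
```

A stricter matcher would need to track nested brackets and parentheses, which a single regular expression of this kind cannot do in general. That is likely why a simple pattern-based fix comes with these two restrictions.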