mideind Tokenizer issues

A tokenizer for Icelandic text

Other

27 stars 6 forks

issues

A dot added in dates

#52 starkadur opened 5 days ago
3
detokenize and correct_spaces problem with hyphens and En dashes

#51 atlijas opened 3 weeks ago
0
Fix/colon time correct spaces

#50 gardarjuto closed 4 weeks ago
1
Modernization

#49 sveinbjornt closed 1 month ago
1
Fix cast

#48 vthorsteinsson closed 6 months ago
0
added handling for abbreviations

#47 thorunna closed 11 months ago
5
pkg_resources is deprecated

#46 sveinbjornt closed 1 month ago
0
pyproject.toml

#45 sveinbjornt closed 1 year ago
0
correct_spaces incorrectly inserts spaces into abbreviations

#44 atlijas opened 1 year ago
1
Bump min Python version to 3.7.

#43 HaukurPall closed 1 year ago
5
Add a META_BEGIN token kind

#42 peturorri closed 2 years ago
1
Refactoring raw token generation to better support long input text.

#41 HaukurPall closed 2 years ago
0
The tokenizer is slow when the input string is long.

#40 HaukurPall closed 2 years ago
2
Puncterrors

#39 Holado closed 2 years ago
0
Ospl

#38 Holado closed 2 years ago
0
Add month

#37 sigurdurb closed 3 years ago
3
Two dots

#36 starkadur closed 2 years ago
1
Token stream wrapper

#35 sultur closed 3 years ago
0
Spanfix

#34 Holado closed 3 years ago
0
Fix split_into_sentences ']]' bug

#33 sultur closed 3 years ago
1
Character omitted

#32 starkadur closed 2 years ago
0
Support colon-separated duration?

#31 sveinbjornt opened 3 years ago
0
Number tokenization

#30 sultur closed 3 years ago
0
Issuefixes

#29 Holado closed 3 years ago
0
Bigger ordinal numbers in the tokenizer

#28 helga-lvl closed 3 years ago
1
A few abbreviation definitions updated so they no longer end sentences

#27 Holado closed 3 years ago
0
Abbrevchanges

#26 Holado closed 3 years ago
0
The tokenizer is missing some abbreviations

#25 helga-lvl closed 3 years ago
3
split_into_sentences changes sentences

#24 bnika closed 2 years ago
7
Not enough test coverage

#23 peturorri opened 3 years ago
0
Onesentperline

#22 Holado closed 3 years ago
0
Spaces deleted

#21 starkadur closed 2 years ago
3
Feature/nondestructive tokenization

#20 peturorri closed 3 years ago
0
Use env markers in setup.py dependency declaration

#19 jokull closed 4 years ago
1
Twitter handles and @usernames can contain periods (@matur.a.mbl) but are broken into sentences

#18 sveinbjornt closed 3 years ago
1
Abbrevfix

#17 Holado closed 4 years ago
0
Support for citation characters

#16 sveinbjornt opened 4 years ago
0
Hyphens separated from word

#15 starkadur closed 4 years ago
1
Can this tokenizer be used for English Language also?

#14 Dhanachandra closed 4 years ago
1
Detokenization adds spaces to "o.s.frv."

#13 HaukurPall closed 4 years ago
3
UnboundLocalError: local variable 'unit' referenced before assignment

#12 HaukurPall closed 4 years ago
1
Inconsistent application of abbreviation expansion

#11 HaukurPall closed 4 years ago
3
Version 2.0 from wabbrevs branch

#10 vthorsteinsson closed 4 years ago
0
Command line tool; version 2.0.0

#9 vthorsteinsson closed 4 years ago
0
Domains

#8 sveinbjornt closed 5 years ago
0
Recognise plus-minus sign (±) as punctuation.

#7 sveinbjornt closed 5 years ago
0
Various tokenizer fixes/improvements

#6 sveinbjornt closed 5 years ago
0
Support for unicode vulgar fractions (e.g. ⅔)

#5 sveinbjornt closed 5 years ago
0
Tokenize() options

#4 pallih closed 5 years ago
2
Added token type for numbers with trailing letters

#3 sveinbjornt closed 6 years ago
3