mideind/Tokenizer
A tokenizer for Icelandic text
Other
27 stars, 6 forks
issues
A dot added in dates
#52
starkadur
opened
5 days ago
3
detokenize and correct_spaces problem with hyphens and En dashes
#51
atlijas
opened
3 weeks ago
0
Fix/colon time correct spaces
#50
gardarjuto
closed
4 weeks ago
1
Modernization
#49
sveinbjornt
closed
1 month ago
1
Fix cast
#48
vthorsteinsson
closed
6 months ago
0
added handling for abbreviations
#47
thorunna
closed
11 months ago
5
pkg_resources is deprecated
#46
sveinbjornt
closed
1 month ago
0
pyproject.toml
#45
sveinbjornt
closed
1 year ago
0
correct_spaces incorrectly inserts spaces into abbreviations
#44
atlijas
opened
1 year ago
1
Bump min Python version to 3.7.
#43
HaukurPall
closed
1 year ago
5
Add a META_BEGIN token kind
#42
peturorri
closed
2 years ago
1
Refactoring raw token generation to better support long input text.
#41
HaukurPall
closed
2 years ago
0
The tokenizer is slow when the input string is long.
#40
HaukurPall
closed
2 years ago
2
Puncterrors
#39
Holado
closed
2 years ago
0
Ospl
#38
Holado
closed
2 years ago
0
Add month
#37
sigurdurb
closed
3 years ago
3
Two dots
#36
starkadur
closed
2 years ago
1
Token stream wrapper
#35
sultur
closed
3 years ago
0
Spanfix
#34
Holado
closed
3 years ago
0
Fix split_into_sentences ']]' bug
#33
sultur
closed
3 years ago
1
Character omitted
#32
starkadur
closed
2 years ago
0
Support colon-separated duration?
#31
sveinbjornt
opened
3 years ago
0
Number tokenization
#30
sultur
closed
3 years ago
0
Issuefixes
#29
Holado
closed
3 years ago
0
Bigger ordinal numbers in the tokenizer
#28
helga-lvl
closed
3 years ago
1
A few abbreviation definitions updated so that they do not end sentences
#27
Holado
closed
3 years ago
0
Abbrevchanges
#26
Holado
closed
3 years ago
0
The tokenizer is missing some abbreviations
#25
helga-lvl
closed
3 years ago
3
split_into_sentences changes sentences
#24
bnika
closed
2 years ago
7
Not enough test coverage
#23
peturorri
opened
3 years ago
0
Onesentperline
#22
Holado
closed
3 years ago
0
Spaces deleted
#21
starkadur
closed
2 years ago
3
Feature/nondestructive tokenization
#20
peturorri
closed
3 years ago
0
Use env markers in setup.py dependency declaration
#19
jokull
closed
4 years ago
1
Twitter handles and @usernames can contain periods (@matur.a.mbl) but are broken into sentences
#18
sveinbjornt
closed
3 years ago
1
Abbrevfix
#17
Holado
closed
4 years ago
0
Support for citation characters
#16
sveinbjornt
opened
4 years ago
0
Hyphens separated from word
#15
starkadur
closed
4 years ago
1
Can this tokenizer be used for English Language also?
#14
Dhanachandra
closed
4 years ago
1
Detokenization adds spaces to "o.s.frv."
#13
HaukurPall
closed
4 years ago
3
UnboundLocalError: local variable 'unit' referenced before assignment
#12
HaukurPall
closed
4 years ago
1
Inconsistent application of abbreviation expansion
#11
HaukurPall
closed
4 years ago
3
Version 2.0 from wabbrevs branch
#10
vthorsteinsson
closed
4 years ago
0
Command line tool; version 2.0.0
#9
vthorsteinsson
closed
4 years ago
0
Domains
#8
sveinbjornt
closed
5 years ago
0
Recognise plus-minus sign (±) as punctuation.
#7
sveinbjornt
closed
5 years ago
0
Various tokenizer fixes/improvements
#6
sveinbjornt
closed
5 years ago
0
Support for unicode vulgar fractions (e.g. ⅔)
#5
sveinbjornt
closed
5 years ago
0
Tokenize() options
#4
pallih
closed
5 years ago
2
Added token type for numbers with trailing letters
#3
sveinbjornt
closed
6 years ago
3