rug-compling / Alpino

Alpino parser and related tools for Dutch
GNU Lesser General Public License v2.1

Compact corpora: fix handling of file names with spaces #2

Closed danieldk closed 4 years ago

danieldk commented 4 years ago

Compact corpus index entries were read using any whitespace as the field delimiter. This breaks for entries whose file names contain spaces. For example, in

50 gr geroosterd wit en zwart sesamzaad.p.1.s.1.xml     A       1e

'gr' is interpreted as the offset and 'geroosterd' as the size. Since both are also valid base64 strings, parsing does not fail but yields garbage offsets and sizes. Fix this by using only tabs as delimiters.
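
As a rough illustration of the difference (not the actual Alpino code), the Python sketch below splits the example entry both ways. It assumes the three fields (name, offset, size) are separated by single tab characters. The function names `split_on_whitespace` and `parse_index_entry` are hypothetical, and the offset and size are left as encoded strings rather than decoded.

```python
def split_on_whitespace(line):
    """Old behaviour (illustrative): split on any run of whitespace."""
    return line.split()


def parse_index_entry(line):
    """Tab-only parsing (illustrative): returns (name, offset, size).

    Assumes exactly three tab-separated fields. The offset and size
    are returned as the raw encoded strings; decoding is out of scope.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    name, offset, size = fields
    return name, offset, size


# The example entry, assuming tabs between the three fields.
entry = "50 gr geroosterd wit en zwart sesamzaad.p.1.s.1.xml\tA\t1e\n"

# Whitespace splitting breaks the file name apart, so the second and
# third tokens are mistaken for the offset and size.
broken = split_on_whitespace(entry)[:3]

# Tab splitting keeps the file name, spaces included, in one field.
fixed = parse_index_entry(entry)

print(broken)
print(fixed)
```

With tab-only splitting, spaces in the file name no longer matter. A file name containing a literal tab would still be misparsed, which the field-count check at least reports instead of silently accepting.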