Open egrefen opened 10 years ago
Thanks for the heads up. If you find a repro case, I'll fix it. Ideally by just junking the whole thing...
On Mon, Dec 2, 2013 at 1:42 PM, Edward Grefenstette <notifications@github.com> wrote:
I've observed that corpus/tokenize-anything.sh does not necessarily preserve the number of lines on the files it is fed. I'll try and isolate the problem and give more details.
Reply to this email directly or view it on GitHub: https://github.com/redpony/cdec/issues/31
I have a repro case on my computer. Will see which lines are causing trouble when I get back from dinner.
Okay, the problem seems to be with rogue DOS carriage return characters (\r) in the text rather than with tokenize-anything.sh. I think wc doesn't spot them but your script converts them to \n.
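For anyone who wants to check their own files, here is a rough Python sketch of the mismatch. It assumes, as I guessed above, that the script ends up turning each lone \r into \n; I have not confirmed that against the script itself. The function names and the sample bytes are made up for illustration and are not part of cdec.

```python
from typing import List


def count_newlines(data: bytes) -> int:
    """Count LF bytes, which is the figure `wc -l` reports."""
    return data.count(b"\n")


def count_lines_after_cr_conversion(data: bytes) -> int:
    """Count LF bytes after CRLF is collapsed to LF and each lone CR becomes LF.

    Assumption: this mimics what the tokenizer appears to do with stray CRs.
    """
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.count(b"\n")


def lines_with_stray_cr(data: bytes) -> List[int]:
    """Return 1-based numbers of LF-delimited lines holding a CR that is not part of a CRLF ending."""
    offenders = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        # A single trailing CR belongs to a CRLF line ending, so ignore it.
        body = line[:-1] if line.endswith(b"\r") else line
        if b"\r" in body:
            offenders.append(number)
    return offenders


# Hypothetical sample: line 1 hides a stray CR, line 2 ends with CRLF.
SAMPLE = b"one\rtwo\nthree\r\nfour\n"

if __name__ == "__main__":
    print(count_newlines(SAMPLE))                   # 3
    print(count_lines_after_cr_conversion(SAMPLE))  # 4
    print(lines_with_stray_cr(SAMPLE))              # [1]
```

Running lines_with_stray_cr on the raw bytes of each input file before tokenizing should point at the lines that would get split, so they can be cleaned up first.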