redpony / cdec

Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms
http://cdec-decoder.org/
Apache License 2.0

tokenize-anything.sh does not preserve number of lines #31

Open egrefen opened 10 years ago

egrefen commented 10 years ago

I've observed that corpus/tokenize-anything.sh does not necessarily preserve the number of lines on the files it is fed. I'll try and isolate the problem and give more details.

redpony commented 10 years ago

Thanks for the heads up. If you find a repro case, I'll fix it. Ideally by just junking the whole thing...

On Mon, Dec 2, 2013 at 1:42 PM, Edward Grefenstette <notifications@github.com> wrote:

> I've observed that corpus/tokenize-anything.sh does not necessarily preserve the number of lines on the files it is fed. I'll try and isolate the problem and give more details.

egrefen commented 10 years ago

I have a repro case on my computer. Will see which lines are causing trouble when I get back from dinner.

egrefen commented 10 years ago

Okay, the problem seems to be rogue DOS carriage return characters (\r) in the text rather than tokenize-anything.sh itself. I think wc doesn't count them as line breaks, but your script converts them to \n, so the output ends up with more lines than the input.
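
For anyone hitting the same mismatch, here is a minimal sketch of the effect described above and one possible workaround. It is not taken from cdec: the helper names `count_newlines` and `normalize_line_endings` are hypothetical, the `\r` to `\n` replacement is only a stand-in for whatever the tokenizer pipeline does internally, and replacing a stray `\r` with a space (rather than deleting it) is an assumption about what is wanted.

```python
def count_newlines(text: str) -> int:
    """Count lines the way `wc -l` does: by counting "\n" characters only."""
    return text.count("\n")


def normalize_line_endings(text: str) -> str:
    """Collapse CRLF to LF, then replace any remaining bare CR with a space.

    Assumption: a bare CR inside a line should not start a new line.
    """
    return text.replace("\r\n", "\n").replace("\r", " ")


# Three "\n"-terminated lines, one CRLF ending and one bare CR mid-line.
sample = "first line\r\nsecond\rhalf\nthird line\n"

# Hypothetical stand-in for a processing step that turns every CR into LF.
converted = sample.replace("\r", "\n")

print(count_newlines(sample))                          # 3
print(count_newlines(converted))                       # 5
print(count_newlines(normalize_line_endings(sample)))  # 3
```

Running the normalization over a corpus before tokenizing should keep the line count stable, on the assumption that the carriage returns are the only source of the mismatch.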