rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

apply_bpe.py doubles empty lines #48

Closed emjotde closed 6 years ago

emjotde commented 6 years ago

Hi, it seems apply_bpe.py duplicates empty lines. Minimal example:

    echo -e '\n\n' | wc -l
    3

The same input piped through the script yields twice as many lines:

    echo -e '\n\n' | ./subword-nmt/apply_bpe.py -c bpe.codes | wc -l
    6

Can you reproduce this?

emjotde commented 6 years ago

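I believe this fixes it. The problem is that for empty lines, leading_whitespace and trailing_whitespace overlap: lstrip() and rstrip() each strip the entire line, so the same newline is written once as leading and once as trailing whitespace. The length check in the first if should fix that.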

    def process_line(self, line):
        """segment line, dealing with leading and trailing whitespace"""

        out = ""

        leading_whitespace = len(line)-len(line.lstrip())
        # only emit leading whitespace if the line has non-whitespace content;
        # otherwise it is the same span as the trailing whitespace below
        if leading_whitespace and len(line.lstrip()):
            out += line[:leading_whitespace]

        out += self.segment(line)

        trailing_whitespace = len(line)-len(line.rstrip())
        if trailing_whitespace:
            out += line[-trailing_whitespace:]

        return out
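
To check the effect without a codes file, here is a minimal standalone sketch. It assumes the pre-fix version differed only in lacking the length check, and it replaces self.segment with a hypothetical stand-in (`segment`) that just normalizes whitespace instead of applying BPE merges. The names `process_line_old` and `process_line_fixed` are illustrative and not part of subword-nmt.

```python
def segment(line):
    """Hypothetical stand-in for BPE.segment: no merges, only strips the line."""
    return " ".join(line.strip().split())


def process_line_old(line):
    """Assumed pre-fix behaviour: leading whitespace is emitted unconditionally."""
    out = ""
    leading_whitespace = len(line) - len(line.lstrip())
    if leading_whitespace:
        out += line[:leading_whitespace]
    out += segment(line)
    trailing_whitespace = len(line) - len(line.rstrip())
    if trailing_whitespace:
        out += line[-trailing_whitespace:]
    return out


def process_line_fixed(line):
    """Proposed fix: skip leading whitespace when the line is whitespace-only."""
    out = ""
    stripped = line.lstrip()
    leading_whitespace = len(line) - len(stripped)
    if leading_whitespace and stripped:
        out += line[:leading_whitespace]
    out += segment(line)
    trailing_whitespace = len(line) - len(line.rstrip())
    if trailing_whitespace:
        out += line[-trailing_whitespace:]
    return out


if __name__ == "__main__":
    # echo -e '\n\n' produces three empty lines
    lines = ["\n", "\n", "\n"]
    old = "".join(process_line_old(l) for l in lines)
    new = "".join(process_line_fixed(l) for l in lines)
    print(old.count("\n"))  # 6: each newline is emitted twice
    print(new.count("\n"))  # 3: one newline per input line
```

Lines with actual content, such as "  hello world \n", come out identical from both versions in this sketch, so the check should only change the behaviour for whitespace-only lines.
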

rsennrich commented 6 years ago

thanks; fixed.