UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte

songyuc commented 4 years ago

Hi all, I am trying to use the scripts in this repo to preprocess the im2latex dataset, but I ran into this error:

2020-08-26 17:16:23,199 root INFO Script being executed: scripts/preprocessing/preprocess_formulas.py
Traceback (most recent call last):
  File "scripts/preprocessing/preprocess_formulas.py", line 87, in <module>
    main(sys.argv[1:])
  File "scripts/preprocessing/preprocess_formulas.py", line 65, in main
    for line in fin:
  File "/home/songyuc/software/python/anaconda/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte

So, how can I solve this? Any answer or idea will be appreciated!

da03 commented 4 years ago

Hmm, I think using Python 2.7 will solve this, or

try with io.open(file_path_dest, "r", encoding='ascii')?

songyuc commented 4 years ago

@da03, oh, it worked! Thanks a lot!

songyuc commented 4 years ago

Hi @da03, I want to confirm: is the preprocessing in this repo the same as the process used in the paper, Image-to-Markup Generation with Coarse-to-Fine Attention?

da03 commented 4 years ago

Yes, it's the same. You can also find the processed data at http://lstm.seas.harvard.edu/latex/data/

songyuc commented 4 years ago

Wow, that is great. I hope to build on your work in my own research. And I guess these two .gz files are the same, am I right? [Screenshot: 2020-08-27 15-17-33]

TITC commented 2 years ago

with io.open(file_path_dest,"r",encoding='ascii')

This still does not work in a Python 3.7 environment.

Before the change:

    with open(temp_file, 'w') as fout:
        prepre = open(output_file, 'r').read().replace('\r', ' ')  # delete \r
        # replace split, align with aligned
        prepre = re.sub(r'\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{aligned}\2\\end{aligned}', prepre, flags=re.S)
        prepre = re.sub(r'\\begin{(smallmatrix)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{matrix}\2\\end{matrix}', prepre, flags=re.S)
        fout.write(prepre)

After the change:

    with open(temp_file, 'w') as fout:
        # prepre = open(output_file, 'r').read().replace('\r', ' ')  # delete \r
        prepre = io.open(output_file, 'r', encoding='ascii').read().replace(
            '\r', ' ')  # delete \r
        # replace split, align with aligned
        prepre = re.sub(r'\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{aligned}\2\\end{aligned}', prepre, flags=re.S)
        prepre = re.sub(r'\\begin{(smallmatrix)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{matrix}\2\\end{matrix}', prepre, flags=re.S)
        fout.write(prepre)

The resulting error:

2022-04-23 16:52:56,976 root  INFO     Script being executed: preprocess_formulas.py
2022-04-23 16:52:56,976 root  INFO     Script being executed: preprocess_formulas.py
Traceback (most recent call last):
  File "preprocess_formulas.py", line 103, in <module>
    main(sys.argv[1:])
  File "preprocess_formulas.py", line 66, in main
    prepre = io.open(output_file, 'r', encoding='ascii').read().replace(
  File "/home/yhtao/anaconda3/envs/latex_ocr/lib/python3.7/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 854136: ordinal not in range(128)
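
The ascii codec rejects any byte above 0x7f, so it fails on the same kind of 0xe7 byte that the utf-8 codec rejected earlier in this thread. One way to see what is actually in the file is to read it in binary mode and look at the bytes around each non-ASCII value. A minimal sketch: `non_ascii_context` is a hypothetical helper, not part of this repo, and the sample bytes are made up for illustration.

```python
def non_ascii_context(data, width=10, limit=5):
    """Return (offset, hex value, surrounding bytes) for the first `limit` non-ASCII bytes."""
    hits = []
    for offset, value in enumerate(data):
        if value > 0x7F:  # anything outside the 7-bit ASCII range
            start = max(0, offset - width)
            hits.append((offset, hex(value), data[start:offset + width + 1]))
            if len(hits) >= limit:
                break
    return hits

# Made-up sample; for a real file, read it with open(path, "rb") instead.
sample = b"x = \\alpha + caf\xe7 \\beta"
hits = non_ascii_context(sample, width=4)
for offset, value, context in hits:
    print(offset, value, context)
```

Looking at the context usually makes it clear whether the stray bytes are accented characters in a legacy encoding or plain corruption, which helps when choosing an encoding to pass to io.open.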

Yuxiang1995 commented 1 year ago

@TITC, this works for me: io.open(output_file, 'r', encoding='latin-1')
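
For context on why latin-1 gets past the error, here is a small sketch (the sample bytes are made up, not taken from the im2latex data). The byte 0xe7 is "ç" in Latin-1. In UTF-8 it opens a three-byte sequence and must be followed by continuation bytes, and it is outside ASCII's 0 to 127 range, so both of those codecs raise. Latin-1 maps every byte value to a code point, so decoding never fails, though the resulting characters are only correct if the file really is Latin-1.

```python
# Hypothetical sample: the word "façade" encoded as Latin-1 (not from the dataset).
raw = b"fa\xe7ade"

def try_decode(data, encoding):
    """Return the decoded text, or the exception class name if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__

results = {enc: try_decode(raw, enc) for enc in ("utf-8", "ascii", "latin-1")}
print(results)

# errors="replace" also avoids the crash: each undecodable byte becomes U+FFFD.
lossy = raw.decode("utf-8", errors="replace")
print(lossy)
```

If the non-ASCII bytes do not matter for the formulas, passing errors='replace' or errors='ignore' to io.open is an alternative, at the cost of losing those characters.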

harvardnlp / im2markup

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte #41