[Bug] Non strict loading not working

joeyaurel / python-gedcom

Python module for parsing, analyzing, and manipulating GEDCOM files

https://gedcom.joeyaurel.dev

GNU General Public License v2.0


[Bug] Non strict loading not working #47

Open AKorets opened 4 years ago

AKorets commented 4 years ago

Describe the bug: I have an example file whose encoding crashes the loading process. The call gedcom_parser.parse_file(file_path, False) # Disable strict parsing raises the exception shown below. The example file is attached as one_person_myheritage.rename to ged.log.

To Reproduce

Download the attached one_person_myheritage.rename to ged.log
Rename the file to one_person_myheritage.ged

Run these Python lines:

from gedcom.parser import Parser

gedcom_parser = Parser()
gedcom_parser.parse_file("one_person_myheritage.ged", False)  # Disable strict parsing

The exception is UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 206: invalid continuation byte

Expected behavior: With strict set to False, the parser should not raise this exception.

Additional context: The fix is expected to be in the line last_element = self.__parse_line(line_number, line.decode('utf-8-sig'), last_element, strict) in the function def parse(self, gedcom_stream, strict=True):

mikaelho commented 4 years ago

Whether this is a bug in the strict option or something else is debatable.

I have the same issue, and it is caused by MyHeritage splitting CONC lines in the middle of a two-byte UTF-8 character. Each half of the resulting line can obviously no longer be decoded as UTF-8.
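To illustrate the failure mode, here is a minimal sketch. The sample word and the split point are made up for the example and are not taken from the attached file; try_decode is a hypothetical helper.

```python
# Hypothetical sample: a Cyrillic word encoded as UTF-8 (two bytes per letter).
payload = "Иван".encode("utf-8")

# Cut after an odd number of bytes, as a byte-based line splitter might do.
first_half, second_half = payload[:3], payload[3:]


def try_decode(chunk: bytes) -> str:
    """Return the decoded text, or a short description of the decode error."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


first_result = try_decode(first_half)     # ends with a dangling lead byte
second_result = try_decode(second_half)   # starts with an orphaned continuation byte
rejoined = try_decode(first_half + second_half)

print(first_result)
print(second_result)
print(rejoined)
```

Neither half decodes on its own, but the concatenation of the two does, which is why rejoining the CONC payloads before decoding works around the problem.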

I am hacking around this by catching the Unicode exceptions and, in the case of a CONC line, concatenating the line with the next one (discarding the extra line break and the extra CONC tag on the next line), then trying again.

        lines = iter(gedcom_file)
        conc_tag = b' CONC '

        for line in lines:
            take_next = True
            while take_next:
                try:
                    line = line.decode('utf-8-sig')
                    take_next = False
                except UnicodeDecodeError:
                    if conc_tag in line:
                        # A character was split across a CONC boundary: append the
                        # payload of the next line and retry the decode
                        next_line = next(lines)
                        next_payload = next_line[next_line.find(conc_tag) + len(conc_tag):]
                        # line[:-2] drops the trailing '\r\n' (assumes CRLF line endings)
                        line = line[:-2] + next_payload
                    else:
                        raise
            last_element = self.__parse_line(line_number, line, last_element, strict)
            line_number += 1

slavkoja commented 3 years ago

I can confirm that this can happen when MyHeritage splits CONC lines inside UTF-8 characters, but I ran into it with an export from webtrees, where it happens in the middle of a line. Of course, the webtrees export problem may be caused by an earlier import of MyHeritage's broken UTF-8, but either way, your hack doesn't help with this.

I would suggest catching the UnicodeDecodeError and re-raising it with the line number added, so that one can manually investigate and fix the file, because the current error lacks any information about where the problem occurs.
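A minimal sketch of that suggestion, assuming the file is read line by line as bytes. The function name decode_gedcom_lines and the wording of the added message are hypothetical and not part of python-gedcom.

```python
import io


def decode_gedcom_lines(stream, encoding="utf-8-sig"):
    """Yield (line_number, text) pairs; on failure, re-raise with the line number."""
    for line_number, raw in enumerate(stream, start=1):
        try:
            yield line_number, raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # Rebuild the same exception type, appending the line number to the reason.
            raise UnicodeDecodeError(
                exc.encoding, exc.object, exc.start, exc.end,
                f"{exc.reason} (GEDCOM line {line_number})",
            ) from exc


# Made-up three-line sample whose last line ends in a dangling lead byte.
sample = io.BytesIO(b"0 HEAD\r\n1 NOTE ok\r\n2 CONC bad \xd0\r\n")
message = None
try:
    list(decode_gedcom_lines(sample))
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

Re-raising the same exception type keeps existing except clauses working while the message now points at the offending line.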

AKorets commented 3 years ago

Can you attach the shortest possible example of the issue? Maybe there is an easy workaround that I can suggest.

slavkoja commented 3 years ago

I am sorry, too late ;-)

I fixed the problems and deleted the broken files.

rjsdotorg commented 2 years ago

I ran into this with ellipsis and "dot" characters in "notes" fields. UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 24: invalid start byte

I solved it in parser.py line 144 via:

        for line in gedcom_file:
            try:
                last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
            except UnicodeDecodeError:
                if not strict:
                    # Report where the bad bytes are, then retry with replacement
                    print('UnicodeDecodeError found:', line_number, line)
                    try:
                        last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='replace'), last_element, strict)
                    except Exception:
                        print('  replace error:', line_number, line)
                        raise
                else:
                    raise
            line_number += 1

so that strict=False now replaces undecodable bytes with the U+FFFD replacement character (what errors='replace' inserts when decoding). It also tells you where the problem was so that you can fix it in the original database.
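As a small illustration of what errors='replace' does with a byte like 0x85, and of one possible alternative: the sketch below assumes the stray byte came from a Windows-1252 source, where 0x85 is the ellipsis character. That assumption is a guess about the data, and decode_lenient is a hypothetical helper, not part of python-gedcom.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8; on failure fall back to Windows-1252, replacing unmapped bytes."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: stray high bytes come from a Windows-1252 source.
        return raw.decode("cp1252", errors="replace")


# Made-up payload ending in the byte reported in the error above.
raw_line = b"wait\x85"

replaced = raw_line.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
fallback = decode_lenient(raw_line)                     # bad byte read as cp1252

print(repr(replaced))
print(repr(fallback))
```

The fallback decodes the whole line as Windows-1252, so a line that mixes valid multi-byte UTF-8 with a stray Windows-1252 byte would come out garbled; in that case the replace approach is the safer choice.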

nkapyrin commented 2 years ago

If the MyHeritage export chops up your Cyrillic UTF-8 characters, they are very easy to reconstruct. I suggest using the code below together with the excellent solution suggested by rjsdotorg, which you could leave in for debugging purposes.

    err_flag = 0  # add this (custom code for Cyrillic export from MyHeritage)
    for line in gedcom_file:
        # add this (custom code for Cyrillic export from MyHeritage)
        # if the previous line ended in D0 or D1, put that lead byte back in front of
        # the first payload byte (index 7 assumes a single-digit level, e.g. b'2 CONC ')
        if err_flag != 0:
            line = line[:7] + bytes([err_flag]) + line[7:]
        # if the new line ends in D0 or D1 (+ \r\n), remove that byte and set the flag
        if len(line) >= 3 and line[-3] in (0xD0, 0xD1):
            err_flag = line[-3]
            line = line[:-3] + line[-2:]
        else:
            err_flag = 0
        # END of custom code for Cyrillic export from MyHeritage

        # now back to https://github.com/nickreynke/python-gedcom/issues/47#issuecomment-980783824
        try:
            last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
            # etc ...
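The same idea can be sketched for any UTF-8 character, not only the Cyrillic lead bytes D0 and D1. This is a rough standalone sketch under several assumptions: lines end in CRLF, the continuation line carries a CONC tag, and any carried bytes are dropped if the next line has no such tag. The names incomplete_tail_length and rejoin_split_characters are hypothetical.

```python
def incomplete_tail_length(data: bytes) -> int:
    """Return how many trailing bytes form an unfinished UTF-8 sequence (0 if none)."""
    # Look back at most 3 bytes for a lead byte whose sequence is cut short.
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:   # continuation byte, keep looking back
            continue
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            return 0              # ASCII byte: nothing is cut
        return back if back < needed else 0
    return 0


def rejoin_split_characters(lines, eol=b"\r\n", tag=b" CONC "):
    """Move the bytes of a character cut at a line end to the next CONC payload."""
    carry = b""
    for line in lines:
        if carry:
            # Put the carried bytes in front of the continuation line's payload.
            pos = line.find(tag)
            if pos != -1:
                cut = pos + len(tag)
                line = line[:cut] + carry + line[cut:]
            carry = b""           # dropped if the line has no CONC tag
        body = line[:-len(eol)] if line.endswith(eol) else line
        ending = line[len(body):]
        tail = incomplete_tail_length(body)
        if tail:
            carry, body = body[-tail:], body[:-tail]
        yield body + ending


# Made-up sample: a word split in the middle of its second letter.
word = "Привет".encode("utf-8")
broken = [b"1 NOTE " + word[:3] + b"\r\n", b"2 CONC " + word[3:] + b"\r\n"]
decoded = [line.decode("utf-8") for line in rejoin_split_characters(broken)]
print(decoded)
```

The generator could wrap gedcom_file in the loop above, so that every line reaching the decode step already holds whole characters.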