Open AKorets opened 4 years ago
Whether this is a bug in the strict option or something else is debatable.
I have the same issue, and it is caused by MyHeritage splitting CONC lines in the middle of a two-byte UTF-8 character. The resulting line can obviously no longer be decoded as UTF-8.
I am hacking around this by catching the UnicodeDecodeError and, in the case of a CONC line, concatenating the line with the next line (discarding the extra line break and the extra CONC tag on the next line), then trying again.
lines = iter(gedcom_file)
for line in lines:
    take_next = True
    conc_tag = b' CONC '
    while take_next:
        try:
            line = line.decode('utf-8-sig')
            take_next = False
        except UnicodeDecodeError:
            if conc_tag in line:
                # Pull in the next line and keep only its payload after ' CONC '.
                next_line = next(lines)
                next_payload = next_line[next_line.find(conc_tag) + len(conc_tag):]
                # line[:-2] drops the trailing '\r\n'; this assumes CRLF line endings.
                line = line[:-2] + next_payload
            else:
                raise
    last_element = self.__parse_line(line_number, line, last_element, strict)
    line_number += 1
I can confirm that this can happen when MyHeritage splits a CONC line inside a UTF-8 character, but I ran into it with an export from webtrees, where it happens in the middle of a line. Of course, the webtrees export problem may have been caused by an earlier import of a broken UTF-8 file from MyHeritage, but either way your hack doesn't help in this case.
I would suggest catching the UnicodeDecodeError and re-raising it with the line number added, so that one can manually investigate and fix the file. The current error gives no information about where the problem occurs.
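For illustration, the re-raise suggested above could look roughly like this. It is only a sketch: `decode_gedcom_line` is a hypothetical helper name, not something that exists in python-gedcom, and it assumes the parser has the raw bytes and the current line number at hand.

```python
def decode_gedcom_line(raw_line: bytes, line_number: int) -> str:
    """Decode one raw GEDCOM line, adding the line number to any decode error.

    Hypothetical helper for illustration; not part of python-gedcom.
    """
    try:
        return raw_line.decode('utf-8-sig')
    except UnicodeDecodeError as exc:
        # UnicodeDecodeError takes (encoding, object, start, end, reason).
        # Reuse the original fields and extend only the reason text.
        raise UnicodeDecodeError(
            exc.encoding, exc.object, exc.start, exc.end,
            f'{exc.reason} (GEDCOM line {line_number}: {raw_line!r})',
        ) from exc
```

The exception type stays the same, so existing `except UnicodeDecodeError` handlers keep working, but the message now says which line to look at.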
Can you attach the shortest possible example of the issue? Maybe there is an easy workaround that I can suggest.
I am sorry, too late ;-)
I fixed the problems and deleted the broken files.
I ran into this with ellipsis and "dot" characters in "notes" fields.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 24: invalid start byte
I solved it in parser.py line 144 via:
for line in gedcom_file:
    try:
        last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
    except UnicodeDecodeError:
        if not strict:
            # Report the offending line, then retry with undecodable bytes replaced.
            print('UnicodeDecodeError found:', line_number, line)
            try:
                last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='replace'), last_element, strict)
            except Exception:
                print(' replace error:', line_number, line)
                raise
        else:
            raise
    line_number += 1
so that strict=False now replaces undecodable bytes with the replacement character U+FFFD (what errors='replace' inserts when decoding). It also prints the line number and the raw line, so that you can fix it in the original database.
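A side note on the 0x85 byte: in Windows-1252 that byte is the ellipsis character, which suggests (but does not prove) that those notes were written in cp1252 rather than UTF-8. Under that assumption, a fallback decode can recover the character instead of replacing it. This is only a sketch; `decode_lenient` is a made-up name, and the cp1252 guess may be wrong for other files.

```python
def decode_lenient(raw_line: bytes, fallback: str = 'cp1252') -> str:
    """Try UTF-8 first, then a single-byte fallback, then U+FFFD replacement.

    Hypothetical helper; the cp1252 fallback is an assumption about the file.
    """
    try:
        return raw_line.decode('utf-8-sig')
    except UnicodeDecodeError:
        pass
    try:
        # The whole line is reinterpreted, so any valid multi-byte UTF-8 text
        # on the same line would come out as mojibake.
        return raw_line.decode(fallback)
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes (e.g. 0x81) undefined, so this can still fail.
        return raw_line.decode('utf-8-sig', errors='replace')
```

Because the fallback applies to the entire line, this is only reasonable for lines that are otherwise plain ASCII. For mixed content, the errors='replace' approach above is the safer choice.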
If the export from MyHeritage chops up your Cyrillic characters, they are very easy to reconstruct. I suggest using the code below together with the excellent solution suggested by rjsdotorg, which you can leave in for debugging purposes.
err_flag = 0  # add this (custom code for cyrillic export from myheritage)
for line in gedcom_file:
    # add this (custom code for cyrillic export from myheritage)
    # if the prev string ended in D1 or D0, fix the 1st letter of the new string
    new_letter = 0
    if err_flag != 0:
        if err_flag == 0xD1 and line[7] >= 0x80:
            new_letter = (err_flag << 8) + line[7] - 0xcd40
        else:
            new_letter = (err_flag << 8) + line[7] - 0xcc80
        line = line[:7] + (new_letter).to_bytes(2, 'big') + line[8:]
    # if the new string ends in D0 or D1 (+\r\n), then we remove the symbol and set the flag
    if line[-3] == 0xD0 or line[-3] == 0xD1:
        err_flag = line[-3]
        line = line[:-3] + line[-2:]
    else:
        err_flag = 0
    # END of custom code for cyrillic export from myheritage
    # now back to https://github.com/nickreynke/python-gedcom/issues/47#issuecomment-980783824
    try:
        last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
    # etc ...
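The snippet above only covers the lead bytes 0xD0 and 0xD1. Note also that `int.to_bytes(2, 'big')` applied to a code point yields its UTF-16BE bytes rather than UTF-8, so it is worth checking the decoded output. A more general variant, sketched below, does no arithmetic at all: it moves the unfinished trailing bytes of a line to the start of the next CONC payload. Both function names are hypothetical, and the sketch assumes the continuation line has the form `<level> CONC <payload>`.

```python
def incomplete_utf8_tail(data: bytes) -> int:
    """Return how many trailing bytes of data form an unfinished UTF-8 sequence."""
    # Look back at most 3 bytes for a lead byte whose sequence is cut short.
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:   # continuation byte, keep looking back
            continue
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1            # plain ASCII
        return back if back < needed else 0
    return 0


def rejoin_split_utf8(lines):
    """Yield raw lines, moving bytes of a split character to the next CONC line."""
    carry = b''
    for line in lines:
        if carry:
            head, sep, payload = line.partition(b' CONC ')
            if sep and head.strip().isdigit():
                line = head + sep + carry + payload
            # If the next line is not a CONC line, the carried bytes are dropped.
            carry = b''
        body = line.rstrip(b'\r\n')
        ending = line[len(body):]
        cut = incomplete_utf8_tail(body)
        if cut:
            carry = body[-cut:]
            line = body[:-cut] + ending
        yield line
```

It could be used as `for line in rejoin_split_utf8(gedcom_file):` in place of iterating over the file directly. Since CONC payloads are concatenated without a separator, moving the bytes from one line to the next should not change the resulting text.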
Describe the bug
I have an example file whose encoding alone crashes the loading process.
gedcom_parser.parse_file(file_path, False) # Disable strict parsing
This line triggers the crash. The example file is attached: one_person_myheritage.rename to ged.log
To Reproduce
Run these Python lines:
The exception is UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 206: invalid continuation byte
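As a small illustration of how this message can arise (using a made-up two-line fragment, not the attached file): a Cyrillic letter takes two bytes in UTF-8, and if a line break lands between them, the first line ends in a lead byte such as 0xD0 followed by `\r`, which is not a valid continuation byte.

```python
# A Cyrillic letter occupies two bytes in UTF-8, e.g. 'П' is 0xD0 0x9F.
intact = '1 NOTE П'.encode('utf-8')

# Hypothetical broken export: the line break lands between the two bytes.
first_line = intact[:-1] + b'\r\n'                 # ends in the lead byte 0xD0
second_line = b'2 CONC ' + intact[-1:] + b'\r\n'   # starts with the orphaned 0x9F


def decode_error(raw: bytes):
    """Return the UnicodeDecodeError raised when decoding raw, or None."""
    try:
        raw.decode('utf-8-sig')
    except UnicodeDecodeError as exc:
        return exc
    return None


print(decode_error(first_line))    # reason: invalid continuation byte
print(decode_error(second_line))   # reason: invalid start byte
```

Both halves fail on their own, which is why decoding line by line cannot succeed on such a file without first rejoining the split bytes.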
Expected behavior
When strict is set to False, there is no reason for this exception to be raised.
Additional context
The fix is expected to go in the line
last_element = self.__parse_line(line_number, line.decode('utf-8-sig'), last_element, strict)
in the function def parse(self, gedcom_stream, strict=True):