joeyaurel / python-gedcom

Python module for parsing, analyzing, and manipulating GEDCOM files
https://gedcom.joeyaurel.dev
GNU General Public License v2.0
155 stars, 39 forks

error when trying to process GEDCOM file with python-gedcom v. 0.2.0.dev #3

Closed 62mkv closed 5 years ago

62mkv commented 6 years ago

I've downloaded and installed python-gedcom v.0.2.0.dev

I run it as follows:

from gedcom import Gedcom

file_path = '7q4425_661384sh82b72570424am5.ged' # Path to your `.ged` file
gedcom = Gedcom(file_path)

print(gedcom.element_list())

This GEDCOM file starts with

0 HEAD
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED

and I get the following error:

Traceback (most recent call last):
  File "script.py", line 4, in <module>
    gedcom = Gedcom(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 148, in __init__
    self.__parse(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 224, in __parse
    last_element = self.__parse_line(line_number, line.decode('utf-8'), last_element)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 263, in __parse_line
    raise SyntaxError(error_message)
SyntaxError: Line `1` of document violates GEDCOM format
See: http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

What am I doing wrong? This GEDCOM file was exported from MyHeritage recently.

UPD: this is with Python 3.6 under Windows 10 x64

KeithPetro commented 6 years ago

What happens if you use gedcom.get_element_list() in place of gedcom.element_list()?

Edit: Also note that the current version of the project may not work with some aspects of GEDCOM 5.5.1 files (however it should definitely be able to deal with that first line).

62mkv commented 6 years ago

@keithpetro by looking at the stack trace, one can see that the script does not even reach that statement, so I'm pretty sure it will change nothing

Regarding the format, FWIW the first line is identical, so it shouldn't fail there

KeithPetro commented 6 years ago

Taking a closer look, I found that the constructor hits this error on the last line of my GEDCOM file if I remove the EOL characters. Perhaps it's an issue with what EOL characters are present in your file and/or how they are handled?

Edit: On a side note though, you should change your code to use gedcom.get_element_list(), as there is no element_list() method for the Gedcom class.

Edit: I made a quick family tree on MyHeritage and exported it for testing. I am experiencing the exact same issue you are, on the first line. What I find odd is that it appears that the line ends in a Carriage-Return and then a Line-Feed, which is completely valid (and I had no issues with a file exported from Ancestry which also used CRLF).

62mkv commented 6 years ago

@nickreynke maybe you have an example of a GEDCOM file that this version can parse successfully? Can you share it? I could analyze the difference and possibly adapt my file somehow, or update the code

KeithPetro commented 6 years ago

I've done a bit more testing and found that the first line in a regular GEDCOM file (like the one I have from Ancestry) should be simply (in byte representation):

b'0 HEAD\n'

Whereas the file from MyHeritage has:

b'\xef\xbb\xbf0 HEAD\r\n'

0xEFBBBF is the BOM (Byte Order Mark) for UTF-8. This is outside of the GEDCOM spec, and I expect that any program able to read these files has to implement an out-of-spec workaround specifically for MyHeritage files.
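
For anyone who wants to check their own file, here is a minimal sketch of the comparison above. The helper name `starts_with_utf8_bom` is made up for illustration and is not part of python-gedcom; `codecs.BOM_UTF8` is the standard-library constant for the three BOM bytes.

```python
import codecs


def starts_with_utf8_bom(first_line: bytes) -> bool:
    """Return True if the raw bytes begin with the UTF-8 byte order mark."""
    # Hypothetical helper for illustration; not part of python-gedcom.
    return first_line.startswith(codecs.BOM_UTF8)


# The two first lines quoted above, as raw bytes.
ancestry_line = b'0 HEAD\n'
myheritage_line = b'\xef\xbb\xbf0 HEAD\r\n'

print(starts_with_utf8_bom(ancestry_line))
print(starts_with_utf8_bom(myheritage_line))
```

To test a real file, read its first line in binary mode (`open(path, 'rb')`) and pass it to the helper.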

62mkv commented 6 years ago

Indeed, the .ged file from MyHeritage has BOM

Thanks!! Now it finally begins to parse. Either MyHeritage is shitty on formats, or it is allowed in 5.5.1, but there are multiline entries in the exported .ged file, which break the parser (Line 32 of document violates GEDCOM format)

62mkv commented 6 years ago

MyHeritage export is ridiculous!! It even splits Unicode words in half, so that the first byte is at the end of line N and the second one is at the beginning of line N+1

KeithPetro commented 6 years ago

@nickreynke I have prepared a (very simple) fix for this. All that's required to ignore BOM at the start of a UTF-8 encoded file is to decode with 'utf-8-sig' instead of 'utf-8'.
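
A minimal sketch of the difference between the two codecs, using the first line of the MyHeritage file as sample input (this only illustrates the decoding behaviour, it is not the actual patch):

```python
# First line of the MyHeritage export, as raw bytes (BOM + CRLF).
first_line = b'\xef\xbb\xbf0 HEAD\r\n'

# 'utf-8' keeps the BOM as the character U+FEFF, so the line no longer
# starts with the level number and fails the GEDCOM format check.
plain = first_line.decode('utf-8')

# 'utf-8-sig' drops a leading BOM if one is present.
stripped = first_line.decode('utf-8-sig')

print(repr(plain))
print(repr(stripped))
```

Data without a BOM decodes the same way under both codecs, so files from other sources should be unaffected.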

Edit: Some further reading on Byte Order Marks and GEDCOM would be worthwhile, as currently this project seems to only handle UTF-8. In the future, it would be nice to be able to handle ANSEL, UTF-8 as well as UTF-16 in order to be fully compliant with GEDCOM 5.5.1 standards. GEDCOM 5.5 does not have any requirements for UTF-16.

Further reading regarding character sets/encoding in GEDCOM.

KeithPetro commented 6 years ago

@62mkv I haven't experienced that issue. How are you testing that?

62mkv commented 6 years ago

for some reason, Ancestry.com was able to import MyHeritage GEDCOM file without visible defects...

@KeithPetro what do you mean by "how am I testing that"?

KeithPetro commented 6 years ago

@62mkv What are you using that is showing you that the words are split?

Ancestry's GEDCOM reading code is likely quite robust and allows for many variations (both valid and invalid) in GEDCOM files.

62mkv commented 6 years ago

Like this one:

2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович....  В поколении сына имя его был�
3 CONC � крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя  И в это время имя Кирилла ста

Cyrillic is weird but not THAT weird )) Those strange icons are just the parts of a Unicode word split across two lines. If I remove the CRLF and the 3 CONC item, it turns into

2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович....  В поколении сына имя его было крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя  И в это время имя Кирилла ста

As you can see, the two nonsensical chars are gone and a normal Cyrillic letter 'о' has appeared in their place

62mkv commented 6 years ago

(by "Unicode word" I mean a multi-byte Unicode sequence, describing single character; for Cyrillic it's two bytes)

KeithPetro commented 6 years ago

While weird, I think it is actually valid.

According to the GEDCOM standard, the CONC tag signifies concatenation: its value is joined to the value of the preceding line without a space and without the line terminator.

You could split a UTF-16 character mid-way and still properly concatenate it just fine.
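
A rough sketch of one way a reader could cope with this: join the CONC values as raw bytes first and decode only afterwards. The function `join_conc_lines` is hypothetical and heavily simplified (it ignores levels and cross-reference IDs); it is not how python-gedcom does it.

```python
def join_conc_lines(raw_lines):
    """Merge CONC continuation lines into the preceding line, as bytes.

    Hypothetical, simplified helper: decoding happens only after joining,
    so a multi-byte character split across two lines is reassembled first.
    """
    merged = []
    for raw in raw_lines:
        line = raw.rstrip(b'\r\n')
        _level, _, rest = line.partition(b' ')
        tag, _, value = rest.partition(b' ')
        if tag == b'CONC' and merged:
            merged[-1] += value  # no space, no line terminator
        else:
            merged.append(line)
    return [line.decode('utf-8') for line in merged]


encoded = 'было'.encode('utf-8')  # 8 bytes, two per Cyrillic letter
raw_lines = [
    b'2 NOTE ' + encoded[:7] + b'\r\n',  # ends with the first byte of 'о'
    b'3 CONC ' + encoded[7:] + b'\r\n',  # starts with the second byte of 'о'
]

print(join_conc_lines(raw_lines))
```

Decoding either of those two lines on its own with 'utf-8' raises UnicodeDecodeError, which is why the order of joining and decoding matters here.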

62mkv commented 6 years ago

Then it's again an issue with the parser, because it stops on every such line ("byte .. is not a valid utf-8 character")

joeyaurel commented 5 years ago

@62mkv and @KeithPetro the bug should be resolved by the current release v0.2.2dev. ✌