Closed 62mkv closed 5 years ago
What happens if you use gedcom.get_element_list() in place of gedcom.element_list()?
Edit: Also note that the current version of the project may not work with some aspects of GEDCOM 5.5.1 files (however it should definitely be able to deal with that first line).
@keithpetro by looking at the stack trace, one can see that the script does not even reach that statement, so I'm pretty sure it will change nothing
Regarding the format, FWIW the first line is identical, so it shouldn't fail there
Taking a closer look, I found that the constructor hits this error on the last line of my GEDCOM file if I remove the EOL characters. Perhaps it's an issue with what EOL characters are present in your file and/or how they are handled?
Edit: On a side note though, you should change your code to use gedcom.get_element_list(), as there is no element_list() method for the Gedcom class.
Edit: I made a quick family tree on MyHeritage and exported it for testing. I am experiencing the exact same issue you are, on the first line. What I find odd is that it appears that the line ends in a Carriage-Return and then a Line-Feed, which is completely valid (and I had no issues with a file exported from Ancestry which also used CRLF).
@nickreynke maybe you have an example of Gedcom file that this version can parse successfully? Can you share it? I would analyze the difference and possibly could adapt my file somehow, or update the code
I've done a bit more testing and found that the first line in a regular GEDCOM file (like the one I have from Ancestry) should be simply (in byte representation):
b'0 HEAD\n'
Whereas the file from MyHeritage has:
b'\xef\xbb\xbf0 HEAD\r\n'
0xEFBBBF is the BOM (Byte Order Mark) for UTF-8. This is outside of the GEDCOM spec, and I expect that any program able to read these files has to implement out-of-spec workarounds specifically for MyHeritage files.
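For anyone who wants to check their own file, here is a minimal sketch of the BOM test described above. The helper name has_utf8_bom is made up for illustration and is not part of python-gedcom; the two byte strings are the first lines quoted above.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF).

    Hypothetical helper for illustration, not part of python-gedcom.
    """
    return data.startswith(codecs.BOM_UTF8)


ancestry_first_line = b'0 HEAD\n'
myheritage_first_line = b'\xef\xbb\xbf0 HEAD\r\n'

print(has_utf8_bom(ancestry_first_line))    # False
print(has_utf8_bom(myheritage_first_line))  # True
```

On a real file, reading the first three bytes in binary mode ('rb') and passing them to the same check should be enough.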
Indeed, the .ged file from MyHeritage has BOM
Thanks!! Now it finally begins to parse. Either MyHeritage is shitty on formats or this is allowed in 5.5.1, but there are multiline entries in the exported .ged file, which break the parser (Line 32 of document violates GEDCOM format)
MyHeritage export is ridiculous!! It even splits Unicode words in half, so that the first byte is at the end of line N and the second one at the beginning of line N+1
@nickreynke I have prepared a (very simple) fix for this. All that's required to ignore BOM at the start of a UTF-8 encoded file is to decode with 'utf-8-sig' instead of 'utf-8'.
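To illustrate the difference between the two codecs, here is a small sketch using the MyHeritage first line quoted earlier. It only shows standard Python behaviour and is not the actual patch.

```python
raw = b'\xef\xbb\xbf0 HEAD\r\n'

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character,
# so the line no longer starts with the level number "0".
with_bom = raw.decode('utf-8')

# 'utf-8-sig' strips a leading BOM if one is present.
without_bom = raw.decode('utf-8-sig')

print(repr(with_bom))     # '\ufeff0 HEAD\r\n'
print(repr(without_bom))  # '0 HEAD\r\n'
```

Files without a BOM decode the same way under 'utf-8-sig' as under 'utf-8', so the switch should be safe for Ancestry-style exports too.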
Edit: Some further reading on Byte Order Marks and GEDCOM would be worthwhile, as currently this project seems to only handle UTF-8. In the future, it would be nice to be able to handle ANSEL, UTF-8 as well as UTF-16 in order to be fully compliant with GEDCOM 5.5.1 standards. GEDCOM 5.5 does not have any requirements for UTF-16.
Further reading regarding character sets/encoding in GEDCOM.
@62mkv I haven't experienced that issue. How are you testing that?
for some reason, Ancestry.com was able to import MyHeritage GEDCOM file without visible defects...
@KeithPetro what do you mean with "how am I testing that" ?
@62mkv What are you using that is showing you that the words are split?
Ancestry's GEDCOM reading code is likely quite robust and tolerates many variations (both valid and invalid) in GEDCOM files.
Like this one:
2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович.... В поколении сына имя его был�
3 CONC � крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя И в это время имя Кирилла ста
Cyrillic is weird but not THAT weird )) Those strange icons are just parts of Unicode word split on different lines. If I remove the CRLF and 3 CONC item, it turns into
2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович.... В поколении сына имя его было крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя И в это время имя Кирилла ста
As you can see, the two nonsensical chars are gone and a normal Cyrillic letter 'о' has appeared in their place
(by "Unicode word" I mean a multi-byte Unicode sequence, describing single character; for Cyrillic it's two bytes)
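A quick way to reproduce those two strange characters is the following sketch. It takes the Cyrillic letter 'о' (U+043E), splits its two UTF-8 bytes apart the way the export does, and decodes each piece on its own.

```python
letter = '\u043e'                 # Cyrillic small letter o
encoded = letter.encode('utf-8')  # two bytes in UTF-8
print(encoded)                    # b'\xd0\xbe'

first_half, second_half = encoded[:1], encoded[1:]

# Decoded separately, each half turns into U+FFFD (the replacement character).
print(first_half.decode('utf-8', errors='replace') == '\ufffd')   # True
print(second_half.decode('utf-8', errors='replace') == '\ufffd')  # True

# A strict decode of a lone half raises instead of substituting.
try:
    first_half.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# Putting the bytes back together restores the letter.
rejoined = (first_half + second_half).decode('utf-8')
print(rejoined == letter)  # True
```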
While weird, I think it is actually valid.
According to the GEDCOM standard, the CONC tag is meant to signify concatenation without saving the EOL characters prior to the line terminator.
You could split a UTF-16 character mid-way and still properly concatenate it just fine.
Then it's again an issue with parser, because it stops on all such lines ("byte .. is not a valid utf-8 character")
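One possible way around this, sketched below, is to glue CONC lines onto the preceding line while everything is still bytes and only decode afterwards. This is an assumption about how a parser could handle it, not how python-gedcom does; join_conc_bytes is a hypothetical helper, and it deliberately ignores levels, CONT lines and the BOM.

```python
def join_conc_bytes(raw_lines):
    """Merge CONC continuation lines into the previous line at the byte level.

    Hypothetical helper for illustration only. Each input item is one raw
    line (bytes) including its line terminator.
    """
    merged = []
    for raw in raw_lines:
        line = raw.rstrip(b'\r\n')
        # GEDCOM line layout: level, space, tag, space, value.
        parts = line.split(b' ', 2)
        if len(parts) >= 2 and parts[1] == b'CONC' and merged:
            value = parts[2] if len(parts) == 3 else b''
            merged[-1] += value
        else:
            merged.append(line)
    # Decode only after split multi-byte sequences have been reassembled.
    return [m.decode('utf-8') for m in merged]


text = 'было крайне'
encoded = text.encode('utf-8')
# Cut in the middle of the fourth letter: three whole letters plus one byte.
cut = len('был'.encode('utf-8')) + 1

raw_lines = [
    b'2 NOTE ' + encoded[:cut] + b'\r\n',
    b'3 CONC ' + encoded[cut:] + b'\r\n',
]
result = join_conc_bytes(raw_lines)
print(result)  # ['2 NOTE было крайне']
```

Decoding line by line would fail on both raw lines here, while decoding after the join works because the two halves of the letter are adjacent again.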
@62mkv and @KeithPetro the bug should be resolved by the current release v0.2.2dev. ✌
I've downloaded and installed python-gedcom v.0.2.0.dev. I run it as follows:
This GEDCOM file starts with
and I get the following error:
What am I doing wrong? This GEDCOM file has been exported from MyHeritage recently
UPD: this is with Python 3.6 under Windows 10 x64