joeyaurel / python-gedcom

Python module for parsing, analyzing, and manipulating GEDCOM files
https://gedcom.joeyaurel.dev
GNU General Public License v2.0

Support for MyHeritage and Ancestry generated GEDCOM files #6

Closed damonbrodie closed 5 years ago

damonbrodie commented 5 years ago

It seems that this is one of the only actively developed GEDCOM parsers in Python these days (2018). Ancestry seems to produce GEDCOM files that break the parsing:

python3 parse_gedcom.py 
Traceback (most recent call last):
  File "parse_gedcom.py", line 4, in <module>
    gedcom = Gedcom(file_path)
  File "/usr/local/lib/python3.7/site-packages/python_gedcom-0.2.0.dev0-py3.7.egg/gedcom/__init__.py", line 148, in __init__
  File "/usr/local/lib/python3.7/site-packages/python_gedcom-0.2.0.dev0-py3.7.egg/gedcom/__init__.py", line 224, in __parse
  File "/usr/local/lib/python3.7/site-packages/python_gedcom-0.2.0.dev0-py3.7.egg/gedcom/__init__.py", line 262, in __parse_line
SyntaxError: Line `65692` of document violates GEDCOM format 5.5

The lines in question are:

4 TEXT DOREY – Ethel Marie, 84, of Liverpool, passed away peacefully on Wednesday, July 27, 2016 in Queens Manor, Liverpool.
Born in Western Head, Queens County, she was a daughter of the late William an
5 CONC d Hilda (Guest) Wolfe.
Ethel was a former waitress at the Mersey Hotel in the late forties. She was a member of the coffee bowling league for thirty years and was a volunteer with the Canadian Red Cr

Notice the carriage return in the TEXT data that puts the next line "Born in Western Head..." onto a line by itself.
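
One way to cope with this before the file reaches the parser is sketched below. It is only a minimal illustration, not the approach taken in any patch: the function name `repair_stray_line_breaks`, the simplified line pattern, and the choice to turn each stray line into a `CONT` record are all assumptions made for this example.

```python
import re

# Simplified idea of how a GEDCOM line begins: a level number, an optional
# @XREF@ pointer, then a tag. This pattern is an assumption for this sketch,
# not the regular expression used by python-gedcom.
LINE_START = re.compile(r"^\s*(\d{1,2}) (?:@[^@\s]+@ )?([A-Za-z0-9_]+)(?: |$)")


def repair_stray_line_breaks(lines):
    """Turn lines that do not start like a GEDCOM line into CONT records.

    A stray line is assumed to belong to the text of the record before it.
    Limitation: stray text that happens to start with a number and a word
    (for example "84 years ...") is mistaken for a regular GEDCOM line.
    """
    repaired = []
    cont_level = None  # level a continuation of the previous record would get
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = LINE_START.match(line)
        if match:
            level, tag = int(match.group(1)), match.group(2)
            # A continuation of CONC/CONT stays on the same level; a
            # continuation of any other tag sits one level below it.
            cont_level = level if tag in ("CONC", "CONT") else level + 1
            repaired.append(line)
        elif not line.strip():
            continue  # blank lines are dropped in this sketch
        elif cont_level is None:
            repaired.append(line)  # nothing to attach to; leave it unchanged
        else:
            repaired.append(f"{cont_level} CONT {line}")
    return repaired


sample = [
    "4 TEXT DOREY - Ethel Marie, 84, of Liverpool.",
    "Born in Western Head, she was a daughter of the late William an",
    "5 CONC d Hilda (Guest) Wolfe.",
    "Ethel was a former waitress.",
]
for fixed_line in repair_stray_line_breaks(sample):
    print(fixed_line)
```

`CONT` is used rather than `CONC` because `CONT` stands for a continuation on a new line, which keeps the line break that was present in the original text.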

I believe that this breaks the GEDCOM format (though I have not researched this extensively in the spec). That being said, Ancestry is one of the largest genealogy providers and I think it would be ideal to have a parser that can parse the output from this provider.

I'm wondering if there is any interest in handling this use case here? If so, I can try to work up a patch and submit a PR.

I think there is a need for a GEDCOM parser that can read "real world" GEDCOM files.

damonbrodie commented 5 years ago

I've forked your repository and added new opt-in logic (disabled by default) that can handle the issues produced by MyHeritage and Ancestry. I've also started documenting the methods in the README. Once I've finished that, I'll submit a PR. If that is accepted, then I'd like to push the develop branch to master and publish the updated module.

damonbrodie commented 5 years ago

PR created. I've got more README updates to make, but I thought I would kick off the PR now so that it can be reviewed.

joeyaurel commented 5 years ago

Thank you @nomadyow! :) Reviewed the PR and merged it into master. Test files will be added at a later stage to make sure that the parser works as expected.