joeyaurel / python-gedcom

Python module for parsing, analyzing, and manipulating GEDCOM files
https://gedcom.joeyaurel.dev
GNU General Public License v2.0

Add parsers #53

Open joeyaurel opened 4 years ago

joeyaurel commented 4 years ago

New PR replacing the original PR #50 by @cdhorn, which conflicts with the latest version of the develop branch.

Todo list:

Original post:

Hi Nick, I have made a number of changes that I'm hoping you'll consider merging. It might have been better to implement each in a separate branch; I'm sort of new at this, so I apologize. To summarize them at a high level:

  • I removed the FileElement, as it is really a duplicate of ObjectElement, and added SourceElement, RepositoryElement, NoteElement, HeaderElement, SubmitterElement and SubmissionElement.
  • I added a set of subparsers for all of the various substructures in the standard within the given record types.
  • Added a get_record() method to all record elements that parses and returns the full record as structured data in a dict format. A lot of this is logic I needed for something else I'm starting to toy with and it seemed to make sense to me to have it in the base parser.
  • Added a Reader class that gives a couple simple methods to fetch all the records by type or all of them in one shot.
  • Added records.py with types for the Reader.
  • Broke exceptions out into errors.py.
  • Some more updates to tags.py to add a few more and fix some bugs/typos.
  • Added standards.py with links to the 5.5, 5.5.1, 5.5.1 GEDCOM-L, and 5.5.5 standards and used those when raising exceptions when applicable.
  • Added detect.py to detect the file encoding and the GEDCOM version. This adds a dependency on the chardet and ansel packages. It now opens and parses ANSEL files, although I am not 100% sure I handled it right. Because the codec is set when the file is opened, the file is no longer opened in binary mode, and I removed the utf-8-sig encoding code elsewhere. Please review those changes carefully; I've never really worked with different codecs and character sets before.
  • GEDCOM 5.5.5 has strict requirements around validating format and logical structure, so if a 5.5.5 file is detected it raises an exception, as the standard requires, although it can probably parse the format of such files fine. You can remove this if you think it should not be done.
  • Added type hints to just about everything, so they should no longer be needed in the docstrings.
  • Cleaned up many docstrings and expanded them in a few areas.

Thanks, Chris
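
For anyone reviewing the encoding and version detection items above, here is a minimal standard-library sketch of the general idea: read the declared character set and the GEDCOM version from the HEAD record, then refuse 5.5.5 files. It is not the code in detect.py. The names `sniff_header`, `reject_strict_version`, `UnsupportedVersionError` and `CHAR_TO_CODEC` are hypothetical, the codec mapping is an assumption, and the chardet fallback for files with no usable declaration is left out.

```python
import codecs
import re
from typing import Optional, Tuple

# Assumed mapping from GEDCOM CHAR values to Python codec names.
# "ansel" is not a built-in codec; it only works once a third-party
# package has registered a codec under that name.
CHAR_TO_CODEC = {
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",
    "ASCII": "ascii",
    "ANSEL": "ansel",
}

# level, optional @XREF@ pointer, tag, optional value
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:@[^@]+@\s+)?(\w+)(?:\s+(.*))?$")


class UnsupportedVersionError(Exception):
    """Hypothetical exception; the PR's errors.py may use another name."""


def sniff_header(raw: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Return (codec_name, gedcom_version) as declared in the HEAD record."""
    bom_codec = None
    if raw.startswith(codecs.BOM_UTF8):
        bom_codec = "utf-8-sig"
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        bom_codec = "utf-16"

    # Header tags are plain ASCII, so latin-1 is enough to read them
    # when no BOM is present (every byte maps to one character).
    text = raw.decode(bom_codec or "latin-1", errors="replace")

    declared, version, parent, in_head = None, None, None, False
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        level = int(match.group(1))
        tag = match.group(2)
        value = (match.group(3) or "").strip()
        if level == 0:
            if in_head:
                break  # the next level-0 record ends the header
            in_head = tag == "HEAD"
            continue
        if not in_head:
            continue
        if level == 1:
            parent = tag
            if tag == "CHAR":
                declared = value.upper()
        elif level == 2 and parent == "GEDC" and tag == "VERS":
            # Only GEDC.VERS is the GEDCOM version; SOUR.VERS is the
            # version of the program that wrote the file.
            version = value

    return bom_codec or CHAR_TO_CODEC.get(declared), version


def reject_strict_version(version: Optional[str]) -> None:
    """Raise for 5.5.5 files, which require strict validation."""
    if version == "5.5.5":
        raise UnsupportedVersionError("GEDCOM 5.5.5 files are not supported")
```

A caller would read the file in binary mode, pass the bytes to `sniff_header`, call `reject_strict_version` on the returned version, and only then reopen the file in text mode with the returned codec. A BOM takes priority over the CHAR declaration in this sketch, and UTF-16 files without a BOM are not handled.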