joeyaurel / python-gedcom

Python module for parsing, analyzing, and manipulating GEDCOM files
https://gedcom.joeyaurel.dev
GNU General Public License v2.0

Allow parsing files with UTF-8 BOM #5

Closed jbvsmo closed 5 years ago

jbvsmo commented 5 years ago

I don't know what the GEDCOM 5.5 format says about this, but for the sake of simplicity, and because many text editors add it by default, this code should detect and ignore a UTF-8 BOM at the start of the file.

It is very hard to work out why loading failed, because the only error message is "Line 1 of document violates GEDCOM format 5.5" and nothing more. Because these bytes are meant to be ignored, you can't see the problem on line 1 unless you load the file in Python and print a representation of that line.
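To illustrate why the problem is hard to spot, here is a minimal sketch that does not use the parser at all. The sample line `0 HEAD` is an assumed first line of a GEDCOM file, not taken from a real one. Decoding with plain `utf-8` keeps the BOM as the character U+FEFF, and only `repr()` makes it visible.

```python
import codecs

# Assumed first line of a GEDCOM file that was saved with a UTF-8 BOM.
raw_line = codecs.BOM_UTF8 + b"0 HEAD\n"

# Plain utf-8 decoding keeps the BOM as the character U+FEFF.
line = raw_line.decode("utf-8")

# print(line) would look like a normal "0 HEAD" line in most terminals;
# repr() escapes the hidden character so it can be seen.
line_repr = repr(line)
print(line_repr)

# The line does not start with the level digit a parser expects.
starts_with_bom = line.startswith("\ufeff")
print(starts_with_bom)
```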

One option is to use the utf-8-sig codec instead. https://docs.python.org/3/library/codecs.html#module-encodings.utf_8_sig
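A rough sketch of what that could look like, using only the standard library. The helper `read_first_line` and the sample file contents are made up for illustration and are not part of python-gedcom. The `utf-8-sig` codec drops a leading BOM if one is present and otherwise behaves like `utf-8`.

```python
import codecs
import os
import tempfile


def read_first_line(path, encoding):
    """Hypothetical helper: return the first line of a text file."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.readline()


# Write a small sample file that starts with a UTF-8 BOM (assumed contents).
with tempfile.NamedTemporaryFile("wb", suffix=".ged", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"0 HEAD\n1 CHAR UTF-8\n")
    path = tmp.name

try:
    with_bom = read_first_line(path, "utf-8")         # BOM kept as U+FEFF
    without_bom = read_first_line(path, "utf-8-sig")  # BOM stripped
finally:
    os.remove(path)

# Input without a BOM decodes the same way under utf-8-sig.
plain = b"0 HEAD\n".decode("utf-8-sig")
```

Since `utf-8-sig` also accepts input without a BOM, switching the codec should not affect files that already load correctly.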

joeyaurel commented 5 years ago

Hey @jbvsmo! Thank you for your issue.

The problem was resolved in issue #6, and a new version of the parser should be up really soon.

damonbrodie commented 5 years ago

I think this can be closed now - my previous commit now handles BOM.

Never mind - I see Nick commented on this already.

joeyaurel commented 5 years ago

It sure does :) I just published a new version @jbvsmo https://pypi.org/project/python-gedcom/