ownaginatious / fbchat-archive-parser

An application for parsing chat history from a Facebook data archive.
MIT License
312 stars 38 forks

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 5863, column 12969 #3

Closed jtrueblood88 closed 8 years ago

jtrueblood88 commented 8 years ago

Using the newest version. I'm new to Python so I'm not sure what this means.

Thank you!

  File "//anaconda/bin/fbcap", line 11, in <module>
    sys.exit(main())
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/main.py", line 66, in main
    app.run()
  File "//anaconda/lib/python3.5/site-packages/clip.py", line 652, in run
    self.invoke(self.parse(tokens))
  File "//anaconda/lib/python3.5/site-packages/clip.py", line 634, in invoke
    self._main.invoke(parsed)
  File "//anaconda/lib/python3.5/site-packages/clip.py", line 519, in invoke
    self._callback({k: v for k, v in iteritems(parsed) if k not in self._subcommands})
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/main.py", line 27, in fbcap
    progress_output=sys.stdout.isatty())
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/parser.py", line 98, in __init__
    self.__parse_content()
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/parser.py", line 107, in __parse_content
    for pos, element in ET.iterparse(self.stream, events=("start", "end")):
  File "//anaconda/lib/python3.5/xml/etree/ElementTree.py", line 1289, in __next__
    for event in self._parser.read_events():
  File "//anaconda/lib/python3.5/xml/etree/ElementTree.py", line 1272, in read_events
    raise event
  File "//anaconda/lib/python3.5/xml/etree/ElementTree.py", line 1230, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 5863, column 12969

ownaginatious commented 8 years ago

Could you post the entire stack trace?

jtrueblood88 commented 8 years ago

Just realized I should. See the edit above.

ownaginatious commented 8 years ago

Hmm, it would appear that the error is coming from the XML parsing library that's reading your messages.htm file. Are you sure the file is complete and/or not corrupted?
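For readers who want to reproduce this class of error outside the tool, here is a minimal sketch using only the standard library. The input bytes and the helper name `first_parse_error` are made up for illustration; a stray control character is just one possible way to trigger "invalid token" and is not confirmed to be the cause in this report.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: the raw \x08 byte is not a legal character in XML 1.0.
BROKEN = b"<html><body><p>ok</p><p>bad \x08 byte</p></body></html>"


def first_parse_error(data):
    """Return the (line, column) of the first ParseError, or None if data parses."""
    try:
        # Same style of streaming parse that appears in the traceback.
        for _event, _element in ET.iterparse(io.BytesIO(data), events=("start", "end")):
            pass
    except ET.ParseError as exc:
        # ParseError carries the location as a (line, column) tuple.
        return exc.position
    return None


if __name__ == "__main__":
    print(first_parse_error(BROKEN))
    print(first_parse_error(b"<html><body><p>ok</p></body></html>"))
```

The `position` attribute holds the same line and column numbers that appear at the end of the error message.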

jtrueblood88 commented 8 years ago

I think it's complete. I was able to use a different parser on it, but wanted to use yours as it has more options for output.

ownaginatious commented 8 years ago

I think I fixed the issue. Let me know if the newest version works for you (version 0.5).

jtrueblood88 commented 8 years ago

Hmm..I think I got the same error. And it happens even if I only ask it to parse a conversation with one person.

  File "//anaconda/bin/fbcap", line 11, in <module>
    sys.exit(main())
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/main.py", line 70, in main
    app.run()
  File "//anaconda/lib/python3.5/site-packages/clip.py", line 652, in run
    self.invoke(self.parse(tokens))
  File "//anaconda/lib/python3.5/site-packages/clip.py", line 634, in invoke
    self._main.invoke(parsed)
  File "//anaconda/lib/python3.5/site-packages/clip.py", line 519, in invoke
    self._callback({k: v for k, v in iteritems(parsed) if k not in self._subcommands})
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/main.py", line 31, in fbcap
    progress_output=sys.stdout.isatty())
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/parser.py", line 98, in __init__
    self.__parse_content()
  File "//anaconda/lib/python3.5/site-packages/fbchat_archive_parser/parser.py", line 108, in __parse_content
    for pos, element in ET.iterparse(f, events=("start", "end")):
  File "//anaconda/lib/python3.5/xml/etree/ElementTree.py", line 1289, in __next__
    for event in self._parser.read_events():
  File "//anaconda/lib/python3.5/xml/etree/ElementTree.py", line 1272, in read_events
    raise event
  File "//anaconda/lib/python3.5/xml/etree/ElementTree.py", line 1230, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 5863, column 12969

ownaginatious commented 8 years ago

Do you happen to have the line from your messages.htm file that it's crashing on? I think it may be an encoding error on Facebook's end.

jtrueblood88 commented 8 years ago

I think so..But what does it mean that it's column 12969? I don't see how there's that many columns..

ownaginatious commented 8 years ago

Well, the document coming from Facebook is one giant unformatted line of data, so perhaps that's not helpful.
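Even when a line is enormous, the reported position can still be used to pull out a small window around the offending spot. The sketch below is one way to do that; `context_at` is a hypothetical helper, and treating the column as a 0-based byte offset into the line is an assumption about how the parser counts.

```python
def context_at(data, line, column, width=40):
    """Return the bytes surrounding (line, column) in data.

    line is 1-based. column is treated as a 0-based byte offset into that
    line, which is an assumption about the parser's counting.
    """
    lines = data.split(b"\n")
    target = lines[line - 1]
    start = max(0, column - width)
    return target[start:column + width]


if __name__ == "__main__":
    # Made-up data: a short first line, then a long line with "!" at offset 100.
    sample = b"first\n" + b"x" * 100 + b"!" + b"y" * 100
    print(context_at(sample, 2, 100, width=5))
```

Against a real archive, one would read `messages.htm` in binary mode and call `context_at(raw, 5863, 12969)`. Printing the result as bytes makes any invisible control characters show up as escapes.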

I just pushed a new version (0.5.post4) to PyPI. If the issue is being caused by your document not being interpreted as UTF-8 for some reason, then that should help.
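As a rough illustration of forcing the encoding, the standard library's `XMLParser` accepts an `encoding` argument which, per its documentation, overrides the encoding specified in the document itself. This is only a sketch of the idea, not the code from the release; `parse_as_utf8` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def parse_as_utf8(data):
    """Parse bytes with the encoding forced to UTF-8 and return the root element."""
    parser = ET.XMLParser(encoding="utf-8")
    parser.feed(data)
    # close() finishes parsing and returns the root of the tree.
    return parser.close()


if __name__ == "__main__":
    root = parse_as_utf8("<p>caf\u00e9</p>".encode("utf-8"))
    print(root.text)
```

Forcing the encoding only helps when the bytes really are UTF-8 and were being misread. It does not make otherwise malformed markup parse.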

jtrueblood88 commented 8 years ago

Okay..I'll try that. Thanks so much btw. I'll let you know if it works.

jtrueblood88 commented 8 years ago

:( nope that didn't do it. I'll keep trying though.

ownaginatious commented 8 years ago

Hmm, I'm really not sure what the issue could be. Does the program run at all, or does it crash immediately?

jtrueblood88 commented 8 years ago

It seems to run fine. When I ask it to look at a specific conversation, it skips all the other ones and focuses on that one. I tried having it export to csv instead of stdout, but that didn't work. I'd go in and look at the code myself but I can't even find out the environment that fbcap works from..it's a bit too advanced for me.

ownaginatious commented 8 years ago

At this point, I think it may be because the XML parser being used is strict, and HTML does not necessarily qualify as strictly valid XML.

Unfortunately, the "less strict" drop-in replacement library lxml requires a lot of external dependencies that can be a pain to get installed. I'm going to try and implement something less efficient using BeautifulSoup as a fallback for situations like this. I'll respond to this ticket when it is ready.
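To make the strictness point concrete, here is a small sketch showing ordinary HTML constructs that a strict XML parser rejects. The snippets and the helper name `strict_xml_failures` are invented for illustration and are not taken from any real archive.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets: each is ordinary HTML but not well-formed XML.
HTML_SNIPPETS = {
    "void element left open": "<div>line one<br>line two</div>",
    "undeclared named entity": "<div>a&nbsp;b</div>",
    "unquoted attribute value": "<div class=message>hi</div>",
}


def strict_xml_failures(snippets):
    """Map each label to the ParseError message, or None if the snippet parsed."""
    results = {}
    for label, markup in snippets.items():
        try:
            ET.fromstring(markup)
            results[label] = None
        except ET.ParseError as exc:
            results[label] = str(exc)
    return results


if __name__ == "__main__":
    for label, message in strict_xml_failures(HTML_SNIPPETS).items():
        print(label, "->", message)
```

A browser or a lenient HTML parser accepts all three snippets, which is why a file can open fine elsewhere and still crash a strict XML parser.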

ownaginatious commented 8 years ago

Okay, the tool now falls back to BeautifulSoup if something goes wrong while parsing using the iterative streaming parser. Please let me know if it works for you now (version 0.6.post1).

eric-zhu94 commented 8 years ago

Just following up on this. The error is as follows:

The streaming parser crashed due to malformed XML. Falling back to the less strict/efficient BeautifulSoup parser. This may take a while...
Traceback (most recent call last):
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/fbchat_archive_parser/parser.py", line 122, in __parse_content
    parser=XMLParser(encoding=str('UTF-8'))):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1290, in __next__
    for event in self._parser.read_events():
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1273, in read_events
    raise event
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1231, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3613, column 71274

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/user/.virtualenvs/p3/bin/fbcap", line 11, in <module>
    sys.exit(main())
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/fbchat_archive_parser/main.py", line 73, in main
    app.run()
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/clip.py", line 652, in run
    self.invoke(self.parse(tokens))
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/clip.py", line 634, in invoke
    self._main.invoke(parsed)
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/clip.py", line 519, in invoke
    self._callback({k: v for k, v in iteritems(parsed) if k not in self._subcommands})
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/fbchat_archive_parser/main.py", line 34, in fbcap
    progress_output=sys.stdout.isatty())
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/fbchat_archive_parser/parser.py", line 98, in __init__
    self.__parse_content()
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/fbchat_archive_parser/parser.py", line 142, in __parse_content
    soup = BeautifulSoup(open(self.stream, 'r').read(), 'lxml')
  File "/Users/user/.virtualenvs/p3/lib/python3.5/site-packages/bs4/__init__.py", line 156, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Relevant line as viewed in Sublime Text:

[Screenshot: screen shot 2016-05-18 at 8 49 52 am]

eric-zhu94 commented 8 years ago

Never mind, it works with lxml installed.

ownaginatious commented 8 years ago

@friendswithsalad thanks, it looks like lxml, which the fallback parser requires, is a missing external dependency on your machine. It's a pain to install sometimes (especially on Windows), so I've replaced it with html.parser, which is built into Python. Let me know if it works for you now (0.6.post3).
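The general shape of a strict-then-lenient fallback that relies only on the standard library might look like the sketch below. It is a simplified stand-in, not the tool's actual code: the real tool streams with `iterparse` and goes through BeautifulSoup, whereas `TextCollector` and `extract_text` here are hypothetical names that call `html.parser.HTMLParser` directly.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Lenient fallback that gathers text nodes and tolerates sloppy markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup):
    """Return (parser_used, text_chunks), trying the strict XML parser first."""
    try:
        root = ET.fromstring(markup)
        return "strict", [t.strip() for t in root.itertext() if t.strip()]
    except ET.ParseError:
        # Strict parsing failed, so retry with the forgiving HTML parser.
        collector = TextCollector()
        collector.feed(markup)
        collector.close()
        return "lenient", collector.chunks


if __name__ == "__main__":
    print(extract_text("<div><p>hi</p><p>there</p></div>"))
    print(extract_text("<div><p>hi<br>there</p></div>"))
```

Trying the strict parser first keeps the fast path for well-formed input and pays the cost of the slower, more forgiving parser only when it is needed.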