New parser using json.loads aborts on some pages.

dethrophes commented 6 years ago

The json parser doesn't seem to cope with some of the pages. e.g. https://www.audible.com/pd/B00IZOP8CI

Invalid \escape: line 11 column 263 (char 443)

Specifically, it seems to be the \) in the following.

, "description": "Do you know why…a mortgage is literally a death pledge? …why guns have girls’ names? …why salt is related to soldier? You’re about to find out…<?-‘mä-lä-ji-kän) is:*Witty (wi-te\): Full of clever humor*Erudite (er-?-dit): Showing knowledge*Ribald (ri-b?ld): Crude, offensiveThe Etymologicon is a ce strange underpinnings of the English language. It explains: How you get from “gruntled” to “disgruntled”; why you are absolutely right to believe that your meager salary barely covers “main of coffee shops in the world (hint: Seattle) connects to whaling in Nantucket; and what precisely the Rolling Stones have to do with gardening. "
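The failure can be reproduced outside the plugin. This is a minimal sketch that assumes the standard-library json.loads rather than the plugin's own json_decode wrapper, and uses a shortened, hypothetical stand-in for the page's description field:

```python
import json

# Shortened stand-in for the ld+json block; \) is not a legal JSON escape.
raw = r'{"description": "Witty (wi-te\): Full of clever humor"}'

def try_parse(text):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        return json.loads(text), None
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return None, str(exc)

parsed, error = try_parse(raw)
print(error)
```

The parser rejects the whole document because of that single backslash, which is why the page aborts instead of just losing the description.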

macr0dev commented 6 years ago

What a weird book description. Boy, they've really snarfed things up with these new pages, though.

Jeez, well - I suppose we could try to sanitize it before it gets to the json parser. But by the time you can do that we might as well go back to using the regex and search functions.

macr0dev commented 6 years ago

Tests are looking good with the changes below. I just moved the data over to a regular variable and replaced the \ with nothing. If this gets out of hand, we might need a sub for sanitizing different things. Gonna commit this later.

    if date is None :
        for r in html.xpath('//script[contains (@type, "application/ld+json")]'):
            page_content = r.text_content()
            page_content = page_content.replace('\\', '')
            json_data=json_decode(page_content)

macr0dev commented 6 years ago

Commit is done and tested on my production box.

https://github.com/macr0dev/Audiobooks.bundle/commit/9a5462159fc5b2bfae6f1d5e6a5439749dd78678

dethrophes commented 6 years ago

You can't really just remove all escapes; some of them are valid sequences. So I'd be reluctant to include this change.
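To illustrate the objection, here is a small sketch (the sample string is made up, and json.loads stands in for the plugin's json_decode) showing a well-formed document that stops parsing once every backslash is stripped:

```python
import json

def parses(text):
    """True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# A well-formed document whose description uses legitimate escapes.
valid = r'{"description": "He said \"hi\"\nand left"}'

# Blanket removal of backslashes, as in the first fix.
stripped = valid.replace('\\', '')

print(parses(valid), parses(stripped))  # True False
```

Without the backslashes, the inner quotes terminate the string early and the rest of the value becomes a syntax error.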

On Thu, Nov 9, 2017 at 12:46 AM, macr0dev notifications@github.com wrote:

Commit is done and tested on my production box.


macr0dev commented 6 years ago

Well, it was a short-lived victory anyway. It fixed the book you found it in, but I'm starting to see problems with other books, both with and without the fix I came up with.

macr0dev commented 6 years ago

OK, did some digging and found the other problem and made a few more changes.

1) I've switched from just removing the backslash to escaping it with another backslash.

2) I've discovered that some of the descriptions have a hidden '\n' in them, so I'm just straight removing it.

I took the source from the pages and ran it through a JSON validator. Those newlines were definitely breaking things; they just aren't in every page. Escaping the backslash makes the validator happy, so that seems to be good. Honestly, I think the backslash appearing is a fluke from non-sanitized text making its way into their database. I don't think they're actually trying to escape anything with that backslash. So we'll just have to watch for backslash problems in the future and address it another way if it comes up.
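The newline problem can be checked the same way. A minimal sketch, assuming standard-library json.loads and a made-up two-line description: a raw line break inside a string value is a control character, which strict JSON does not allow.

```python
import json

def parses(text):
    """True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# A literal line break inside a string value (hypothetical sample text).
raw = '{"description": "first line\nsecond line"}'

print(parses(raw))                     # False
print(parses(raw.replace('\n', '')))   # True
```

Removing the character also removes line breaks between JSON tokens, which is harmless because whitespace there is optional.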

macr0dev commented 6 years ago

.... well. I stumbled across some books that actually have escaped characters. Looks like I'm gonna have to write something to check for backslashes that aren't escaping a legit character that needs it, and then escape just those. Which puts us back to neither eliminating NOR escaping all backslashes, and looking for specific instances of invalid data instead....
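A quick sketch of why doubling every backslash cannot work either (sample strings are made up, and json.loads stands in for the plugin's decoder): it repairs the stray backslash but breaks text that was escaped correctly to begin with.

```python
import json

def parses(text):
    """True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def double_all(text):
    """Escape every backslash with another backslash."""
    return text.replace('\\', '\\\\')

stray = r'{"description": "Witty (wi-te\)"}'       # invalid escape
legit = r'{"description": "a \"quoted\" word"}'    # valid escapes

print(parses(double_all(stray)))   # True
print(parses(double_all(legit)))   # False
```

In the second case the doubled backslash becomes a literal backslash, so the quote that follows it ends the string prematurely.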

macr0dev commented 6 years ago

OK. I've resigned myself for the moment to just removing that special case of backslash-paren for that one book. The newline is still removed, but I'm out of ideas at the moment on how to further sanitize the data before handing it to json.... gonna have to mull on this one.

Oh, and I added ratings in from audible to plex in a separate commit this morning.

macr0dev commented 6 years ago

OK. Finally built a regex that will remove any backslash UNLESS it is immediately followed by a character that JSON needs to be escaped. That should at least put the 'escaping' issue to bed. Who knows what crazy characters will pop up next.

@dethrophes, want to throw in an opinion on this one?

https://github.com/macr0dev/Audiobooks.bundle/commit/0f5db94a6fb5ca9e3e8c44dae0c5d86978a4fe40#diff-3ee84c02e62336e4581b8f124526c78b
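The approach described in this comment can be sketched roughly as follows. This is not the regex from the commit; the pattern and the helper name strip_invalid_escapes are illustrative assumptions, using a negative lookahead for the characters JSON permits after a backslash.

```python
import json
import re

# Hypothetical pattern: a backslash NOT followed by a JSON escape character.
INVALID_ESCAPE = re.compile(r'\\(?!["\\/bfnrtu])')

def strip_invalid_escapes(text):
    """Drop backslashes that do not start a recognised JSON escape."""
    return INVALID_ESCAPE.sub('', text)

bad = r'{"description": "Witty (wi-te\): a \"quoted\" word"}'
clean = strip_invalid_escapes(bad)
print(json.loads(clean)["description"])  # Witty (wi-te): a "quoted" word
```

A gap in this sketch: it looks at each backslash in isolation, so an already-escaped backslash followed by an ordinary character would still be mangled.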

dethrophes commented 6 years ago

According to http://json.org/ I think this would be better.

Might as well just compile it once:

    remove_inv_json_esc = re.compile(r'([^\\])(\\(?![bfnrt\'\"\/]|u[A-Fa-f0-9]{4}))')

    page_content = remove_inv_json_esc.sub(r'\1\\\2', page_content)

not perfect but close enough.
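One place where a pattern keyed on the character before the backslash may fall short is a run of consecutive backslashes. The following is a sketch of an alternative that is not from the thread: it consumes escapes pair by pair, keeps the ones JSON allows, and doubles any stray backslash. The names ESCAPE and escape_invalid_backslashes are hypothetical.

```python
import json
import re

# A backslash, optionally followed by something JSON allows after it.
ESCAPE = re.compile(r'\\(["\\/bfnrt]|u[0-9A-Fa-f]{4})?')

def escape_invalid_backslashes(text):
    """Double stray backslashes, leave valid JSON escapes untouched."""
    def fix(match):
        if match.group(1) is not None:
            return match.group(0)   # valid escape: keep as-is
        return '\\\\'               # stray backslash: escape it
    return ESCAPE.sub(fix, text)

sample = r'{"d": "te\) and \"q\" and c:\\)"}'
fixed = escape_invalid_backslashes(sample)
print(json.loads(fixed)["d"])
```

Because each valid escape is consumed as a unit, the second backslash of an escaped pair is never re-examined on its own.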

On Thu, Nov 9, 2017 at 10:49 PM, macr0dev notifications@github.com wrote:

OK. Finally built a regex that will remove any backslash UNLESS it is immediately followed by a character that JSON needs to be escaped. That should at least put the 'escaping' issue to bed. Who knows what crazy characters will pop up next.

@dethrophes https://github.com/dethrophes, want to throw in an opinion on this one?

0f5db94#diff-3ee84c02e62336e4581b8f124526c78b https://github.com/macr0dev/Audiobooks.bundle/commit/0f5db94a6fb5ca9e3e8c44dae0c5d86978a4fe40#diff-3ee84c02e62336e4581b8f124526c78b


macr0dev commented 6 years ago

might as well just compile it once.

Is that supposed to take care of needing to remove the new lines also and remove the need for this line?

page_content = page_content.replace('\n', '')

It seems to work just as well for removing erroneous escapes as mine, but the new line slips by it.
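The newline slips by because it is a raw control character in the text, not a backslash sequence, so a backslash pattern has nothing to match. A sketch with a made-up sample follows. The strict=False option is a standard-library json.loads feature; whether the plugin's json_decode exposes it is an assumption that would need checking.

```python
import json
import re

raw = '{"description": "first line\nsecond line"}'

# An escape-fixing pattern only looks at backslashes, so it changes nothing here.
untouched = re.sub(r'\\(?!["\\/bfnrtu])', '', raw)

# Alternative to stripping: let the parser accept control characters in strings.
lenient = json.loads(raw, strict=False)

print(untouched == raw)                 # True
print(repr(lenient["description"]))     # 'first line\nsecond line'
```

Unlike replace('\n', ''), the lenient parse keeps the line break in the decoded description instead of gluing the two lines together.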

macr0dev commented 5 years ago

Cleaning up old issues. This one is long resolved. Closing.
