Closed dethrophes closed 5 years ago
what a weird book description. Boy they've really snarfed things up with these new pages though.
Jeez, well - I suppose we could try to sanitize it before it gets to the json parser. But by the time you can do that we might as well go back to using the regex and search functions.
Tests are looking good with the changes below. I just moved the data over to a regular variable and replaced the \ with nothing. If this gets out of hand, we might need a sub for sanitizing different things. Gonna commit this later.
if date is None:
    for r in html.xpath('//script[contains (@type, "application/ld+json")]'):
        page_content = r.text_content()
        page_content = page_content.replace('\\', '')
        json_data = json_decode(page_content)
Commit is done and tested on my production box.
https://github.com/macr0dev/Audiobooks.bundle/commit/9a5462159fc5b2bfae6f1d5e6a5439749dd78678
You can't really just remove all escapes; they are valid sequences. So I'd be reluctant to include this change.
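To make that concern concrete, here is a minimal sketch using Python's standard json module. That is an assumption: the plugin goes through its own json_decode wrapper, which may behave differently. The payload is made up for illustration.

```python
import json

# Hypothetical payload: valid JSON that relies on legitimate escapes
# (two escaped quotes and an escaped newline).
raw = r'{"description": "He said \"hi\"\nand left"}'

# The untouched text parses, and the escapes are decoded.
good = json.loads(raw)

# Stripping every backslash turns \" into a bare quote (ending the string
# early) and \n into the letter n, so the text is no longer valid JSON.
stripped = raw.replace('\\', '')
try:
    json.loads(stripped)
    stripped_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    stripped_ok = False

print(repr(good["description"]))
print(stripped_ok)
```

So a blanket replace only helps on pages whose backslashes are all bogus; any page with a properly escaped quote would break instead.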
Well, it was a short-lived victory anyway. It fixed the book you found it in, but I'm starting to see problems with other books, both with and without the fix I came up with.
OK, did some digging and found the other problem and made a few more changes.
1) I've switched from just removing the backslash to escaping it with another backslash.
2) I've discovered that some of the descriptions have a hidden '\n' in them, so I'm just straight removing it.
I took the source from the pages and ran it through a JSON validator. Those newlines were definitely breaking things; they just aren't in every page. Escaping the backslash makes the validator happy, so that seems to be good. Honestly, I think the backslash appearing is a fluke from non-sanitized text making its way into their database. I don't think they're actually trying to escape anything with that backslash. So we'll just have to watch for backslash problems in the future and address it another way if it comes up.
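The two failure modes described here can be reproduced with a small sketch. It assumes Python's standard json module stands in for the validator, and the fragments are made up, modeled on the description text in this issue.

```python
import json

def parses(text):
    """Return True if the standard-library parser accepts text as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Made-up fragments modeled on the two problems described above.
bad_escape = r'{"description": "Witty (wi-te\): Full of clever humor"}'
raw_newline = '{"description": "line one\nline two"}'  # real newline in the string

print(parses(bad_escape))                        # False: \) is not a JSON escape
print(parses(raw_newline))                       # False: raw control character
print(parses(bad_escape.replace('\\', '\\\\')))  # True: backslash is now escaped
print(parses(raw_newline.replace('\n', '')))     # True: newline removed
```

Doubling every backslash only works for fragments like this one, where no backslash is part of a legitimate escape.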
OK. For the moment I've resigned myself to just removing that special case of backslash-paren for that one book. The newline is still removed, but I'm out of ideas on how to further sanitize the data before handing it to json. Gonna have to mull on this one.
Oh, and I added ratings in from audible to plex in a separate commit this morning.
OK. Finally built a regex that will remove any backslash UNLESS it is immediately followed by a character that JSON allows to be escaped. That should at least put the 'escaping' issue to bed. Who knows what crazy characters will pop up next.
@dethrophes, want to throw in an opinion on this one?
according to http://json.org/ I think this would be better
might as well just compile it once.
remove_inv_json_esc = re.compile(r'([^\\])(\\(?![bfnrt\'\"\\/]|u[A-Fa-f0-9]{4}))')
page_content = remove_inv_json_esc.sub(r'\1\\\2', page_content)
not perfect but close enough.
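For reference, a self-contained sketch of the same idea follows. It is a variation, not the code from the commit: it uses a lookbehind instead of a capture group, follows the escape list on json.org strictly (so \' is not treated as valid), and assumes the standard json module. The sample fragment and the helper name escape_stray_backslashes are made up.

```python
import json
import re

# Matches a backslash that does not start one of the escapes listed on
# json.org: \" \\ \/ \b \f \n \r \t or \uXXXX. The lookbehind keeps the
# second backslash of a valid \\ pair from being matched.
INVALID_JSON_ESCAPE = re.compile(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9A-Fa-f]{4})')

def escape_stray_backslashes(text):
    """Double each backslash that is not part of a valid JSON escape."""
    return INVALID_JSON_ESCAPE.sub(r'\\\\', text)

# Made-up fragment: one stray backslash plus two legitimate \" escapes.
sample = r'{"description": "Witty (wi-te\): \"clever\" humor"}'
fixed = escape_stray_backslashes(sample)
data = json.loads(fixed)
print(data["description"])  # Witty (wi-te\): "clever" humor
```

Like the original, this is still not perfect: a stray backslash that directly follows a valid \\ pair is skipped by the lookbehind.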
https://github.com/macr0dev/Audiobooks.bundle/commit/0f5db94a6fb5ca9e3e8c44dae0c5d86978a4fe40#diff-3ee84c02e62336e4581b8f124526c78b
might as well just compile it once.
Is that supposed to take care of needing to remove the new lines also and remove the need for this line?
page_content = page_content.replace('\n', '')
It seems to work just as well for removing erroneous escapes as mine, but the new line slips by it.
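The newline slips by because it is a literal control character, not a backslash sequence, so a regex that only looks at backslashes never sees it. A minimal sketch of two ways to handle it, assuming Python's standard json module (whether the plugin's json_decode wrapper exposes a strict option is unknown); the fragment is made up:

```python
import json

# Made-up fragment with a real newline character inside a string value.
page_content = '{"description": "first line\nsecond line"}'

# Option 1: drop the newline before parsing, as done in this thread.
without_newline = json.loads(page_content.replace('\n', ''))

# Option 2: leave the text alone and relax the parser. With strict=False the
# standard-library decoder accepts control characters inside strings.
relaxed = json.loads(page_content, strict=False)

print(repr(without_newline["description"]))  # 'first linesecond line'
print(repr(relaxed["description"]))          # 'first line\nsecond line'
```

Option 1 glues the surrounding words together, while option 2 keeps the line break in the parsed description.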
Cleaning up old issues. This one is long resolved. Closing.
The json parser doesn't seem to cope with some of the pages. e.g. https://www.audible.com/pd/B00IZOP8CI
Invalid \escape: line 11 column 263 (char 443)
Specifically, it seems to be the \) in the following:
"description": "<p>Do you know why…</p><p>a mortgage is literally a death pledge? …why guns have girls’ names? …why salt is related to soldier?</p> You’re about to find out…<?-‘mä-lä-ji-kän) is:</p><p>*Witty (wi-te\): Full of clever humor</p><p>*Erudite (er-?-dit): Showing knowledge</p><p>*Ribald (ri-b?ld): Crude, offensive</p><p><i>The Etymologicon</i> is a ce strange underpinnings of the English language. It explains: How you get from “gruntled” to “disgruntled”; why you are absolutely right to believe that your meager salary barely covers “main of coffee shops in the world (hint: Seattle) connects to whaling in Nantucket; and what precisely the Rolling Stones have to do with gardening. </p>"