mozilla / readability

A standalone version of the readability lib
Other
8.62k stars 588 forks source link

Lifehacker article missing lots of content #76

Closed leibovic closed 9 years ago

leibovic commented 9 years ago

http://lifehacker.com/how-to-program-your-mind-to-stop-buying-crap-you-don-t-1690268064

n1k0 commented 9 years ago

I don't reproduce, here's what I get; what's missing on your end?

I wish we got a much cleaner HTML, I'm actually finding that we're still including too much here :)

n1k0 commented 9 years ago

Just tested with latest fx-team, lots of contents missing indeed; though that's fixed entirely with updating the Readability.js version from latest master.

leibovic commented 9 years ago

It's still not working for me on the latest fx-team, which includes the latest Readability.js changes. I wonder what could be happening differently for us.

pdehaan commented 9 years ago

Here's what I see on blank_skitch_document

how_to_program_your_mind_to_stop_buying_crap_you_don t_need

pdehaan commented 9 years ago

I'm using today's FF Nightly, but is there any way to have Readability.js spit out a version number? Or maybe we don't have some Readability.VERSION constant...

pdehaan commented 9 years ago

I can't explain it, but it seemed to render fine/better in a different tab, but the original tab still shows the bad output and "Flagged". :uber_shrug:

how_to_program_your_mind_to_stop_buying_crap_you_don t_need

n1k0 commented 9 years ago

I'm getting this (fx-team)

screen shot 2015-03-24 at 22 59 38

Now that's worrying if we all get different results… :x

n1k0 commented 9 years ago

OH reproduced your issue by opening the link several times in new tabs

screen shot 2015-03-24 at 23 05 52

Huuuuuuuuuh that doesn't sound good at all…

n1k0 commented 9 years ago

The Flagged html elements are

<p class="js_reply-flagged reply-flagged hide">Flagged</p>

I couldn't find this in original article html, though I could find references of a FlaggedReplies module in their code; so my suspicion is that it's generated by their js :/

gijsk commented 9 years ago

Locally I made Firefox save a broken and working HTML bit for this page, but test-generate.js chokes on the broken case with actual errors (rather than producing the "wrong" output from the screenshot, as Firefox did). I haven't yet figured out how/why. :-\

gijsk commented 9 years ago

(oh, and then I tried to pastebin these files so everyone could have a look, and ./mach pastebin failed with encoding errors..)

gijsk commented 9 years ago

The pull request seems to also fix the issue reported here, at least for me.

n1k0 commented 9 years ago

Using very latest Readability.js master in very latest fx-team, I'm now getting Failed to load article from page for that URL with this error:

*************************
A coding exception was thrown and uncaught in a Task.

Full message: TypeError: node.parentNode is null
Full stack: Readability.prototype._grabArticle@resource://gre/modules/reader/Readability.js:545:13
Readability.prototype.parse@resource://gre/modules/reader/Readability.js:1594:26
Agent.parseDocument@resource://gre/modules/reader/ReaderWorker.js:48:12
worker.dispatch@resource://gre/modules/reader/ReaderWorker.js:21:24
anonymous/AbstractWorker.prototype.handleMessage@resource://gre/modules/workers/PromiseWorker.js:122:16
@resource://gre/modules/reader/ReaderWorker.js:35:41

*************************
n1k0 commented 9 years ago

Damnit, it seems to be intermittent :/

gijsk commented 9 years ago

Are you using the latest JSDOMParser.js copy as well? ( https://github.com/gijsk/readability/commit/bc7873d3132600ba2a09a3249bc314ab8b840488 was meant to fix the parentNode issue)

n1k0 commented 9 years ago

I've updated it and it seems to have fixed the parentNode issue, indeed :)

I can still get the Failed to load article from page error randomly, for some other reason… Investigating it.

n1k0 commented 9 years ago

OK too hard to reproduce consistently, and when I do, the catch clause in ReaderMode#_readerParse is never visited. The only messages I could spot are:

Reader: Not parsing URI scheme: about
Reader: Reader mode disabled for URI

And as parseDocument returns null in such a case, the generic error message is displayed. Dunno if that's an actual issue, thoughts?

gijsk commented 9 years ago

I mean, if we think the article should be reader-able, I would say it's an issue if we're returning null... can you pastebin / gist / "upload" a copy of the (Firefox-XML-serialized) HTML that's failing?

n1k0 commented 9 years ago

As I poorly tried to explain, in that case we're not even reaching the moment we're serializing the document, because parseDocument simply returns null if the URI scheme isn't supported. Which is annoying while you're trying to get readable content from a webpage (which isn't probably loaded just yet when this occurs, I suspect).

But maybe I don't properly understand the flow here?

gijsk commented 9 years ago

parseDocument is the readability-invoking-bit. I want the stringified document it gets passed, ie serializedDoc from http://hg.mozilla.org/mozilla-central/annotate/e5b72a8edb82/toolkit/components/reader/ReaderMode.jsm#l248 .

gijsk commented 9 years ago

More high-level, those error messages don't make a lot of sense... are these the things you're seeing when you're reproducing this? Because I didn't think that'd show the "failure" case - it basically implies we're trying to invoke reader mode either on about:blank/about:newtab which is loaded before the site is loaded, or on about:reader (ie we try to reader-ize the page twice or something). Neither of these should be happening...

gijsk commented 9 years ago

This WFM on current trunk. @n1k0 can you clarify if this is fixed for you or not?

n1k0 commented 9 years ago

Hmm, nope, couldn't reproduce using very latest versions. I think we can safely close this.