j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Fetching more than 20 sections? #205

Open mart-e opened 5 years ago

mart-e commented 5 years ago

Hello,

I am trying to grab a very long text: The birth and death of a bike company: What happened to SpeedX? | CyclingTips

While trying to fetch it on wallabag or f43.me, I get only the first 20 sections of the content.

The end of the grabbed content is:

If you ever want to know how it feels when you’re in a start-up that goes bust – when hundreds of people lose their jobs and everything they’ve worked so hard to build – just ask a former Bluegogo or SpeedX employee. They’ll tell you; it feels like total despair.

Or, if you check the page source, it is in the element with class et_pb_section_19 (numbering starts at 0), so I assume the parser stops after 20 blocks of content (but maybe it is unrelated).

EDIT: It is not really the first 20 sections; only the 20th is actually grabbed. Investigating a bit more.

I tried using a custom siteconfig but got the same result:

title: //head/title
body: //div[hasclass('et_builder_outer_content')]

[Screenshot of the result]

Any idea?

j0k3r commented 5 years ago

I checked the debug log tab on f43.me, and it seems the content is truncated during cleanupHtml. It's weird, though, that the content is really small in that log line... :thinking:

mart-e commented 5 years ago

Thanks for looking into it. I was playing around with the XPath, and using the id to grab the text seems to work better:

title: //head/title
body: //div[@id='et_builder_outer_content']

It is strange, as the class is only present once, so it should be similar, right?
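
As a sanity check of that expectation, here is a minimal sketch in Python (standard library only) comparing an id lookup with a class-token lookup. The markup is a made-up stand-in for the real page, including the hypothetical extra class `et_pb_gutters3`, and `has_class` is an illustrative helper, not graby's implementation. Note that `hasclass()` is not part of the XPath 1.0 core function library, so how a given engine handles it is an open question here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real page).
HTML = """<html><body>
<div id="et_builder_outer_content" class="et_builder_outer_content et_pb_gutters3">
  <div class="et_pb_section et_pb_section_0">first section</div>
  <div class="et_pb_section et_pb_section_19">last section</div>
</div>
</body></html>"""

root = ET.fromstring(HTML)


def has_class(element, name):
    """Token-wise class match, like the usual XPath 1.0 idiom:
    contains(concat(' ', normalize-space(@class), ' '), ' name ')
    """
    return name in element.get("class", "").split()


# Lookup by id, equivalent to //div[@id='et_builder_outer_content'].
by_id = root.findall(".//div[@id='et_builder_outer_content']")

# Lookup by class token, which is what a hasclass-style helper is meant to do.
by_class = [d for d in root.iter("div") if has_class(d, "et_builder_outer_content")]

# A naive exact comparison on @class misses elements carrying several classes.
exact = root.findall(".//div[@class='et_builder_outer_content']")

print(len(by_id), len(by_class), len(exact))  # prints "1 1 0"
```

If the class really appears only once, both lookups should select the same single node, so a difference in the grabbed content would point at how the expression is evaluated rather than at the markup.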

j0k3r commented 5 years ago

Should be, yep.

mart-e commented 5 years ago

So, comparing both XPath expressions, I get:

Using hasclass:

Using @id=:

So I guess the following code is executed and Readability is used as a fallback (and it has a bad parser):

https://github.com/j0k3r/graby/blob/39e9a8b687503fc030d4202f7f04e2e6418cef57/src/Extractor/ContentExtractor.php#L517-L527

But why did the XPath return 1 for the second expression but not for the first one?

j0k3r commented 5 years ago

Are you sure about the quoted code?