idio / jusText

Heuristic-based boilerplate removal tool
https://pypi.python.org/pypi/jusText
BSD 2-Clause "Simplified" License
Treat lists as big paragraphs #3

Open idio-devops opened 9 years ago

idio-devops commented 9 years ago

Assembla: https://www.assembla.com/spaces/idio-platform/tickets/5199 Reporter: @keynmol

Given that the number of pages with missing list elements is increasing, it's worth revisiting the algorithm vanilla jusText uses to classify list elements (it classifies each one separately).

The idea was briefly described here: https://github.com/idio/jusText/issues/1 and successfully tested in a similar concept in Mal's JS scraper.
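
A minimal sketch of the merging step, to make the proposal concrete. This is not jusText's actual code: `Block`, `merge_list_items` and `link_density` are hypothetical names, and the sketch assumes the page has already been split into tagged text blocks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a parsed text block (not jusText's own class)."""
    tag: str                 # e.g. "p" or "li"
    text: str
    chars_in_links: int = 0  # number of characters that sit inside <a> elements


def merge_list_items(blocks: List[Block]) -> List[Block]:
    """Collapse each run of consecutive "li" blocks into a single "list" block."""
    merged: List[Block] = []
    run: List[Block] = []

    def flush() -> None:
        # Emit the pending run of list items as one big block.
        if run:
            merged.append(Block(
                tag="list",
                text=" ".join(b.text for b in run),
                chars_in_links=sum(b.chars_in_links for b in run),
            ))
            run.clear()

    for block in blocks:
        if block.tag == "li":
            run.append(block)
        else:
            flush()
            merged.append(block)
    flush()
    return merged


def link_density(block: Block) -> float:
    """Share of the block's characters that belong to links."""
    return block.chars_in_links / len(block.text) if block.text else 0.0
```

With this shape, the classifier would see one long block per list instead of many short ones. Navigation menus are merged in exactly the same way, so a signal such as link density would probably still be needed to tell them apart from content lists.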

idio-devops commented 9 years ago

/cc @keynmol, @pierslowe

idio-devops commented 9 years ago

At 2015-04-13T08:35:53.000+00:00, @keynmol said:

On hold, because a simple solution greatly reduces the precision of paragraph classification due to list-based navigation menus and other website elements.

Clumping list elements into one big super-element also affects classification of surrounding paragraphs, which screws up everything.