miso-belica / jusText

Heuristic based boilerplate removal tool
https://pypi.python.org/pypi/jusText
BSD 2-Clause "Simplified" License

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/

jusText

.. image:: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml/badge.svg
  :target: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is `designed <doc/algorithm.rst>`_ to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. You can `try it online <http://nlp.fi.muni.cz/projects/justext/>`_.

This is a fork of the original (currently unmaintained) jusText_ code hosted on Google Code.

Adaptations of the algorithm to other languages:

Some libraries using jusText:

Some currently (Jan 2020) maintained alternatives:

Installation

Make sure you have Python 2.7+/3.5+ and `pip <https://pip.pypa.io/en/stable/>`_ (`Windows <http://docs.python-guide.org/en/latest/starting/install/win/>`_, `Linux <http://docs.python-guide.org/en/latest/starting/install/linux/>`_) installed. Then run:

.. code-block:: bash

    $ [sudo] pip install justext

Dependencies

::

    lxml (version depends on your Python version)

Usage

.. code-block:: bash

    $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
    $ python -m justext -s English -o plain_text.txt english_page.html
    $ python -m justext --help # for more info

Python API

.. code-block:: python

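    import requests
    import justext

    # Download the page and classify its paragraphs using the English stoplist.
    response = requests.get("http://planet.python.org/")
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

    # Print only the paragraphs that were not classified as boilerplate.
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            print(paragraph.text)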
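
A common next step is to join the kept paragraphs into a single string. The following is only a minimal sketch: ``main_text`` is a hypothetical helper, not part of jusText, and it assumes nothing beyond the ``text`` and ``is_boilerplate`` attributes used above. The sample objects are stand-ins for real paragraphs so that the snippet runs without network access.

.. code-block:: python

    from types import SimpleNamespace


    def main_text(paragraphs, separator="\n\n"):
        """Join the text of all paragraphs not marked as boilerplate."""
        return separator.join(p.text for p in paragraphs if not p.is_boilerplate)


    # Stand-in objects that mimic the two paragraph attributes used above.
    sample = [
        SimpleNamespace(text="Home | About | Contact", is_boilerplate=True),
        SimpleNamespace(text="jusText keeps paragraphs made of full sentences.", is_boilerplate=False),
        SimpleNamespace(text="Copyright 2020", is_boilerplate=True),
        SimpleNamespace(text="Navigation links and footers are dropped.", is_boilerplate=False),
    ]

    print(main_text(sample))

With real input, the same helper could be called on the list returned by ``justext.justext(...)`` in the example above.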

Testing

Run tests via

.. code-block:: bash

    $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9

Acknowledgements

.. _Natural Language Processing Centre: http://nlp.fi.muni.cz/en/nlpc
.. _Masaryk University in Brno: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _Lexical Computing Ltd.: http://lexicalcomputing.com/
.. _PhD research: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software was developed at the `Natural Language Processing Centre`_ of `Masaryk University in Brno`_ with financial support from PRESEMT_ and `Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.