medialab / minet

A webmining CLI tool & library for python.
GNU General Public License v3.0

No handling of dragnet errors #74

Closed sally14 closed 5 years ago

sally14 commented 5 years ago

While running minet extract report.csv > articles.csv, an error raised by dragnet crashes the whole extract command:

Extracting content: 4484 docs [00:27, 173.25 docs/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "dragnet/blocks.pyx", line 846, in dragnet.blocks.Blockifier.blockify
  File "src/lxml/parser.pxi", line 1689, in lxml.etree.HTMLParser.__init__
  File "src/lxml/parser.pxi", line 823, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/salome/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/salome/miniconda3/lib/python3.7/site-packages/minet/cli/extract.py", line 49, in worker
    content = extract_content(raw_html)
  File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/__init__.py", line 13, in extract_content
    return _LOADED_MODELS['content'].extract(html, encoding=encoding, as_blocks=as_blocks)
  File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/extractor.py", line 169, in extract
    preds, blocks = self.predict(html, encoding=encoding, return_blocks=True)
  File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/extractor.py", line 189, in predict
    return self._predict_one(documents, **kwargs)
  File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/extractor.py", line 207, in _predict_one
    blocks = self.blockifier.blockify(document, encoding=encoding)
  File "dragnet/blocks.pyx", line 887, in dragnet.blocks.TagCountNoCSSReadabilityBlockifier.blockify
  File "dragnet/blocks.pyx", line 849, in dragnet.blocks.Blockifier.blockify
dragnet.blocks.BlockifyError: Could not blockify HTML
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/salome/miniconda3/bin/minet", line 10, in <module>
    sys.exit(main())
  File "/home/salome/miniconda3/lib/python3.7/site-packages/minet/cli/__main__.py", line 187, in main
    fn(args)
  File "/home/salome/miniconda3/lib/python3.7/site-packages/minet/cli/extract.py", line 86, in extract_action
    for error, line, content in pool.imap_unordered(worker, files):
  File "/home/salome/miniconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
dragnet.blocks.BlockifyError: Could not blockify HTML

Thank you
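
For illustration, here is a minimal sketch of one way a pool worker could report extraction failures instead of letting them escape and abort the run. It is not minet's actual code: `BlockifyError` and `extract_content` below are local stand-ins so the sketch runs without dragnet, and the `(error, line, content)` shape is only inferred from the loop visible in the traceback above.

```python
from multiprocessing import Pool


class BlockifyError(Exception):
    """Stand-in for dragnet.blocks.BlockifyError (assumption, keeps the sketch runnable)."""


def extract_content(raw_html):
    # Hypothetical stand-in for dragnet's extractor: fails on empty input.
    if not raw_html.strip():
        raise BlockifyError("Could not blockify HTML")
    return raw_html.strip()


def worker(payload):
    # Catch any extraction error and return it as data, so a single bad
    # document cannot take down the whole pool.
    line, raw_html = payload
    try:
        return None, line, extract_content(raw_html)
    except Exception as error:
        # A plain string is always picklable across processes.
        return f"{type(error).__name__}: {error}", line, None


if __name__ == "__main__":
    files = [(0, "<p>ok</p>"), (1, ""), (2, "<p>also ok</p>")]
    with Pool(2) as pool:
        for error, line, content in pool.imap_unordered(worker, files):
            if error is not None:
                print(f"line {line}: skipped ({error})")
                continue
            print(f"line {line}: {content}")
```

With this shape, the consuming loop decides what to do with a failed document (skip it, log it, write an empty cell) rather than crashing on the first one.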

Yomguithereal commented 5 years ago

Can you share the CSV line that produces this error? You can use -p 1 to run on a single CPU and process the file sequentially, which makes it easier to find the culprit.
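
As a rough sketch of the same idea outside the CLI, the rows can be scanned one by one and the failing ones recorded. Everything here is an assumption made for illustration: the `url` and `html` column names, the `fake_extractor` function, and the in-memory sample; the real report columns and the real extractor may differ.

```python
import csv
import io


def find_failing_rows(rows, extractor):
    """Run extractor on each row in order and collect (row_number, error) for failures."""
    failures = []
    for number, row in enumerate(rows, start=1):
        try:
            extractor(row)
        except Exception as error:
            failures.append((number, f"{type(error).__name__}: {error}"))
    return failures


def fake_extractor(row):
    # Hypothetical extractor: rejects rows whose "html" column is empty.
    if not row["html"]:
        raise LookupError("unknown encoding")
    return row["html"]


# Hypothetical two-row report; row numbers count data rows, not the header.
SAMPLE = "url,html\nhttp://a.example,<p>fine</p>\nhttp://b.example,\n"

failures = find_failing_rows(csv.DictReader(io.StringIO(SAMPLE)), fake_extractor)
for number, message in failures:
    print(f"row {number} failed: {message}")
```

Because the scan is sequential, the reported row number maps directly back to a line in the input file.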