While running minet extract report.csv > articles.csv, a dragnet error crashes the extract program:
Extracting content: 4484 docs [00:27, 173.25 docs/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "dragnet/blocks.pyx", line 846, in dragnet.blocks.Blockifier.blockify
File "src/lxml/parser.pxi", line 1689, in lxml.etree.HTMLParser.__init__
File "src/lxml/parser.pxi", line 823, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/salome/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/salome/miniconda3/lib/python3.7/site-packages/minet/cli/extract.py", line 49, in worker
content = extract_content(raw_html)
File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/__init__.py", line 13, in extract_content
return _LOADED_MODELS['content'].extract(html, encoding=encoding, as_blocks=as_blocks)
File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/extractor.py", line 169, in extract
preds, blocks = self.predict(html, encoding=encoding, return_blocks=True)
File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/extractor.py", line 189, in predict
return self._predict_one(documents, **kwargs)
File "/home/salome/miniconda3/lib/python3.7/site-packages/dragnet/extractor.py", line 207, in _predict_one
blocks = self.blockifier.blockify(document, encoding=encoding)
File "dragnet/blocks.pyx", line 887, in dragnet.blocks.TagCountNoCSSReadabilityBlockifier.blockify
File "dragnet/blocks.pyx", line 849, in dragnet.blocks.Blockifier.blockify
dragnet.blocks.BlockifyError: Could not blockify HTML
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/salome/miniconda3/bin/minet", line 10, in <module>
sys.exit(main())
File "/home/salome/miniconda3/lib/python3.7/site-packages/minet/cli/__main__.py", line 187, in main
fn(args)
File "/home/salome/miniconda3/lib/python3.7/site-packages/minet/cli/extract.py", line 86, in extract_action
for error, line, content in pool.imap_unordered(worker, files):
File "/home/salome/miniconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
dragnet.blocks.BlockifyError: Could not blockify HTML
Can you give the CSV line producing this error? You can use -p 1 to run on a single CPU and process the file sequentially, which makes it easier to find the culprit.
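If it helps narrow things down, here is a minimal sketch of the same idea in plain Python: run the extractor over the documents one by one and collect the ones that raise, instead of stopping at the first failure. The function name find_failing_documents and the (label, raw_html) pairing are hypothetical, not part of minet or dragnet. The extract argument stands for whatever callable you use (for instance dragnet's extract_content, which is the call that raised in the traceback above).

```python
def find_failing_documents(documents, extract):
    """Run `extract` sequentially on (label, raw_html) pairs.

    Returns a list of (label, error) pairs for the documents that raised,
    so a single bad document does not abort the whole run.
    Hypothetical helper: not part of minet or dragnet.
    """
    failures = []
    for label, raw_html in documents:
        try:
            extract(raw_html)
        except Exception as error:
            # In the traceback above, this is where BlockifyError surfaced.
            failures.append((label, error))
    return failures


if __name__ == "__main__":
    # Stand-in extractor so the sketch runs without dragnet installed.
    def fake_extract(raw_html):
        if not raw_html.strip():
            raise ValueError("Could not blockify HTML")
        return raw_html

    docs = [("line 1", "<p>ok</p>"), ("line 2", ""), ("line 3", "<p>fine</p>")]
    for label, error in find_failing_documents(docs, fake_extract):
        print(label, repr(error))
```

The labels can be CSV line numbers or file names, whichever makes the failing document easiest to share here.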
Thank you