mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

indexer/workers/parser.py: use lxml iterparse w/ NEED_CANONICAL_URL (PROPOSAL FOR DISCUSSION) #342

Closed: philbudne closed this 1 month ago

philbudne commented 1 month ago

Possible degenerate case: a document that doesn't start with a "feed tag" and doesn't have a head section will be parsed in its entirety before being discarded. The code does not try to handle errors: the story will be requeued and then quarantined.

Takes < 1 ms for either. Tested on documents that trafilatura and readability chew on for about an hour (at 100% CPU) before coming up empty.
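
To illustrate the early-exit idea, here is a minimal sketch. It is not the code in this PR: it swaps in the standard library's `html.parser` for lxml's `iterparse` so that it runs without third-party packages, and the names `FEED_TAGS`, `HeadScanner`, and `scan_head` are hypothetical. The contents of `FEED_TAGS` are an assumption about what counts as a "feed tag", and stopping at the start of the body is an extra shortcut of this sketch.

```python
from html.parser import HTMLParser

# Assumption: root tags treated as "feed tags" (HTMLParser lowercases names).
FEED_TAGS = {"rss", "feed", "rdf:rdf"}


class _Done(Exception):
    """Raised to abandon parsing as soon as the answer is known."""


class HeadScanner(HTMLParser):
    """Records the first tag and any canonical link, then stops early."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.first_tag = None
        self.canonical_url = None

    def handle_starttag(self, tag, attrs):
        if self.first_tag is None:
            self.first_tag = tag
            if tag in FEED_TAGS:
                raise _Done  # a feed, not a story: nothing more to learn
        if tag == "link":
            attr = dict(attrs)
            rel = (attr.get("rel") or "").lower().split()
            if "canonical" in rel and attr.get("href"):
                self.canonical_url = attr["href"]
                raise _Done
        elif tag == "body":
            raise _Done  # past the head section

    def handle_endtag(self, tag):
        if tag == "head":
            raise _Done  # end of the head section


def scan_head(html, chunk_size=65536):
    """Return (first_tag, canonical_url), feeding the parser in chunks."""
    scanner = HeadScanner()
    try:
        for start in range(0, len(html), chunk_size):
            scanner.feed(html[start:start + chunk_size])
    except _Done:
        pass
    return scanner.first_tag, scanner.canonical_url
```

With this sketch, a multi-megabyte RSS document is rejected after its first start tag, and an HTML page is abandoned at `</head>`. A document with no feed tag, no head, and no body is still scanned to the end, which matches the degenerate case noted above.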

Timing for one trouble document (read from a WARC file) with the existing code:

2024-10-27 17:55:37,998 | INFO | parser | parsing 3358172997: 9211438 characters
2024-10-27 17:55:38,320 | ERROR | htmldate.utils | parsed tree length: 1, wrong data type or not valid HTML
2024-10-27 17:55:38,621 | ERROR | trafilatura.utils | parsed tree length: 1, wrong data type or not valid HTML
2024-10-27 17:55:38,645 | ERROR | trafilatura.core | empty HTML tree for URL http://mediacloud.org/need_canonical_url
2024-10-27 17:55:38,674 | WARNING | trafilatura.core | discarding data for url: http://mediacloud.org/need_canonical_url
2024-10-27 18:14:37,467 | INFO | readability.readability | ruthless removal did not work. 
2024-10-27 18:52:17,758 | INFO | indexer.storyapp | feed: 3358172997
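
For reference, the wall-clock time spent on this one document can be read off the first and last timestamps in the log above. The snippet below is only a convenience sketch for that arithmetic; `LOG_TS_FORMAT` and `elapsed_seconds` are hypothetical names, and the format string is inferred from the log lines shown here. It works out to just under 57 minutes.

```python
from datetime import datetime

# Timestamp layout inferred from the log lines above, e.g.
# "2024-10-27 17:55:37,998" (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# First ("parsing ...") and last ("feed: ...") lines of the run above.
old_run = elapsed_seconds("2024-10-27 17:55:37,998", "2024-10-27 18:52:17,758")
minutes, seconds = divmod(old_run, 60)
print(f"{int(minutes)} min {seconds:.2f} s")
```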

Here is the new code run on a whole batch of stories left in the queue (reading from the same WARC file):

(venv) pbudne@bernstein:~/story-indexer$ ./bin/run-parser.sh --log-level debug --test-file-prefix parser-hang --rabbitmq-url x 
log: -t run-parser.sh -p debug invoking indexer.workers.parser --log-level debug --test-file-prefix parser-hang --rabbitmq-url x
2024-10-27 19:50:17,317 | INFO | indexer.app | STATSD_URL not set
2024-10-27 19:50:17,318 | INFO | indexer.sentry | SENTRY_DSN not found. Not logging errors to Sentry
2024-10-27 19:50:17,387 | DEBUG | indexer.app | encoding: 14.5822 ms
2024-10-27 19:50:17,388 | INFO | parser | parsing 3358172997: 9211438 characters
2024-10-27 19:50:17,388 | DEBUG | parser | encoding utf-8, was None
2024-10-27 19:50:17,388 | DEBUG | parser | first tag rss
2024-10-27 19:50:17,388 | INFO | indexer.storyapp | feed: 3358172997
2024-10-27 19:50:17,441 | DEBUG | indexer.app | encoding: 15.3385 ms
2024-10-27 19:50:17,441 | INFO | parser | parsing 3357305750: 9216266 characters
2024-10-27 19:50:17,441 | DEBUG | parser | encoding utf-8, was None
2024-10-27 19:50:17,441 | DEBUG | parser | first tag rss
2024-10-27 19:50:17,441 | INFO | indexer.storyapp | feed: 3357305750
2024-10-27 19:50:17,490 | DEBUG | indexer.app | encoding: 15.205 ms
2024-10-27 19:50:17,490 | INFO | parser | parsing 3357270854: 9209002 characters
2024-10-27 19:50:17,490 | DEBUG | parser | encoding utf-8, was None
2024-10-27 19:50:17,491 | DEBUG | parser | first tag rss
2024-10-27 19:50:17,491 | INFO | indexer.storyapp | feed: 3357270854
2024-10-27 19:50:17,539 | DEBUG | indexer.app | encoding: 15.1761 ms
2024-10-27 19:50:17,539 | INFO | parser | parsing 3357170719: 9208027 characters
2024-10-27 19:50:17,539 | DEBUG | parser | encoding utf-8, was None
2024-10-27 19:50:17,541 | DEBUG | parser | first tag rss
2024-10-27 19:50:17,541 | INFO | indexer.storyapp | feed: 3357170719
2024-10-27 19:50:17,565 | DEBUG | indexer.app | encoding: 1.38639 ms
2024-10-27 19:50:17,565 | INFO | parser | parsing 3356828136: 7785501 characters
2024-10-27 19:50:17,565 | DEBUG | parser | encoding iso-8859-1, was None
2024-10-27 19:50:17,566 | DEBUG | parser | first tag rss
2024-10-27 19:50:17,566 | INFO | indexer.storyapp | feed: 3356828136
2024-10-27 19:50:17,615 | DEBUG | indexer.app | encoding: 17.4459 ms
2024-10-27 19:50:17,615 | INFO | parser | parsing 3356789661: 7233451 characters
2024-10-27 19:50:17,615 | DEBUG | parser | encoding utf-8, was None
2024-10-27 19:50:17,616 | DEBUG | parser | first tag rss
2024-10-27 19:50:17,616 | INFO | indexer.storyapp | feed: 3356789661
2024-10-27 19:50:17,661 | DEBUG | indexer.app | encoding: 13.8607 ms
2024-10-27 19:50:17,661 | INFO | parser | parsing 3356670306: 7233058 characters
2024-10-27 19:50:17,661 | DEBUG | parser | encoding utf-8, was None
2024-10-27 19:50:17,661 | DEBUG | parser | first tag rss
2024-10-27 19:50:17,661 | INFO | indexer.storyapp | feed: 3356670306
2024-10-27 19:50:17,661 | INFO | indexer.storyapp | processed 7 stories
2024-10-27 19:50:17,662 | DEBUG | indexer.app | main_loop: 344.585 ms
log: -t run-parser.sh -p debug indexer.workers.parser --log-level debug --test-file-prefix parser-hang --rabbitmq-url x exit status 0