mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Queue fetcher updates #266

Closed philbudne closed 8 months ago

philbudne commented 8 months ago
Take another page from scrapy scheduling internals

Override _on_input_message, which runs in the Pika thread when a new
message is received from RabbitMQ, and rather than just queuing the
message to the internal work queue (for consumption by worker
threads), decode it, and see when it could next be started (using the
Slot.issue_interval calculated from avg_seconds for request completion
to keep "next_issue" time).  If the delay is less than the fast delay
queue time (set with --busy-delay-minutes), use the Pika connection
"call_later" method to delay putting the message on the work queue
until it's ripe to be issued.  If the delay is longer than
busy-delay-minutes, requeue the message to the -fast queue.

This GREATLY reduces use of the -fast queue (lower CPU load) AND means
that requests can be started as soon as possible, without waiting for
the message to come around through RabbitMQ (better throughput).
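The decision described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Slot` here is a cut-down stand-in, and `FakeConnection`, `on_input_message`, `work_queue` and `requeued_fast` are hypothetical names. In the real fetcher the timer would come from the Pika connection's `call_later(delay, callback)`; the sketch records timers in a list so it runs without RabbitMQ. How a delay exactly equal to the busy delay is handled is an assumption (here it is requeued).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Slot:
    """Simplified per-site scheduling state (hypothetical stand-in)."""
    issue_interval: float      # seconds between request starts
    next_issue: float = 0.0    # earliest time the next request may start


@dataclass
class FakeConnection:
    """Stand-in for a Pika connection: records timers instead of running them."""
    timers: List[Tuple[float, Callable[[], None]]] = field(default_factory=list)

    def call_later(self, delay: float, callback: Callable[[], None]) -> None:
        self.timers.append((delay, callback))


work_queue: List[str] = []      # consumed by worker threads in the real program
requeued_fast: List[str] = []   # messages sent back to the -fast queue


def on_input_message(conn: FakeConnection, slot: Slot, message: str,
                     now: float, busy_delay_seconds: float) -> str:
    """Decide what to do with a newly received message."""
    delay = max(0.0, slot.next_issue - now)
    if delay < busy_delay_seconds:
        # Ripe soon enough: reserve the start time and hold the message locally.
        slot.next_issue = now + delay + slot.issue_interval
        conn.call_later(delay, lambda: work_queue.append(message))
        return "scheduled"
    # Too far in the future: let RabbitMQ hold it in the -fast queue.
    requeued_fast.append(message)
    return "requeued"


conn = FakeConnection()
slot = Slot(issue_interval=10.0)
results = [on_input_message(conn, slot, f"story-{i}", now=0.0,
                            busy_delay_seconds=60.0) for i in range(7)]
delays = [d for d, _ in conn.timers]
```

With a 10-second issue interval and a 60-second busy delay, the first six messages are held locally with increasing delays and only the seventh goes back to the -fast queue.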

Also: default worker count to the number of available CPU cores.
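
One way such a default could be computed is sketched below. The function name `default_worker_count` is hypothetical, and treating the process's CPU affinity set as "available" cores is an assumption rather than something stated in this PR.

```python
import os


def default_worker_count() -> int:
    """Return the number of CPU cores usable by this process (at least 1)."""
    try:
        # Linux: honors affinity masks (e.g. taskset).
        return len(os.sched_getaffinity(0)) or 1
    except AttributeError:
        # Platforms without sched_getaffinity (e.g. macOS, Windows).
        return os.cpu_count() or 1


workers = default_worker_count()
```
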
philbudne commented 8 months ago

Ah, I understand now.... Good catch!! I noted the "Any" issue in February as https://github.com/mediacloud/story-indexer/issues/233 which we put off as "long-term"

It looks like there are 29 uses of Story sub-object "getter" methods in "with" statements (plus a comment in indexer/workers/fetcher/rss-queuer.py that notes the problem!):

# mypy reval_type(rss) in "with s.rss_entry() as rss" gives Any!!

And the explicit hint does seem to solve the problem:

(venv) ***@***.***:~/story-indexer$ cat a.py
from indexer.story import BaseStory, RSSEntry

def foo(s: BaseStory) -> None:
    reveal_type(s.rss_entry())
    with s.rss_entry() as r:
        reveal_type(r)

    r2: RSSEntry
    with s.rss_entry() as r2:
        reveal_type(r2)

    r3 = s.rss_entry()
    reveal_type(r3)
    with r3:
        reveal_type(r3)

(venv) ***@***.***:~/story-indexer$ mypy a.py 
...
a.py:4: note: Revealed type is "indexer.story.RSSEntry"
a.py:6: note: Revealed type is "Any"
a.py:10: note: Revealed type is "indexer.story.RSSEntry"
a.py:13: note: Revealed type is "indexer.story.RSSEntry"
a.py:15: note: Revealed type is "indexer.story.RSSEntry"

I'd like to understand the failure before applying a work-around. Of the two, I'd prefer the assignment (since it avoids needing to explicitly type the variable), plus a comment like "mypy gets with ..... wrong", over a variable declaration that looks extraneous.

The problem is easy to reproduce in mypy with a simple class (pytype gets it right), so I've asked about it on a python/typing chat.
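
For reference, here is a small self-contained sketch of the two work-arounds. `Entry` and `Story` are hypothetical stand-ins, not the real indexer.story classes. The sketch assumes an `__enter__` declared to return `Any`, which is one known way for a checker to infer `Any` for the "as" target; whether that is what happens in story-indexer is not established here.

```python
from typing import Any, List


class Entry:
    """Hypothetical stand-in for a Story sub-object such as RSSEntry."""

    def __init__(self) -> None:
        self.link = ""

    # Assumption: __enter__ is declared to return Any.
    def __enter__(self) -> Any:
        return self

    def __exit__(self, *exc: object) -> None:
        return None


class Story:
    """Hypothetical stand-in for BaseStory."""

    def entry(self) -> Entry:
        return Entry()


def demo(s: Story) -> List[Entry]:
    # Plain form: the checker takes r's type from __enter__, here Any.
    with s.entry() as r:
        r.link = "a"

    # Work-around 1: declare the variable's type before the with statement.
    r2: Entry
    with s.entry() as r2:
        r2.link = "b"

    # Work-around 2: assign first, then enter; r3 keeps the type of entry().
    r3 = s.entry()
    with r3:
        r3.link = "c"

    return [r, r2, r3]


entries = demo(Story())
```

At runtime all three forms bind the same kind of object; the difference is only in what a static checker infers for the variable.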