mediacloud / story-indexer

The core pipeline used to ingest online news stories into the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Parser tuning (thread usage in parser language detection) #78

Closed: philbudne closed this issue 1 year ago

philbudne commented 1 year ago

As previously posted (on slack):

It looks like the (single) parser worker process on tarbell is using (at times) up to 1855% CPU, with the load average (unsurprisingly) between 16 and 19. [tarbell has 32 cores]

capture from "top" displaying threads:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2103397 root      20   0 2083264 640704  16336 R  91.0   0.3 510:51.96 python3 -mindexer.workers.parser
2103740 root      20   0 2083264 640704  16336 S  57.0   0.3 240:22.62 python3 -mindexer.workers.parser
2103746 root      20   0 2083264 640704  16336 S  57.0   0.3 240:06.89 python3 -mindexer.workers.parser
2103742 root      20   0 2083264 640704  16336 S  56.7   0.3 240:18.35 python3 -mindexer.workers.parser
2103753 root      20   0 2083264 640704  16336 S  56.7   0.3 238:56.63 python3 -mindexer.workers.parser
2103755 root      20   0 2083264 640704  16336 S  56.7   0.3 238:46.47 python3 -mindexer.workers.parser
2103757 root      20   0 2083264 640704  16336 S  56.7   0.3 237:56.81 python3 -mindexer.workers.parser
2103747 root      20   0 2083264 640704  16336 S  56.3   0.3 240:00.98 python3 -mindexer.workers.parser
2103750 root      20   0 2083264 640704  16336 S  56.3   0.3 239:34.36 python3 -mindexer.workers.parser
2103758 root      20   0 2083264 640704  16336 S  56.3   0.3 237:38.26 python3 -mindexer.workers.parser
2103745 root      20   0 2083264 640704  16336 S  56.0   0.3 240:12.45 python3 -mindexer.workers.parser
2103739 root      20   0 2083264 640704  16336 S  55.7   0.3 240:22.43 python3 -mindexer.workers.parser
2103743 root      20   0 2083264 640704  16336 S  55.7   0.3 240:19.94 python3 -mindexer.workers.parser
2103736 root      20   0 2083264 640704  16336 S  55.4   0.3 240:33.63 python3 -mindexer.workers.parser
2103737 root      20   0 2083264 640704  16336 S  55.4   0.3 240:21.42 python3 -mindexer.workers.parser
2103741 root      20   0 2083264 640704  16336 S  55.1   0.3 240:26.29 python3 -mindexer.workers.parser
2103744 root      20   0 2083264 640704  16336 S  54.8   0.3 240:14.61 python3 -mindexer.workers.parser
2103751 root      20   0 2083264 640704  16336 S  54.2   0.3 239:21.88 python3 -mindexer.workers.parser
2103752 root      20   0 2083264 640704  16336 S  54.2   0.3 239:23.63 python3 -mindexer.workers.parser
2103754 root      20   0 2083264 640704  16336 S  54.2   0.3 238:23.83 python3 -mindexer.workers.parser
2103738 root      20   0 2083264 640704  16336 S  52.3   0.3 240:23.10 python3 -mindexer.workers.parser
2103756 root      20   0 2083264 640704  16336 S  51.4   0.3 238:25.94 python3 -mindexer.workers.parser
2103759 root      20   0 2083264 640704  16336 S  50.8   0.3 237:48.34 python3 -mindexer.workers.parser
2103748 root      20   0 2083264 640704  16336 S  49.8   0.3 239:48.99 python3 -mindexer.workers.parser
2103749 root      20   0 2083264 640704  16336 S  48.0   0.3 239:35.14 python3 -mindexer.workers.parser

which shows 25 threads, presumably working on a single story?

I suspect this is trafilatura calling py3langid, which uses numpy (the dot operator?), which in turn uses OpenBLAS (an open implementation of the Basic Linear Algebra Subprograms), which is multi-threaded.
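One way to confirm the suspicion would be a small probe that reports which BLAS numpy is linked against and how large its thread pool is. A minimal sketch, assuming the third-party threadpoolctl package (not currently a project dependency) is installed:

```python
# Probe: which BLAS is numpy linked against, and with how many threads?
# Assumes the third-party "threadpoolctl" package is installed.
import numpy as np  # importing numpy loads the underlying BLAS library

from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # On a 32-core box with default settings, expect something like:
    #   openblas 32
    print(pool["internal_api"], pool["num_threads"])
```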

If this is the case, setting OPENBLAS_NUM_THREADS=n in the parser worker's environment would control how many threads are launched. So long as there are enough CPU cores available (for the parser and anything else running on the same server), this isn't necessarily a problem; limiting the number of threads should make each "parse" take longer, and lowering it to 1 would almost certainly leave cores idle.
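For concreteness, a minimal sketch of what the cap would look like. The value 4 below is only illustrative, and the variable must be set before numpy first loads OpenBLAS, so it belongs in the service environment or at the very top of the worker entry point:

```python
# Sketch: cap OpenBLAS threads before numpy is imported.
# Equivalent to launching the worker as:
#   OPENBLAS_NUM_THREADS=4 python3 -mindexer.workers.parser
import os

os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")  # illustrative value only

import numpy as np  # OpenBLAS thread pool is now capped at 4
```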

Our CPUs may have hardware thread (SMT/Hyper-Threading) support, which may be disabled by default, since hardware threads are implicated in one of MANY exploits that can leak data between processes. It might be worth investigating whether enabling SMT/Hyper-Threading, and disabling other kernel mitigations for data-leak exploits, might yield any benefits (since we're unlikely to be worried about data-leakage exploits).

Looking at ramos: it seems to have two CPU packages, each with 16 cores and two threads per core, AND it has 64 "processor" entries in /proc/cpuinfo, so SMT may be enabled.
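A quick sanity check (a sketch reading standard Linux sysfs/procfs paths, nothing project-specific):

```python
# Sketch: report SMT status and the logical CPU count on Linux.
from pathlib import Path

smt = Path("/sys/devices/system/cpu/smt/active")
if smt.exists():
    print("SMT active:", smt.read_text().strip() == "1")

# 2 packages x 16 cores x 2 threads/core should show 64 logical CPUs.
logical = sum(
    1
    for line in Path("/proc/cpuinfo").read_text().splitlines()
    if line.startswith("processor")
)
print("logical CPUs:", logical)
```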

rahulbot commented 1 year ago

@philbudne I think you mentioned in a meeting that this is not actually a problem, so we don't need to change this constant. Did I understand correctly, and if so, can we close this?

philbudne commented 1 year ago

Yes, I think it's fine as-is for the moment.

Only time will tell how best to balance things, especially if the queues are backed up, or we start to backfill historical data.