mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0
12 stars 5 forks source link

Discuss possible enhancements to mcmetadata.extract #61

Closed philbudne closed 1 year ago

philbudne commented 1 year ago

To paraphrase Arlo Guthrie's song about Thanksgiving Dinner, I thought I'd create one big issue, rather than several small ones!

Here are some enhancements to mcmetadata.extract that the story-indexer parser could make use of:

  1. A way to pass in language, or indicate language detection shouldn't be run.
  2. A way to get timings out for report to graphite/grafana.

For the first, if there are other fields that we might possibly want to make processing for optional, a way to do this would be to pass in a optional parameter of typing Mapping[str,Any] of pre-determined values (rather than end up with a looooong list of optional initial values).

For the second, a way to do this would be to have an optional parameter timing_stats (of type Mapping[str,float]?) that defaults to the global stats variable.

Both of these would be backwards compatible enhancements to the API, and should result in a new minor version release.

rahulbot commented 1 year ago

@philbudne - can you describe your proposed solution to (2) on your list above in more detail? I didn't understand what you meant by:

For the second, a way to do this would be to have an optional parameter timing_stats (of type Mapping[str,float]?) that defaults to the global stats variable.

philbudne commented 1 year ago

@rahulbot wrote:

@philbudne - can you describe your proposed solution to (2) on your list above in more detail? I didn't understand what you meant by:

For the second, a way to do this would be to have an optional parameter timing_stats (of type Mapping[str,float]?) that defaults to the global stats variable.

Long version:

Right now the timing information is summed into the global "stats" dict across all calls to "extract".

I would like to be able to have the story-indexer "parser" report the timing for each step of each story as "timings" (right now the aggregate data is available various calculated measures like mean, median, min, max, stddev, but could be reported in a histogram instead), so it's possible to have grafana panels that show where time is being eaten. This is of even greater interest, since I did system profiling yesterday that seems to show that OpenBLAS (used by NumPy, used by pylang3) may be racking up run-time in spinlocks (thread code that atomically test/sets, and yields the CPU if the lock wasn't acquired).

My suggestion is to allow the user to supply a dict, or Counter Mapping type to sum into instead of the global stats variable, by adding an optional argument to "extract" that defaults to the global stats variable.

Alternatives:

Add an Optional parameter for a Mapping (dict or Counter) that defaults to None, and is incremented (or directly set) in addition to summing to the global stats, if provided. But would require more code (either an "if" + another line of code for each timing reported, or a local function/closure to do the incrementing).

The desired effect COULD be had without adding the parameter by:

  1. adding a call in mcmetadata to reset stats to zeros, or
  2. zeroing stats in the caller.

Teeeeeeny nit: For the first timing, t1 is not needed, the line at https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/__init__.py#L54 could use t0 (or at least t1 could be set from t0 instead of calling time.time() again).

Even teeeenier: Since Python 3.5 time.monotonic() is available: https://docs.python.org/3/library/time.html#time.monotonic

It's supposed to return actual elapsed seconds (typically system uptime, but officially the epoch is not defined), not the local system time returned by time.time(), which can jump backwards if it was set incorrectly to a future date, and then is fixed. time.time() can also tick faster or slower due to slewing by the adjtime(x) system call(s), as used by NTP time-keeping daemons like ntpd or chronyd.