mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

historical ingest: use language from CSV files? #222

Closed philbudne closed 9 months ago

philbudne commented 10 months ago

At one point I thought that language determination was a CPU sink because of high load averages (*). @thepsalmist re-created the CSV files to include a language column, @rahulbot added an override argument to mcmetadata.extract that prevents regeneration of the fields it is given, and I made hist-queuer.py save the CSV language column (when available) into the ContentMetadata language and full_language fields. BUT the parser does not YET pass the data through to mcmetadata.extract.

Should it?

A possible negative is that the CSV language column may always be a "short" language, like "eng", as opposed to a country-qualified one like us-eng. It also turns out that language detection wasn't a time sink.

A possible implementation could be to pass {k:v for k,v in content_metadata.as_dict().items() if v is not None} as the override argument (which would mean the parser would always honor old data, unless we added a way to say which fields to keep or ignore).
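To make the trade-off concrete, here is a rough sketch of that dictionary comprehension, plus one possible way to say which fields to keep. ContentMetadataSketch is a hypothetical stand-in for the real ContentMetadata class, build_overrides and its keep parameter are invented for illustration, and the field values are made up. The call into mcmetadata.extract is deliberately left out, since the exact keyword it expects is not confirmed here.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ContentMetadataSketch:
    """Hypothetical stand-in for story-indexer's ContentMetadata."""

    language: Optional[str] = None
    full_language: Optional[str] = None
    article_title: Optional[str] = None

    def as_dict(self) -> Dict[str, Any]:
        return asdict(self)


def build_overrides(
    cm: ContentMetadataSketch, keep: Optional[Iterable[str]] = None
) -> Dict[str, Any]:
    """Collect non-None fields to hand to the extractor as overrides.

    keep=None honors every field already set (the behavior described above);
    otherwise only the named fields are passed through.
    """
    keep_set = None if keep is None else set(keep)
    return {
        k: v
        for k, v in cm.as_dict().items()
        if v is not None and (keep_set is None or k in keep_set)
    }


# Illustrative values only: a story whose language came from the CSV column.
cm = ContentMetadataSketch(language="eng", full_language="eng")

# Honor all old data: the CSV language would suppress fresh detection.
all_overrides = build_overrides(cm)

# Ignore the CSV language column: nothing is overridden, so it is re-detected.
no_language = build_overrides(cm, keep=["article_title"])
```

With keep=None this is exactly the "always honor old data" behavior; passing an explicit list is one way the parser could ignore the CSV column without changing hist-queuer.py.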

I'm tempted to ignore the CSV column!

(*) It turned out to be idle worker threads spinning while looking for work, and was cured by setting OPENBLAS_NUM_THREADS=1 in the environment, which disables threading in the openblas "dot" operator (used by numpy, which is used by the language detection code).
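For anyone who would rather apply the same cure from Python than from the shell or container environment, a minimal sketch follows. The helper name limit_openblas_threads is invented here, and the sketch assumes the variable is set before numpy (and therefore OpenBLAS) is first imported in the process, since the library reads it at load time.

```python
import os


def limit_openblas_threads(count: int = 1) -> None:
    """Set OPENBLAS_NUM_THREADS for this process.

    Call this before the first `import numpy` anywhere in the process;
    setting it after OpenBLAS has loaded is assumed to have no effect.
    """
    os.environ["OPENBLAS_NUM_THREADS"] = str(count)


limit_openblas_threads(1)
# Any import of numpy, or of the language detection code that uses it,
# would go after this point.
```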

rahulbot commented 10 months ago

The grants we've put in argue that, in the years since we first built the legacy tool, the underlying technologies for metadata extraction (including language detection) have improved. That rationale implies that we should ignore CSV metadata where it is prudent to do so, in order to gain the claimed improvement in results.