historical ingest: use language from CSV files?

At one point I thought that language determination was a CPU sink because of high load averages(*). @thepsalmist re-created the CSV files to include a language column, @rahulbot added an override argument to mcmetadata.extract so that callers can supply fields that should not be regenerated, and I made hist-queuer.py save the CSV language column (when available) into the ContentMetadata language and full_language fields. BUT the parser does not YET pass that data through to mcmetadata.extract.

Should it?

There are two possible negatives: the CSV language column seems to always be a "short" language, like "eng", as opposed to a country-qualified one like us-eng, and it turns out language detection wasn't a time sink after all.

A possible implementation could be to pass {k:v for k,v in content_metadata.as_dict().items() if v is not None} as the override argument (which would mean the parser would always honor old data, unless we added a way to say which fields to keep or ignore).
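A minimal sketch of that idea, including one possible way to say which fields to ignore. ContentMetadataSketch is a hypothetical stand-in for the real ContentMetadata class (only as_dict() is assumed to behave the same way), and build_overrides and its ignore parameter are names invented here for illustration. The sketch stops at building the dict; how exactly it would be handed to mcmetadata.extract is left out.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ContentMetadataSketch:
    """Hypothetical stand-in for ContentMetadata; only a few fields shown."""

    language: Optional[str] = None
    full_language: Optional[str] = None
    article_title: Optional[str] = None

    def as_dict(self) -> Dict[str, Any]:
        return asdict(self)


def build_overrides(cm: ContentMetadataSketch, ignore: Iterable[str] = ()) -> Dict[str, Any]:
    """Collect every field that already has a value, minus any ignored names.

    The result is what would be passed as the override argument, so that
    fields already present are not regenerated.
    """
    skip = set(ignore)
    return {k: v for k, v in cm.as_dict().items() if v is not None and k not in skip}


# Metadata as hist-queuer.py might have filled it in from the CSV language column.
cm = ContentMetadataSketch(language="eng", full_language="eng")

# Honor all old data: only the fields that are set end up in the dict.
overrides = build_overrides(cm)

# Ignore the CSV language, so language detection would run as usual.
no_lang = build_overrides(cm, ignore=("language", "full_language"))
```

With an ignore list like this, dropping the CSV language would be a matter of which field names the parser passes in, rather than a change to hist-queuer.py.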

I'm tempted to ignore the CSV column!

(*) It turned out to be idle worker threads spinning while looking for work. It was cured by setting OPENBLAS_NUM_THREADS=1 in the environment, which disables threading in the openblas "dot" operator (used by numpy, which is used by the language detection code).
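For reference, a sketch of applying the same setting from inside a Python process instead of the shell environment. The helper name limit_openblas_threads is made up here. The variable has to be set before numpy (and so OpenBLAS) is first loaded, which is why the numpy import is only shown as a comment after the call.

```python
import os


def limit_openblas_threads(n: int = 1) -> None:
    """Set OPENBLAS_NUM_THREADS for this process and its children."""
    os.environ["OPENBLAS_NUM_THREADS"] = str(n)


limit_openblas_threads()

# Import numpy, and anything that uses it, only after this point:
# import numpy
```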

mediacloud/story-indexer #222