mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

support passing in a fallback publication date #63

Closed rahulbot closed 1 year ago

rahulbot commented 1 year ago

Some of our sources of data include a machine-readable publication date, for instance from an RSS feed. We've found this date to be highly unreliable in the past, but it would be a useful fallback to use in case a date can't otherwise be found in the text.

We should support passing a default publication date into the extract method, to be used if a date can't be found. This could be solved with the same solution that #61 proposes, with one wrinkle: for language the passed-in value would be an override, whereas for publication_date it would be a fallback... a potentially confusing inconsistency.

Related: It would be a useful side-project to gather some data to re-assess the match between the publication dates supplied in RSS feeds, the publication dates guessed by this library, and the publication dates parsed out by a person. That would help us re-assess and support this policy of trusting the guessed date over the RSS date.

philbudne commented 1 year ago

Question:

Could defaulting of the publication_date be done just as well in the story-indexer "importer"?

Regarding #61: I agree, there are two distinct semantics:

  1. Values to use if analysis fails
  2. Values to use INSTEAD of analysis

Both could be implemented as dictionaries, and I think it's a bad idea to have a single dictionary where different members have different semantics!!

rahulbot commented 1 year ago

I dove in and addressed this by adding in overrides and defaults as two separate new params to extract. See #64
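
For readers following the thread, here is a minimal sketch of the two semantics discussed above, kept as two separate dictionaries. It is not the library's actual code: the function name `resolve_fields`, the `analyzers` mapping, and the exact parameter shapes are assumptions made for illustration, and the real signature of extract in #64 may differ.

```python
import datetime as dt
from typing import Any, Callable, Dict, Optional

# Hypothetical: each analyzer guesses one field and returns None on failure.
Analyzer = Callable[[], Optional[Any]]


def resolve_fields(
    analyzers: Dict[str, Analyzer],
    overrides: Optional[Dict[str, Any]] = None,
    defaults: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Combine guessed values with caller-supplied overrides and defaults.

    overrides: values used INSTEAD of analysis (the analyzer is never run).
    defaults:  values used only if analysis fails (the analyzer returns None).
    """
    overrides = overrides or {}
    defaults = defaults or {}
    results: Dict[str, Any] = {}
    for field, analyze in analyzers.items():
        if field in overrides:
            # Override semantics: skip the analysis entirely.
            results[field] = overrides[field]
            continue
        value = analyze()
        if value is None:
            # Fallback semantics: only applies when nothing was found.
            value = defaults.get(field)
        results[field] = value
    return results


# Example: the RSS date is only a fallback, while the language is forced.
rss_date = dt.datetime(2023, 5, 1, 12, 0)
story = resolve_fields(
    analyzers={
        "language": lambda: "en",          # pretend the guesser said "en"
        "publication_date": lambda: None,  # pretend no date was found in the text
    },
    overrides={"language": "es"},
    defaults={"publication_date": rss_date},
)
```

In this sketch `story["language"]` comes from the override and `story["publication_date"]` from the default. Had the date guesser returned a value, that guess would have won over the RSS date, which matches the policy of trusting the guessed date over the RSS date.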