Closed by filmenczer 5 years ago
For most fields, merging the results of the two parsers works.
@shaochengcheng, would you please check that all of the above is okay and will not cause problems downstream? In particular, if we set date_published to the empty string, will that cause problems (namely when ranking by recent)?
The plan for merging the two parsers' results is to first use Mercury, and if any fields are missing, try to fill them with newspaper3k.
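A minimal sketch of that merge order, not the actual Hoaxy code. It assumes both parser outputs have already been mapped to plain dicts with the same field names (newspaper3k uses different attribute names natively), and the names `FIELDS`, `is_missing`, and `merge_parsed` are hypothetical:

```python
from typing import Any, Dict

# Field names taken from this thread; the list itself is an assumption.
FIELDS = ("title", "author", "dek", "date_published", "content")


def is_missing(value: Any) -> bool:
    """Treat None and blank strings as 'not found'."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def merge_parsed(mercury: Dict[str, Any], newspaper: Dict[str, Any]) -> Dict[str, Any]:
    """Prefer Mercury's value; fall back to newspaper3k only for missing fields."""
    merged = {}
    for field in FIELDS:
        primary = mercury.get(field)
        fallback = newspaper.get(field)
        if not is_missing(primary):
            merged[field] = primary
        elif not is_missing(fallback):
            merged[field] = fallback
        else:
            # Neither parser found the field; None is one possible placeholder.
            merged[field] = None
    return merged
```

Keeping Mercury as the primary source means newspaper3k never overwrites a value Mercury already found; it only fills gaps.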
@ZacMonroe, the workflow sounds good to me. date_published could be empty for a parsed article in the database. However, when indexing or serving API output, we should merge it with the date_captured column. Our current code already takes care of this, e.g., in the Lucene index, mostly by using the SQL function coalesce(a.date_published, a.date_captured) AS pd, so you are OK to do it.
If I understand @shaochengcheng's feedback (thank you!), we need to check that date_published and date_captured are merged when the API returns recent results (sorted by date). Then we can proceed.
However, I believe the SQL function coalesce only falls back to the next argument when the previous one is NULL, and an empty string is not NULL. @shaochengcheng, does this mean that if date_published cannot be found, it should be set to NULL instead of the empty string?
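The difference can be checked with a small experiment. The sketch below uses an in-memory SQLite database purely for illustration (the production database may behave differently, for example by rejecting an empty string in a timestamp column); the table layout and dates are made up:

```python
import sqlite3

# Illustrative table only; not the real Hoaxy schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    "id INTEGER PRIMARY KEY, date_published TEXT, date_captured TEXT)"
)
conn.executemany(
    "INSERT INTO article (date_published, date_captured) VALUES (?, ?)",
    [
        ("2018-05-01", "2018-05-03"),  # published date found
        (None, "2018-05-04"),          # not found, stored as NULL
        ("", "2018-05-05"),            # not found, stored as empty string
    ],
)
rows = conn.execute(
    "SELECT COALESCE(a.date_published, a.date_captured) AS pd "
    "FROM article AS a ORDER BY a.id"
).fetchall()
pd_values = [row[0] for row in rows]
print(pd_values)  # ['2018-05-01', '2018-05-04', '']
```

Only the NULL row falls back to date_captured; the empty-string row keeps the empty value, which would then be used when sorting by date.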
And what about the other two fields (author and dek)? Should they be set to NULL or the empty string?
UPDATE from @shaochengcheng: fields that are not found (date_published, dek, author) should be set to NULL, not the empty string. For title, we should check not only that it exists but also that it is not empty. Ignore in both cases (title missing or title empty).
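One way to express these rules, as a sketch under assumptions: `prepare_article` and `OPTIONAL_FIELDS` are hypothetical names, the parser output is assumed to be a plain dict, and "ignore" is read here as skipping the record entirely by returning None.

```python
from typing import Any, Dict, Optional

# Fields that become NULL (None in Python) when the parser did not find them.
OPTIONAL_FIELDS = ("date_published", "dek", "author")


def _blank_to_none(value: Any) -> Any:
    """Map blank strings to None so the database stores NULL."""
    if isinstance(value, str) and not value.strip():
        return None
    return value


def prepare_article(parsed: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a cleaned record, or None if the title is missing or empty."""
    title = parsed.get("title")
    if not isinstance(title, str) or not title.strip():
        return None  # assumption: "ignore" means do not store this result
    record = dict(parsed)
    for field in OPTIONAL_FIELDS:
        record[field] = _blank_to_none(parsed.get(field))
    return record
```

Most Python database drivers and ORMs write None as SQL NULL, so coalesce(a.date_published, a.date_captured) can then fall back as intended.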
I changed the code to use the Mercury parser first and to default missing fields to NULL.
Test (and also analyze the Hoaxy backend log) to measure how often each parser fails or succeeds at getting all required fields. Note that if the content is indeed empty, that should not be interpreted as a parser error. We may need to revise the pipeline based on these results, e.g., switch the order, use both and merge fields, or add another parser.
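A rough sketch of how such a measurement could be tallied. The thread does not define which fields count as required, so `REQUIRED_FIELDS` below is an assumption, as are the function name and the input shape (pairs of parser name and parsed dict). Content is deliberately left out so that a genuinely empty article is not counted as a parser failure:

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple

# Assumed set of required fields; "content" is excluded on purpose.
REQUIRED_FIELDS = ("title", "date_published", "author", "dek")


def tally_parser_outcomes(results: Iterable[Tuple[str, Dict[str, Any]]]) -> Counter:
    """Count complete/incomplete parses and missing fields per parser."""
    counts: Counter = Counter()
    for parser_name, parsed in results:
        missing = [f for f in REQUIRED_FIELDS if parsed.get(f) in (None, "")]
        outcome = "complete" if not missing else "incomplete"
        counts[(parser_name, outcome)] += 1
        for field in missing:
            counts[(parser_name, "missing:" + field)] += 1
    return counts
```

Comparing the per-field counts for the two parsers would show whether swapping the order or adding another parser is worth it.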