osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

Test and revise article parsing pipeline #36

Closed (filmenczer closed this issue 5 years ago)

filmenczer commented 5 years ago

Test the two parsers (and also analyze the Hoaxy backend log) to measure how often each one succeeds or fails at extracting all required fields. Note that if the content is genuinely empty, that should not be counted as a parser error. We may need to revise the pipeline based on these results, e.g., switch the order, use both parsers and merge their fields, or add another parser...

ZacMilano commented 5 years ago

For most fields, a merged strategy of parsing works.

@shaochengcheng, would you please check that all of the above is okay and will not cause problems downstream? In particular, if we set date_published to the empty string, will that cause problems (namely when ranking by recent)?

ZacMilano commented 5 years ago

The plan for merging the two parsers' results is to first use Mercury, and if any fields are missing, try to fill them with newspaper3k.
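As a rough illustration of that plan, here is a minimal Python sketch. It assumes each parser's output has already been converted to a plain dict; the field list and the names `merge_parsed` and `is_missing` are hypothetical and are not taken from the hoaxy-backend code or from either parser's API.

```python
from typing import Any, Dict, Iterable

# Assumed field list, based on the fields named in this thread.
REQUIRED_FIELDS = ("title", "content", "date_published", "author", "dek")


def is_missing(value: Any) -> bool:
    """Treat None and blank strings as 'field not found'."""
    return value is None or (isinstance(value, str) and not value.strip())


def merge_parsed(
    primary: Dict[str, Any],
    fallback: Dict[str, Any],
    fields: Iterable[str] = REQUIRED_FIELDS,
) -> Dict[str, Any]:
    """Prefer the primary parser's value; use the fallback only for gaps."""
    merged = {}
    for field in fields:
        value = primary.get(field)
        if is_missing(value):
            # The primary parser (Mercury in the plan) had nothing usable,
            # so take whatever the fallback parser (newspaper3k) returned.
            value = fallback.get(field)
        merged[field] = value
    return merged
```

Calling `merge_parsed(mercury_result, newspaper_result)` would keep every field Mercury found and consult the second dict only for the gaps.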

shaochengcheng commented 5 years ago

@ZacMonroe, the workflow sounds good to me. date_published can be empty for a parsed article in the database. However, when indexing or serving API results, we should merge it with the date_captured column. Our current code already takes care of this (e.g., for the Lucene index), mostly by using the SQL function coalesce(a.date_published, a.date_captured) AS pd, so you are fine to proceed.

filmenczer commented 5 years ago

If I understand @shaochengcheng's feedback (thank you!), we need to check that date_published and date_captured are merged when the API returns recent results (sorted by date). Then we can proceed.

However, I believe the SQL function coalesce falls back to the next argument only when the preceding one is NULL. @shaochengcheng, does this mean that if date_published cannot be found, it should be set to NULL instead of the empty string?
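The difference between NULL and the empty string can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module purely as a stand-in database; the table name `article`, its text-typed date columns, and the function name `publication_dates` are assumptions made for this example and do not reflect the real Hoaxy schema.

```python
import sqlite3


def publication_dates(rows):
    """Return (id, pd) pairs, where pd = coalesce(date_published, date_captured)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified table; dates are stored as text for the demo.
    conn.execute(
        "CREATE TABLE article ("
        "id INTEGER PRIMARY KEY, date_published TEXT, date_captured TEXT)"
    )
    conn.executemany(
        "INSERT INTO article (id, date_published, date_captured) VALUES (?, ?, ?)",
        rows,
    )
    result = conn.execute(
        "SELECT a.id, coalesce(a.date_published, a.date_captured) AS pd "
        "FROM article AS a ORDER BY a.id"
    ).fetchall()
    conn.close()
    return result


rows = [
    (1, "2019-01-02", "2019-01-05"),  # published date present
    (2, None, "2019-01-06"),          # NULL: coalesce falls back to date_captured
    (3, "", "2019-01-07"),            # empty string: coalesce does NOT fall back
]
print(publication_dates(rows))
```

Only the row with NULL picks up date_captured; the row with the empty string keeps the empty value as its pd, which is the situation the question above is worried about.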

And what about the other two fields (author and dek), should they be set to NULL or empty string?

UPDATE from @shaochengcheng: fields that are not found (date_published, dek, author) should be set to NULL, not the empty string. For title, we should check not only that it exists but also that it is not empty. Ignore the article in both cases (missing or empty title).
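A minimal sketch of that rule in Python, assuming the parsed article is a plain dict and that "ignore" means skipping the article entirely. The name `normalize_article` is hypothetical and is not part of the hoaxy-backend code.

```python
from typing import Any, Dict, Optional

# Fields that may legitimately be absent, as listed in the update above.
OPTIONAL_FIELDS = ("date_published", "dek", "author")


def normalize_article(parsed: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a cleaned copy of the article, or None if it should be ignored."""
    title = parsed.get("title")
    # The title must both exist and be non-empty; otherwise skip the article
    # (assumed reading of "ignore").
    if not isinstance(title, str) or not title.strip():
        return None

    cleaned = dict(parsed)
    for field in OPTIONAL_FIELDS:
        value = cleaned.get(field)
        # Map "not found" (absent, None, or blank string) to None, which a
        # database driver would normally store as NULL.
        if value is None or (isinstance(value, str) and not value.strip()):
            cleaned[field] = None
    return cleaned
```

Storing None rather than "" is what lets an expression such as coalesce(a.date_published, a.date_captured) fall back to date_captured.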

chathuriw commented 5 years ago

I changed the code to use the Mercury parser first and to set the default for missing fields to NULL.