sandeep-sandhu / NewsLookout

The NewsLookout web scraping application with NLP and data pre-processing
GNU General Public License v3.0

Fix logic to extract agency/source/authors for plugin: mod_en_in_inexp_business #8

Open sandeep-sandhu opened 3 years ago

sandeep-sandhu commented 3 years ago

For this plugin, extractAuthors(), the logic that extracts the agency/source/authors of a news article, does not consistently capture this information from the HTML content. For example, in the record below the source is missing from the sourceName field even though it is present in the extracted text body: "sourceName": [""], "pubdate": "2021-07-18", "text": "By PTI\nNEW DELHI:
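
One possible workaround, sketched here only as an illustration, is to fall back to the first line of the extracted text when sourceName comes back empty. The helper name `fallback_source_from_text` and the byline pattern are assumptions made for this sketch; they are not part of NewsLookout or of the plugin's actual extractAuthors() logic.

```python
import re

# Hypothetical helper, not part of NewsLookout: recover a byline such as
# "By PTI" from the first line of the extracted text body.
# Assumption: the byline occupies the whole first line and starts with "By ".
BYLINE_PATTERN = re.compile(r"^\s*By\s+(?P<source>[^\n]+?)\s*$", re.IGNORECASE)


def fallback_source_from_text(source_names, text):
    """Return the non-empty source names, or a byline parsed from the text.

    source_names: list of strings as found in the "sourceName" field.
    text: the extracted article body (with real newlines, i.e. after JSON decoding).
    """
    # Keep whatever the normal extraction found, ignoring blank entries.
    cleaned = [name.strip() for name in source_names if name and name.strip()]
    if cleaned:
        return cleaned

    # Otherwise inspect only the first line of the body for a "By ..." byline.
    first_line = text.split("\n", 1)[0] if text else ""
    match = BYLINE_PATTERN.match(first_line)
    if match:
        return [match.group("source")]

    # Nothing recoverable: return an empty list rather than [""].
    return []


if __name__ == "__main__":
    print(fallback_source_from_text([""], "By PTI\nNEW DELHI: example body"))
```

This heuristic is deliberately narrow: a first line such as "By the way, ..." would also match, so a real fix would probably need to validate the captured value or, preferably, correct the HTML selectors used by the plugin.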