mark-williamson / LatNews-Scraper


Text analysis to-do #3

Open mark-williamson opened 6 years ago

mark-williamson commented 6 years ago

Most of this makes more sense to do in a separate file rather than at source:

  1. Remove line breaks and whitespace from text using NLTK (e.g. '\n', '\t')
  2. Identify countries from either title or in-text
  3. Filter out certain articles?
    • e.g. based on titles like 'Tracking Trends', 'Leader', 'Quotes of the Week'
      • ... idk
  4. Some stories begin with 'COUNTRY NAME |' and this could be removed before analysis
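
A minimal sketch of items 1, 2 and 4, assuming plain Python strings for title and body. It uses the standard `re` module rather than NLTK so it runs without extra downloads. `COUNTRIES` is a hypothetical sample list, and the function names are made up for illustration, not taken from the scraper:

```python
import re

# Hypothetical sample of country names; the real list would be longer.
COUNTRIES = ["Argentina", "Brazil", "Chile", "Mexico", "Venezuela"]

# Matches a leading all-caps label followed by a pipe, e.g. "BRAZIL | ".
PREFIX_RE = re.compile(r"^\s*([A-Z][A-Z .&'-]*?)\s*\|\s*")


def normalize_whitespace(text):
    """Collapse runs of whitespace (including \\n and \\t) into single spaces."""
    return " ".join(text.split())


def strip_country_prefix(text):
    """Remove a leading 'COUNTRY NAME |' label, if present."""
    return PREFIX_RE.sub("", text, count=1)


def find_countries(title, body):
    """Return the COUNTRIES entries mentioned in the title or body."""
    haystack = f"{title} {body}".lower()
    return [
        c for c in COUNTRIES
        if re.search(rf"\b{re.escape(c.lower())}\b", haystack)
    ]
```

Running `find_countries` before `strip_country_prefix` keeps the country label available for matching; after that the prefix can be dropped from the text used for analysis.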
mark-williamson commented 6 years ago

Re: 3. ... there are some data-based articles that have no real content, with titles like "Exchange rates" or "inflation in the region", that can probably be dropped
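
A rough sketch of the title filter, using only the titles mentioned so far in this thread. Exact, case-insensitive matching is an assumption (a prefix or substring match may turn out to fit better), and `DROP_TITLES` and `should_drop` are hypothetical names:

```python
# Titles named in this thread, stored lowercased for comparison.
DROP_TITLES = {
    "tracking trends",
    "leader",
    "quotes of the week",
    "exchange rates",
    "inflation in the region",
}


def should_drop(title):
    """Return True if the cleaned-up title exactly matches a known drop title."""
    # Normalize whitespace and case first, since some titles contain \n or \t.
    key = " ".join(title.split()).lower()
    return key in DROP_TITLES
```

Exact matching avoids dropping a real story whose title merely starts with "Leader".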

mark-williamson commented 6 years ago

Re: 1. ... This includes both body and title text. Some titles include whitespace markers as well.

mark-williamson commented 5 years ago
  5. Take a look at very short stories and consider removing them. In 2003 there are some brief snippets of one or two sentences that may not warrant a full article.
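
One possible way to flag these short snippets, sketched with a naive regex sentence count. The cutoff of two sentences is an assumption taken from the "one or two sentences" wording, and the function names are hypothetical. The split will overcount text with abbreviations such as "U.S. officials", so a proper tokenizer like NLTK's `sent_tokenize` may be a better fit in practice:

```python
import re


def count_sentences(text):
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def is_short(text, max_sentences=2):
    """Return True if the story has at most max_sentences sentences."""
    return count_sentences(text) <= max_sentences
```

Flagging rather than deleting leaves room to review the short stories before deciding whether to remove them.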
mark-williamson commented 5 years ago

5a. Some additional points from pre-2003 stories: