After a great description of what NER does and the algorithm you designed to use it, the article ultimately explains that you used a "direct hit" method to calculate the counts:
    We ended up just implementing the "direct hit" part of the algorithm, under the assumption that this would underestimate the true state reference numbers but not bias any particular states, or bordering states vs the local state within a corpus. This means the counts we used to produce the maps are only location names that contain the state name or abbreviation.
A reader may wonder, given this, if the same counts could have been achieved simply by searching each ad for terms that occur in the hard-coded list you have in your script, thereby bypassing NER entirely.
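For concreteness, a few lines like the following would reproduce direct-hit-style counts by plain string matching, with no NER step involved at all. (This is only a sketch; the list, names, and sample ad here are hypothetical, not taken from your script.)

    import re
    from collections import Counter

    # Hypothetical hit list, one entry per state, mirroring the kind of
    # hard-coded list the script contains (illustrative entries only).
    STATE_TERMS = {
        "Virginia": ["Virginia", "Va."],
        "Maryland": ["Maryland", "Md."],
        "Kentucky": ["Kentucky", "Ky."],
    }

    def direct_hit_counts(ads):
        """Count ads that mention each state by plain substring search."""
        counts = Counter()
        for ad in ads:
            for state, terms in STATE_TERMS.items():
                # \b keeps a term like "Va." from matching mid-word
                if any(re.search(r"\b" + re.escape(t), ad) for t in terms):
                    counts[state] += 1
        return counts

    print(direct_hit_counts(["Ran away from the subscriber in Fairfax County, Va."]))
    # Counter({'Virginia': 1})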
As a technical issue, it might be good to clarify this for the reader wondering whether NER was essential to the task you ultimately performed: making the GF maps. Pointing this out would be helpful if only because it shows some of the difficulties of using NER for a problem like this. That is, you may have hit not only a limitation of time, but also a partial limitation of NER itself for this kind of historical work.
Of course, at the same time you could point out why getting the full algorithm to work would still be preferable to the "direct hits" alone: it would capture county and locality names that you couldn't have predicted would be in the corpus. For example, if you had just searched for state names, you wouldn't have found Mexico. And while you could simply add Mexico to the hit list, what if you were dealing with a corpus containing place names you hadn't thought of at all? You could keep making the hit list longer and longer, but at some point a program like NER becomes more efficient. Pointing out the potential of this method is thus the flip side of noting its limits.
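By way of contrast, here is a sketch of what the full route buys you, with spaCy standing in purely as an example of an NER tool (I am not assuming it is the one the article used). The place names come from the model rather than from a list, so a name like Mexico surfaces without anyone having thought to add it:

    import spacy
    from collections import Counter

    # Assumes the small English model is installed:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def ner_place_counts(ads):
        """Count every place-like entity the model finds, including
        names no hand-built hit list would have anticipated."""
        counts = Counter()
        for doc in nlp.pipe(ads):
            for ent in doc.ents:
                if ent.label_ in ("GPE", "LOC"):  # geopolitical entities, locations
                    counts[ent.text] += 1
        return counts

The trade-off, of course, is that the model's judgments replace the list's, which is exactly where the limitations of NER on historical text, noted above, come back into play.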