ricedh / drafts

Working drafts of articles for final project
http://ricedh.github.io

embedded GFT maps in NER #27

Closed KaitlynSisk closed 10 years ago

KaitlynSisk commented 10 years ago

Dr. McDaniel, I just wanted to make sure that the embedded maps are working on the site. I put text underneath each map indicating which state it shows, but I don't know if it looks okay because it doesn't show up on GitHub.

wcaleb commented 10 years ago

I've updated the live website so that you can see that they are working. Two changes I made:

@KaitlynSisk Speaking of captions, it may be best to make the captions slightly more descriptive; the caption and map key title are somewhat ambiguous---would something like "Number of mentions of states in Texas ad corpus, 1835-1860" be more accurate?

KaitlynSisk commented 10 years ago

Yeah, definitely! I was having a hard time coming up with the wording, but that sounds great!

KaitlynSisk commented 10 years ago

For the map key title, I want to keep it short. Otherwise, it stretches out the map key and hides part of the map. Do you have any suggestions for that?

wcaleb commented 10 years ago

Great, can you change the captions accordingly?

Looking at the source spreadsheets, I'm now a little bit confused about what the maps are showing.

For example, is the Texas map showing:

Clarification on this will help me suggest caption titles. Looking at the spreadsheet for Texas, I see now that taking out mentions of Texas would increase the percentages in almost every state by a factor of three. Why is that?

KaitlynSisk commented 10 years ago

When we were still figuring out how to approach the GFT, I wanted to make sure we had the option to use the following data sets per corpus:

Sorry it's confusing. The third bullet was originally an idea to show how connected the corpora were, but we decided not to do that because our hypotheses focused more on the fact that Texas ads were self-referential. The data used to shade the states is the second bullet point: the number of times a state was mentioned as a percentage of the total counted mentions in the corpus.

Would it be beneficial to delete the data from the third bullet point?

wcaleb commented 10 years ago

I think you could clarify things just by renaming the columns to "Count / Total * 100" and "Count / Total without TX/AR/MS * 100."

Now that I better understand what the figures mean, I'm a little bit confused, too, by the low count of Mexico in the Texas corpus. If I just search for the word Mexico in the combined Texas ads, I get 19 results, rather than 3.

cc/ @ClareCat21 @br0nstein

KaitlynSisk commented 10 years ago

Yeah, I found that odd as well. I'll change the columns, though.

wcaleb commented 10 years ago

@br0nstein Did your implementation of the algorithm include "Mexico" in the list of known "states" when checking for hits?

wcaleb commented 10 years ago

Sorry to pile on with the questions, but in light of the above, this current line in the NER page might confuse or mislead readers:

> Going back to our original problem of counting the number of ads that referenced each state for each of our Arkansas, Mississippi, and Texas corpora, we can use the above technique to help compute that.

It sounds like you are not counting ads that reference each state; you are just counting number of mentions of each state in the entire corpus. Unless I'm still misunderstanding ...

KaitlynSisk commented 10 years ago

You're correct. We're counting mentions (and I think there's a limit on the number of mentions per ad that we counted? @br0nstein ), not the number of ads that mention other states.

Also, the columns have been changed. If they're not showing up now, they should soon. I know GFT takes a few minutes to update.

br0nstein commented 10 years ago

@wcaleb

> Now that I better understand what the figures mean, I'm a little bit confused, too, by the low count of Mexico in the Texas corpus. If I just search for the word Mexico in the combined Texas ads, I get 19 results, rather than 3.

> @br0nstein Did your implementation of the algorithm include "Mexico" in the list of known "states" when checking for hits?

Yes, it did. But it only searches the named entity recognition results, rather than the full text of each ad, which, in retrospect, makes little sense. The problem was that named entity recognition only detected 3 occurrences of "Mexico" as location entities in the first place. I will revise the algorithm to take this into account. Should we also re-run the state counts script to get more accurate numbers?

> It sounds like you are not counting ads that reference each state; you are just counting number of mentions of each state in the entire corpus. Unless I'm still misunderstanding ...

I should have clarified how sets work. When the script counts locations, for each ad it makes a set holding all states found. Since sets don't allow duplicates, that means the set reflects which states the ad references, not the number of times it references each. So, the overall state count numbers reflect the number of ads that mention (one or more times) each state in the U.S. and Mexico for each corpus.
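For illustration, here is a minimal sketch of that set-based counting logic; this is not the actual script, and the function name and data below are made up, with the per-ad lists standing in for whatever the NER step returns:

```python
from collections import Counter

def count_ads_mentioning_states(per_ad_locations):
    """per_ad_locations: one list of detected place names per ad."""
    totals = Counter()
    for locations in per_ad_locations:
        # The set collapses duplicates, so each state is counted at most once per ad.
        for state in set(locations):
            totals[state] += 1
    return totals

# "Texas" appears twice in the first ad but is only counted once.
print(count_ads_mentioning_states([["Texas", "Texas", "Mexico"], ["Texas"]]))
# Counter({'Texas': 2, 'Mexico': 1})
```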

Does that clear up the confusion, or did I perhaps make an error in my logic?

wcaleb commented 10 years ago

I think this discussion has raised two issues:

  1. You need to go back over the essay very carefully and make sure that all of your descriptions of what the "counts" represent are consistent. At the very least that means making clearer throughout what the "counts" are---number of ads in a given corpus mentioning a given state. This may also require you to do a little bit more description of your algorithm, translating it into prose. Not everyone will be familiar with set notation, so you may not be able to assume that this way of outlining the algorithm will be clear ("Notice how first …"). Remember our audience: historians who may not know much about programming but are starting to learn.
  2. The separate issue is whether to redo the counts, which is a more complicated question.

As I understand it, if you were to change the algorithm as suggested in #28, then you would essentially be bypassing NER altogether and just looking for a list of state words and abbreviations in each ad. It may be telling---and important to note---that this method would turn up such different results in the case of Mexico, and perhaps in the case of other states as well.
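As a rough sketch of what that direct-search alternative might look like (the hit list and function below are hypothetical, not the project's actual code):

```python
import re

# Illustrative hit list only; the project's actual list of state names and
# abbreviations would be longer.
HIT_LIST = {
    "Texas": ["texas", "tex."],
    "Mexico": ["mexico"],
    "Louisiana": ["louisiana", "la."],
}

def ads_with_direct_hits(ad_texts):
    """Count the number of ads whose full text contains any variant of each place."""
    counts = {place: 0 for place in HIT_LIST}
    for text in ad_texts:
        lowered = text.lower()
        for place, variants in HIT_LIST.items():
            # Still count each ad at most once per place, so the numbers stay
            # comparable with the NER-based counts.
            if any(re.search(r"\b" + re.escape(v), lowered) for v in variants):
                counts[place] += 1
    return counts
```

A search like this would catch the plain-text "Mexico" mentions that NER misses, though it only finds places that are already on the hit list.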

I think rather than trying to get a perfect count, the best thing to do would be to revise the "Conclusions" section in light of the limitations of NER that you've uncovered (which I explained in more detail in #25). The fact that NER only turns up three Mexico mentions, when a straightforward search of the word "Mexico" would turn up nearly three times that many, sounds a cautionary note about NER that readers need to be aware of. (On the other hand, the fact that you turned up a California mention using NER sounds a cautionary note about using just a straightforward search, since there may be locations in a corpus you didn't think to add to your search list, especially if you were to continue this research by moving down to the county or municipality level.)

Supposing you have enough time to do it, I wonder if the best way to conclude this discussion would be to include two Google Fusion maps of the Texas ads, one of which would show the results of your NER count, and one of which would show the results of just searching the corpus for each state, country, or abbreviation in the direct hit list you are using. That would spotlight (through the dramatically different shading of Mexico) some of the limitations of NER that you've outlined.

You could then go ahead and include the Mississippi and Arkansas maps as an indication of why continuing to refine this research would ultimately be useful, because it would allow you to test the hypothesis about how self-referential Texas ads are. But the more I think about all these comments, the more I think that conclusion section should be revised so that it doesn't sound like you are presenting the maps as proofs of the hypothesis, but more as proofs-of-concept for what a working NER/location-tagging script could ultimately allow you to do.

It's okay to have a page that ends talking more about limitations of a method, and in some ways that would be preferable to making the findings sound more definite than they are. If you don't have time to make two Texas Google Fusion tables, you can, I think, at least revise the wording around the maps to reflect these discussions.

cc/ @KaitlynSisk

KaitlynSisk commented 10 years ago

I can definitely make another Texas GFT tonight. It doesn't take that much time once we collect the data. Would we want to count the total number of references? For example, if Texas is mentioned more than once in an advertisement, would we count it once? That's what I did originally, but the limitation was that the count for the main state was much higher than for all of the other states.

wcaleb commented 10 years ago

Yes, I would still count only one state mention per ad as part of the total count. That's what Aaron's algorithm was doing. The difference between the two maps I'm describing above will be whether you are counting only mentions caught by NER or raw mentions of state names and abbreviations.


br0nstein commented 10 years ago

@wcaleb

> You need to go back over the essay very carefully and make sure that all of your descriptions of what the "counts" represent are consistent. At the very least that means making clearer throughout what the "counts" are---number of ads in a given corpus mentioning a given state. This may also require you to do a little bit more description of your algorithm, translating it into prose. Not everyone will be familiar with set notation, so you may not be able to assume that this way of outlining the algorithm will be clear ("Notice how first …"). Remember our audience: historians who may not know much about programming but are starting to learn.

You're right, I'll go back and make it more clear for various audiences.

> Supposing you have enough time to do it, I wonder if the best way to conclude this discussion would be to include two Google Fusion maps of the Texas ads, one of which would show the results of your NER count, and one of which would show the results of just searching the corpus for each state, country, or abbreviation in the direct hit list you are using. That would spotlight (through the dramatically different shading of Mexico) some of the limitations of NER that you've outlined.

Very good idea. This brings it back to the original purpose of the page we talked about, highlighting uses of NER and various approaches to solving the locations question. On it.

I think I will also revise the post to explain why we are looking at this question in the context of runaway slave ads, and how counting the number of mentions is just one approach, one that assumes we are interested in all "name drops" of a place and takes each mention entirely out of its context in the original ad (for example, as a projected destination, starting point, etc.).

> Did your implementation of the algorithm include "Mexico" in the list of known "states" when checking for hits?

I went back over the code after discovering that even counting direct hits on the full text with my script was turning up a lower than expected number for Mexico. It turns out there was a bug in the text preprocessing function I was reusing from the locations_tag script. It was replacing "co." with "county." to aid NER detection, but that was also matching "Mexico." and corrupting the word to "Mexicounty." Fixed now, and I re-uploaded the revised numbers.
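For reference, one way that kind of fix might look, as a hypothetical sketch rather than the actual preprocessing code from locations_tag, is to anchor the abbreviation to a word boundary so it no longer matches the tail of "Mexico.":

```python
import re

def expand_county_abbreviation(text):
    # \b requires "co." to start at a word boundary, so the "co." inside
    # "Mexico." (preceded by letters) is left alone, while a standalone
    # "co." is still expanded to aid NER detection.
    return re.sub(r"\bco\.", "county.", text)

print(expand_county_abbreviation("Harrison co., near Mexico."))
# Harrison county., near Mexico.
```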