Closed Yvonne-Han closed 4 years ago
No need to upload the NER files to GitHub: they are version-controlled elsewhere by someone else, and we can easily use "a little wget magic" (see here) to get the necessary files.
The NER code should be okay now so I'm closing this one.
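For anyone who would rather stay in Python than shell out to wget, here is a minimal sketch of the same download-and-unpack idea. The function name `fetch_ner` and the destination folder are hypothetical, and the URL is deliberately left as a parameter: use whatever link the wget instructions point at.

```python
import urllib.request
import zipfile
from pathlib import Path
from urllib.parse import unquote, urlparse


def fetch_ner(url, dest_dir):
    """Download a zip archive (if not already present) and unpack it into dest_dir.

    Hypothetical helper: neither the name nor the layout comes from the repo.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last path component of the URL.
    archive = dest / Path(unquote(urlparse(url).path)).name

    # Skip the download when the archive is already there.
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))

    # Unpack next to the archive so stanford_path can point into dest_dir.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest
```

Usage would be something like `fetch_ner(ner_url, "stanford-ner")`, after which `stanford_path` in findner.py can point inside the unpacked folder.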
@iangow I've been playing with your NER tagging code these days, trying to get it running so that Maliha can use the output data for her thesis. As you can see from the commit history, I made some changes to your code, but I am not 100% sure about all of them (I guess some of your code is project-specific to your non-answer paper?). So here's a summary of what I've changed, with boxes for you to tick (important things in bold):
- findner.py: Checked and updated the stanford_path to the latest one. This should be okay, and I assume I don't need to upload the Stanford NER files to GitHub?
- create_table.py: Changed the target schema from bs_linguistics to se_features and changed the comment used when creating the table accordingly. This seems to be okay too.
- run_NER.py: Changed the function getFileNames so that it gets files from streetevents.calls (later changed to streetevents.speaker_data) instead of streetevents.qa_pairs. I am not too sure about this step, as I don't know what your original code was dealing with. Should I replace it with speaker_data?
- create_table.py: Fixed a small typo in your SQL query when doing string replacements. I am quite confident about this.
- findner.py: Changed python2 to python3. Commented out your Anastasia & Ian Gow test cases. Shouldn't be a big deal.
- processFileNER.py: Added some print calls so I can understand what is going on. Can be reverted later.
- run_NER.py & processFileNER.py: Fixed the code for converting timestamps. The original code might be outdated. This should be okay as long as I am converting things in the right way.
- processFileNER.py: I deleted an SQL query from your code. I don't really understand what it does, so it is quite possible that I am making a mistake here. With the SQL included, the code seems to produce a lot of NA entries, and the number of NA entries grows the longer the code runs. So I commented it out, and it seems to work better (not sure). I've added a comment in the code too so you can cross-reference it.
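To make the getFileNames question concrete, here is a rough sketch of how the source table could be made a parameter, so that switching between qa_pairs, calls and speaker_data is a one-word change. This is not the code in run_NER.py. The column name `file_name`, the output table `se_features.ner_tags`, and the idea of skipping files that are already processed are all assumptions for illustration.

```python
import re

# Source tables mentioned in this thread; anything else is rejected.
ALLOWED_SOURCES = {
    "streetevents.qa_pairs",
    "streetevents.calls",
    "streetevents.speaker_data",
}

# Matches a plain schema.table identifier.
_QUALIFIED_NAME = re.compile(r"[A-Za-z_]\w*\.[A-Za-z_]\w*")


def build_file_names_query(source_table="streetevents.speaker_data",
                           done_table="se_features.ner_tags"):
    """Return SQL listing files in source_table that are not yet in done_table.

    Assumptions: both tables have a file_name column, and done_table
    (a hypothetical name) holds the NER output.
    """
    if source_table not in ALLOWED_SOURCES:
        raise ValueError(f"unexpected source table: {source_table}")
    if not _QUALIFIED_NAME.fullmatch(done_table):
        raise ValueError(f"unexpected output table: {done_table}")

    # EXCEPT drops files that already have output and removes duplicates.
    return (
        f"SELECT file_name FROM {source_table} "
        f"EXCEPT SELECT file_name FROM {done_table}"
    )
```

If the original qa_pairs query restricted the files in some other way, that condition would need to be carried over too, which is exactly the part I am unsure about.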