Review NER code changes made by Yvonne

Yvonne-Han commented 4 years ago

@iangow I've been working with your NER tagging code over the past few days, trying to get it running so that Maliha can use the output data for her thesis. As you can see from the commit history, I made some changes to your code, but I am not 100% sure about all of them (I guess some of your code is project-specific to your non-answer paper?). Here's a summary of what I've changed, with boxes for you to tick (important things in bold):

[x] #1. findner.py: Checked and updated the stanford_path to the latest one. This should be okay, and I assume I don't need to upload the Stanford NER files to GitHub?
[x] #2. create_table.py: Changed the target schema from bs_linguistics to se_features and changed the comment when creating table accordingly. This seems to be okay too.
[x] #3. run_NER.py: Changed the function getFileNames so that it gets files from streetevents.calls (later changed to streetevents.speaker_data) instead of streetevents.qa_pairs. I am not too sure about this step (as I don't know what your original code is dealing with) - should I replace it with speaker_data?
[x] #4. create_table.py: Fixed a small typo in your SQL query when doing string replacements. I am quite confident about this.
[x] #5. findner.py: Changed python2 to python3. Commented out your Anastasia & Ian Gow test cases. Shouldn't be a big deal.
[x] #6. processFileNER.py: Added some print functions for me to understand what is going on. Can be reverted later.
[x] #7. & #8. run_NER.py & processFileNER.py: Fixed the code for converting timestamps. The original code might be outdated. This should be okay as long as I am converting things in the right way.

[x] #9. processFileNER.py: I removed an SQL statement from your code:

engine.execute("DELETE FROM %s.%s WHERE file_name ='%s'" % (ner_schema, ner_table, file_name))

I don't really understand what this statement is for, so I may well be making a mistake here. With the statement included, the code seems to produce a lot of NA entries, and the number of NAs grows the longer the code runs. So I decided to comment it out, and it seems to work better (not sure). I've added a comment here too so you can cross-reference it.
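
For reference, here is a minimal sketch of what a statement like this is commonly used for. It assumes (this is not confirmed anywhere in the thread) that the DELETE is meant to clear a file's earlier rows before they are re-inserted, so that rerunning the code does not duplicate data. The sketch uses the standard-library sqlite3 module rather than the project's engine, a hypothetical ner_tags table, and a bound parameter for the file name instead of '%s' string interpolation:

```python
import sqlite3


def replace_rows(conn, file_name, entities):
    """Delete any existing rows for file_name, then insert the new ones."""
    # `with conn` runs both statements in one transaction: if the INSERT
    # fails, the DELETE is rolled back and the old rows are kept.
    with conn:
        # The file name is passed as a bound parameter (?), not formatted
        # into the SQL string. Table names cannot be bound this way.
        conn.execute("DELETE FROM ner_tags WHERE file_name = ?", (file_name,))
        conn.executemany(
            "INSERT INTO ner_tags (file_name, entity) VALUES (?, ?)",
            [(file_name, entity) for entity in entities],
        )


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ner_tags (file_name TEXT, entity TEXT)")
replace_rows(conn, "call_1", ["Ian Gow", "Stanford"])
replace_rows(conn, "call_1", ["Ian Gow", "Stanford"])  # rerun for same file
count = conn.execute("SELECT COUNT(*) FROM ner_tags").fetchone()[0]
print(count)  # 2: the rerun replaced the rows instead of duplicating them
```

If the statement does serve this purpose, leaving it out would let rows for the same file accumulate across reruns, so it may be worth checking the table for duplicates.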

iangow commented 4 years ago

No need to upload the NER files to GitHub, as they are version-controlled elsewhere by someone else and we can easily use "a little wget magic" (see here) to get the necessary file.
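
The same download step could also be scripted in Python rather than with wget. This is only a rough sketch using the standard library: the function name fetch_and_unpack is hypothetical, it assumes the file is distributed as a zip archive, and the URL is left to the caller because none is given in this thread.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_unpack(url, dest_dir):
    """Download a zip archive from url and extract it into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "download.zip"
        # urlretrieve copies the response body to a local file.
        urllib.request.urlretrieve(url, str(archive))
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    # Return the top-level names that now exist in dest_dir.
    return sorted(p.name for p in dest.iterdir())
```

Calling fetch_and_unpack(url, "some/dir") with the archive's URL would leave the extracted files under that directory, where stanford_path could then point.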

Yvonne-Han commented 4 years ago

The NER code should be okay now so I'm closing this one.

iangow / se_features #19