Closed by tommasobattisti 2 years ago
Output file of run.py: results of the functions for character extraction and gender recognition.
Leave a comment in the form:
Book 1857_browne-grannys-wonderful-chair:
* “Majesty”: do not include
* “Court”: do not include
* “Royal Prince” and “Crown Prince”: do not include (exclude matches of /\w+\sprince(ss)?/ but include strings such as “Princess Jane” or “Prince Mark”)
* “Michaelmas”: the feast day of St. Michael --> do not include
* Only “Frostyface” is retrieved, while the full name is “Dame Frostyface”; the “Dame” before the name is actually useful for putting the string in the list of feminine names
* The same occurs for: “Greendalind” (Princess Greendalind); “Wantall” (Queen Wantall); “Winwealth” (King Winwealth); “Greensleeves” (Lady Greensleeves); “Wisewit” (Prince Wisewit).
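The rules in this comment could be sketched roughly as follows. This is only an illustration, not the code in run.py: `clean_candidate`, the title sets, and the stop-word set are hypothetical names introduced here, and the sketch assumes each candidate arrives as a plain string with its title still attached.

```python
import re

# Hypothetical lists, built only from the examples in this comment.
FEMININE_TITLES = {"dame", "princess", "queen", "lady"}
MASCULINE_TITLES = {"prince", "king"}
STOPWORDS = {"majesty", "court", "michaelmas"}

# "<word> Prince" / "<word> Princess" is a generic role, not a character name.
GENERIC_ROLE = re.compile(r"\w+\sprince(ss)?", flags=re.IGNORECASE)


def clean_candidate(candidate):
    """Return (name, gender_hint), or None if the candidate should be dropped."""
    tokens = candidate.split()
    if not tokens:
        return None
    first = tokens[0].lower()
    if len(tokens) == 1 and first in STOPWORDS:
        return None
    if GENERIC_ROLE.fullmatch(candidate):
        return None
    # Keep the title with the name: it doubles as a gender hint.
    if len(tokens) > 1 and first in FEMININE_TITLES:
        return (candidate, "F")
    if len(tokens) > 1 and first in MASCULINE_TITLES:
        return (candidate, "M")
    return (candidate, None)
```

With this sketch, "Royal Prince" and "Crown Prince" are dropped, while "Princess Jane" and "Dame Frostyface" are kept with a feminine hint. The pattern is matched case-insensitively here, which is an assumption about how the candidates are cased.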
SOLVED
TO BE COMPLETED
"Mock Turtle Soup" ---> just "Mock Turtle"
"Cheshire Cat" = "Cheshire Puss" --> unify!
"elsie lacie" comes from "Elsie, Lacie, and Tillie" --> do not include OR separate them (just 1 occurrence)
MISSED DETECTION: "Bill" (15+ occurrences)
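One possible way to apply the renaming and unification remarks above is a small alias table applied to the extracted counts. This is a sketch under assumptions: `ALIASES`, `DROP`, and `normalise_counts` are hypothetical names, and the output of the extraction step is assumed to be a mapping from name to occurrence count. It does not address the missed detection of "Bill", which has to be fixed in the extraction itself.

```python
from collections import Counter

# Hypothetical alias table: variant on the left, canonical name on the right.
ALIASES = {
    "Mock Turtle Soup": "Mock Turtle",
    "Cheshire Puss": "Cheshire Cat",
}

# Strings that are artefacts of enumerations rather than single characters.
DROP = {"elsie lacie"}


def normalise_counts(counts):
    """Merge aliases into their canonical name and drop known artefacts."""
    merged = Counter()
    for name, n in counts.items():
        if name.lower() in DROP:
            continue
        merged[ALIASES.get(name, name)] += n
    return merged
```

Dropping "elsie lacie" is the simpler of the two options named above; separating it into "Elsie" and "Lacie" would need the original sentence, not just the extracted string.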
Either there was some kind of error in pre-processing this text, or it was badly OCRed originally: words like "sothat", "hewas", "forit" result from whitespace being incorrectly removed.
Attention: MAKE SURE THE FILE YOU HAVE LOADED IN THE CORPUS YOU ARE USING IS THE CORRECT ONE, I.E. THE ONE BELOW: The Cuckoo Clock
I have just changed it in my corpus, so you probably need to change it too!
Do NOT include:
Other remarks:
Note: New output file to check results against
This is the output of run.py (character extraction + gender recognition) with a filtering threshold applied: the charEx function returns only words that occur exclusively in uppercase form and more than 2 times.
Note: New output file to check results against
This is the output of run.py (character extraction + gender recognition) with a filtering threshold applied: the charEx function returns only words that occur exclusively in uppercase form and more than once.
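A minimal sketch of such a threshold filter is shown below. It is not the actual charEx implementation: `filter_candidates` and its `threshold` parameter are hypothetical, and "uppercase only" is interpreted here as "starts with a capital letter and never appears in all-lowercase form in the same text", which is an assumption.

```python
import re
from collections import Counter


def filter_candidates(text, threshold=2):
    """Return {word: count} for capitalised words seen more than `threshold` times.

    A word is discarded if the same word also occurs in lowercase,
    which removes sentence-initial words such as "The".
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    lowercase_seen = {t for t in tokens if t.islower()}
    counts = Counter(t for t in tokens if t[0].isupper())
    return {
        word: n
        for word, n in counts.items()
        if n > threshold and word.lower() not in lowercase_seen
    }
```

Under this reading, `threshold=2` corresponds to "more than 2 times" and `threshold=1` to "more than once".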
Do not include:
Do not include:
Other operations:
Do not include:
Do not include:
Other:
Do not include:
Other remarks:
It could be useful to ignore words that end in “ish”, so as to avoid including “English”, “Scottish”, etc.
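A sketch of that suffix filter follows, with a hypothetical `keep` parameter for genuine names that happen to end in "ish". The function name and signature are assumptions, not part of run.py.

```python
def drop_ish_words(candidates, keep=()):
    """Remove candidates ending in "ish" unless explicitly listed in `keep`."""
    keep_lower = {k.lower() for k in keep}
    return [
        c for c in candidates
        if not c.lower().endswith("ish") or c.lower() in keep_lower
    ]
```

The allow-list matters because a plain suffix check would also discard a character name such as "Trish".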
Do not include:
Other remarks:
Do not include:
Other remarks:
Do not include:
Other remarks:
Do not include:
Other remarks:
Do not include:
Other remarks:
Do not include:
Other remarks:
Do not include:
Other remarks:
Do not include:
Do not include:
Other remarks:
This issue is meant to collect suggestions for words and regex patterns to consider when cleaning the results.