gobbykid / gobbykid-characters-extraction

Gobbykid project's characters extraction. The repository contains the Python files used to extract characters' names from 19th-century books.
https://the-gobbykid-project.gitbook.io/gobbykid-project/analytics/characters-extraction-and-analytics
MIT License

Enhance the check_names function because of "not a character" entities in the results #6

Closed tommasobattisti closed 2 years ago

tommasobattisti commented 2 years ago

This issue collects suggestions for words and regex patterns to consider when cleaning the results.

eliarizzetto commented 2 years ago

Output file of run.py: results of the functions for character extraction and gender recognition.

output_01-08_characters_and_gender.txt

tommasobattisti commented 2 years ago

Leave a comment in the form of:

tommasobattisti commented 2 years ago

Book 1857_browne-grannys-wonderful-chair:

tommasobattisti commented 2 years ago

Results at the current state of the branch (August 3, 2022):

output_01-08_characters_and_gender.txt

tommasobattisti commented 2 years ago

Book 1857_browne-grannys-wonderful-chair:

* “Majesty”: do not include

* “Court”: do not include

* “Royal Prince” and “Crown Prince”: do not include (exclude strings matching /\w+\sprince(ss)?/ but keep strings such as “Princess Jane” or “Prince Mark”)

* “Michaelmas”: the feast day of Saint Michael --> do not include

* Only “Frostyface” is retrieved, while the full name is “Dame Frostyface”; the “Dame” before the name is useful for putting the string in the list of feminine names

* The same occurs for: “Greendalind” (Princess Greendalind); “Wantall” (Queen Wantall); “Winwealth” (King Winwealth); “Greensleeves” (Lady Greensleeves); “Wisewit” (Prince Wisewit).

SOLVED
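The remarks above could be turned into filtering rules along the following lines. This is only a minimal sketch: `NOT_CHARACTERS`, `GENERIC_PRINCE`, `is_character`, `gender_from_title` and the two title sets are hypothetical names chosen for illustration, and they are not taken from the actual check_names function.

```python
import re

# Hypothetical stop list built from the remarks in this comment.
NOT_CHARACTERS = {"majesty", "court", "michaelmas"}

# "<word> prince" or "<word> princess", e.g. "Royal Prince", "Crown Prince".
# "Princess Jane" does not match because the title comes first.
GENERIC_PRINCE = re.compile(r"\w+\sprince(ss)?", re.IGNORECASE)

# Hypothetical title sets, limited to the titles mentioned in this comment.
FEMININE_TITLES = {"dame", "princess", "queen", "lady"}
MASCULINE_TITLES = {"king", "prince"}


def is_character(candidate):
    """Return False for strings that the remarks ask to exclude."""
    name = candidate.strip()
    if name.lower() in NOT_CHARACTERS:
        return False
    if GENERIC_PRINCE.fullmatch(name):
        return False
    return True


def gender_from_title(candidate):
    """Guess a gender from a leading title; None when no title is present."""
    parts = candidate.split()
    if len(parts) < 2:
        return None
    title = parts[0].lower()
    if title in FEMININE_TITLES:
        return "female"
    if title in MASCULINE_TITLES:
        return "male"
    return None


candidates = [
    "Majesty", "Court", "Royal Prince", "Crown Prince",
    "Michaelmas", "Princess Jane", "Prince Mark", "Dame Frostyface",
]
kept = [c for c in candidates if is_character(c)]
genders = {c: gender_from_title(c) for c in kept}
```

Keeping the title attached to the name (“Dame Frostyface” rather than “Frostyface”) is what makes the second function usable: without the title it has nothing to go on and returns `None`.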

tommasobattisti commented 2 years ago

Book: 1888_wilde-the-happy-prince-and-other-tales.txt:

tommasobattisti commented 2 years ago

Book: 1872_de-la-ramee-a-dog-of-flanders.txt:

tommasobattisti commented 2 years ago

Book 1865_carroll-alices-adventures-in-wonderland.txt:

eliarizzetto commented 2 years ago

CONTINUES: Book 1865_carroll-alices-adventures-in-wonderland.txt:


Do NOT include:

Other operations:

eliarizzetto commented 2 years ago

1877_molesworth-the-cuckoo-clock

Either there's been some kind of error in pre-processing this text or it has been badly OCRed originally: words like "sothat", "hewas", "forit" originate from whitespaces being incorrectly eliminated.

Warning: MAKE SURE THE FILE YOU HAVE LOADED IN THE CORPUS YOU ARE USING IS THE CORRECT ONE, I.E. THE ONE BELOW: The Cuckoo Clock

I have just changed it in my corpus, so you probably need to change it too!

eliarizzetto commented 2 years ago

1877_sewell-black-beauty

Do NOT include:

Other remarks:

eliarizzetto commented 2 years ago

Note: New output file to check results against

This is the output of run.py (character extraction + gender recognition) with a filtering threshold applied: the charEx function returns only words that occur exclusively in uppercase form and more than 2 times.

output_run_threshold2_090822.txt
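The threshold described above could look roughly like the sketch below. It is not the charEx implementation: `frequent_capitalized_words` is a hypothetical name, the tokenisation is deliberately naive, and "uppercase only" is assumed here to mean that the word never appears with a lowercase initial anywhere in the text.

```python
import re
from collections import Counter


def frequent_capitalized_words(text, threshold=2):
    """Keep words that only occur capitalized and occur more than `threshold` times."""
    # Naive tokenisation: runs of ASCII letters only (an assumption of this sketch).
    tokens = re.findall(r"[A-Za-z]+", text)
    capitalized = Counter(t for t in tokens if t[0].isupper())
    lowercase_forms = {t.lower() for t in tokens if t[0].islower()}
    return sorted(
        word
        for word, count in capitalized.items()
        if count > threshold and word.lower() not in lowercase_forms
    )


sample = (
    "Alice saw the Rabbit. Alice ran. The Rabbit hid. "
    "Alice and the Rabbit met. The end."
)
result = frequent_capitalized_words(sample, threshold=2)
```

In the sample, "The" is discarded twice over: it also occurs in lowercase, and it appears capitalized only twice.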

tommasobattisti commented 2 years ago

Note: New output file to check results against

This is the output of run.py (character extraction + gender recognition) with a filtering threshold applied: the charEx function returns only words that occur exclusively in uppercase form and more than once.


Updated version:

output_run_threshold2_160822.txt

tommasobattisti commented 2 years ago

Related to the last version of the results: (Part I)

Book: 1872_de-la-ramee-a-dog-of-flanders:

Do not include:

Book: 1857_browne-grannys-wonderful-chair:


Book: 1877_molesworth-the-cuckoo-clock:

Do not include:

Other operations:

Book: 1902_potter-the-tale-of-peter-rabbit:

Book: 1888_wilde-the-happy-prince-and-other-tales:

Do not include:

Book: 1865_carroll-alices-adventures-in-wonderland

Do not include:

Other:


Book: 1877_sewell-black-beauty

Do not include:

Other remarks:


Other considerations not related to specific books:

It could be useful to discard words ending in “ish”, so as to avoid including “English”, “Scottish”, etc.
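A minimal sketch of this suffix rule, with `drop_ish_words` as a hypothetical helper name:

```python
def drop_ish_words(names):
    """Remove candidates whose last word ends in "ish" (case-insensitive)."""
    kept = []
    for name in names:
        words = name.split()
        # Empty strings are dropped as well, since they cannot be names.
        if words and not words[-1].lower().endswith("ish"):
            kept.append(name)
    return kept


filtered = drop_ish_words(["English", "Scottish", "Alice", "Tom Sawyer"])
```

Note that a rule this broad would also drop any genuine character name that happens to end in “ish”, so it may need an exception list.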

tommasobattisti commented 2 years ago

Related to the last version of the results: (Part II)

Book: 1899_nesbit-the-story-of-the-treasure-seekers

Do not include:

Other remarks:


Book: 1869_ewing-mrs-overtheways-remembrances

Do not include:

Other remarks:

Book: 1886_hodgson-burnett-little-lord-fauntleroy

Do not include:

Other remarks:


Book: 1871_macdonald-at-the-back-of-the-north-wind

Do not include:

Other remarks:

Book: 1894_kipling-the-jungle-book

Do not include:

Other remarks:


Not related to specific books:

tommasobattisti commented 2 years ago

Related to the last version of the results: (Part III)

Book: 1876_twain-the-adventures-of-tom-sawyer

Do not include:

Other remarks:

Book: 1883_stevenson-treasure-island

Do not include:

Other remarks:

Book: 1857_hughes-tom-browns-school-days

Do not include:


Further considerations:

tommasobattisti commented 2 years ago

Related to the last version of the results: (Part IV)

Book: 1869_dickens-david-copperfield

Do not include:

Other remarks:

tommasobattisti commented 2 years ago

Done:

tommasobattisti commented 2 years ago

Minor common problems:

1877_molesworth-the-cuckoo-clock

1865_carroll-alices-adventures-in-wonderland

1886_hodgson-burnett-little-lord-fauntleroy

Book: 1894_kipling-the-jungle-book

Book: 1869_dickens-david-copperfield

tommasobattisti commented 2 years ago

General improvements: