[hangman-dictionary] Document how to create a dictionary using Open Library

pangiann commented 2 years ago

Hangman description

Hangman is a paper and pencil guessing game for two or more players. One player thinks of a word, phrase or sentence and the others try to guess it by suggesting letters within a certain number of guesses.

In our case, the computer has to think of a word. Thus, we need to create a dictionary from which the computer will choose words at random. We will use Open Library to load words from a variety of books.

pangiann commented 2 years ago

Dictionary Properties

Every word has to be unique.
The dictionary has to include at least 20 candidate words.
All words need to have at least 6 characters.
At least 20% of the words must have 9 or more characters.

Word

What is a word? How do we describe a word? Well, a word is a sequence of letters from A to Z (a to z).

Reasons:

It's not reasonable to have a word that includes both letters and numbers. Who would guess pangiann13?
Hangman with special characters such as "-" or "_" doesn't make sense.

Thus, our design decision is to permit words to include only letters.
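The rule above can be sketched as a small predicate. This is only an illustration: the class and method names (WordRules, isValidWord) are hypothetical, and the minimum length of 6 is taken from the dictionary properties listed earlier.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not taken from the project.
final class WordRules {
    // Letters A-Z / a-z only, and at least 6 of them.
    private static final Pattern VALID_WORD = Pattern.compile("[a-zA-Z]{6,}");

    // Returns true only if the whole candidate matches the pattern.
    static boolean isValidWord(String candidate) {
        return candidate != null && VALID_WORD.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidWord("kingdom"));    // true
        System.out.println(isValidWord("pangiann13")); // false: contains digits
        System.out.println(isValidWord("well-known")); // false: contains '-'
        System.out.println(isValidWord("snow"));       // false: fewer than 6 letters
    }
}
```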

pangiann commented 2 years ago

Open Library

Open Library is an open, editable library catalog, building towards a web page for every book ever published. There, we can find lengthy descriptions for millions of books. For example, see this page: https://openlibrary.org/books/OL33890423M/Harry_Potter_and_the_Deathly_Hallows

From there we're going to load our words to build the dictionary.

pangiann commented 2 years ago

Open Library API

Open Library offers a suite of APIs to help developers get up and running with Open Library's data. This includes RESTful APIs, which make Open Library data available in JSON, YAML and RDF/XML formats.

In our case we want to work with the Books API, and more specifically the Works API, which returns a JSON document containing the description of the selected book.

pangiann commented 2 years ago

Download description of a book manually

Choose a book from the website (https://openlibrary.org).
Check that the description is sufficiently long.
Keep the Open Library ID of the book to use later in the request. For example, for the book "A Game of Thrones, Book One of A Song of Ice and Fire" the ID is OL31390631M.
Use an HTTP request to get the description: https://openlibrary.org/works/OL31390631M.json This request has the form /works/ID.
From the description we get the value field.

Using the dictionary properties as a reference, we keep only the allowed words from that value. The request above returns:

{
   "description": {
      "type": "/type/text",
      "value": "In A Game of Thrones, George R.R. Martin has created a genuine masterpiece, bringing together the best the genre has to offer. Mystery, intrigue, romance, and adventure fill the pages of the first volume in an epic series sure to delight fantansy fans everywhere.\r\n\r\nIn a land where summers can last decades and winters a lifetime, trouble is brewing. The cold is returning, and in the frozen wastes of the north of Winterfell, sinister and supernatural forces are massing beyond the kingdom's protective Wall. At the center of the conflict lie the Starks of Winterfell, a family as harsh and unyielding as the land they were born to. Sweeping from a land of brutal cold to a distant summertime kingdom of epicurean plenty, here is a tale of lords and ladies, soldiers and sorcerers, assassins and bastards, who come together in a time of grim omens. Amid plots and counterplots, tragedy and betrayal, victory and terror, the fate of the Starks, their allies, and their enemies hangs perilously in the balance, as each endeavors to win that deadliest of conflicts: the game of thrones.\r\n--back cover"
   },
   "identifiers": {
      "goodreads": ["55946549"],
      "wikidata": ["Q105357235"]
   },
   "title": "A Game of Thrones",
   "subtitle": "Book One of A Song of Ice and Fire",
   "publish_date": "1999?",
   "publishers": ["Spectra / Bantam Books"],
   "series": ["A Song of Ice and Fire, #1"],
   "covers": [10513947],
   "physical_format": "mass market paperback",
   "ocaid": "gameofthrones0001mart",
   "publish_places": ["New York"],
   "edition_name": "Bantam Paperback edition (14)",
   "pagination": "835p.",
   "isbn_13": ["9780553573404"],
   "languages": [{"key": "/languages/eng"}],
   "isbn_10": ["0553573403"],
   "copyright_date": "1997",
   "by_statement": "George R.R. Martin",
   "type": {"key": "/type/edition"},
   "key": "/books/OL31390631M",
   "number_of_pages": 854,
   "works": [{"key": "/works/OL257943W"}],
   "latest_revision": 5,
   "revision": 5,
   "created": {"type": "/type/datetime", "value": "2020-11-19T13:46:13.159633"},
   "last_modified": {"type": "/type/datetime", "value": "2021-04-17T08:21:34.660822"}
}

pangiann commented 2 years ago

UX design

dictionary = Dictionary(dictionary_id)

The Dictionary class builds a dictionary of words loaded from the descriptions of books in Open Library. The ID specifies the book from which this class will get the description.

In a nutshell, this class does the following:

Checks whether a dictionary with the same ID already exists
If yes:
- Finds the path to the file
- Loads the dictionary from the file
- Returns the newly created object for later use
If not:
- Gets the book's description using the unique ID
- Processes description to create a valid dictionary
- Saves it to a file
- Returns the dictionary
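The load-or-build flow above could look roughly like the sketch below. Everything specific in it is an assumption for illustration: the class name DictionaryCache, the file naming scheme (one text file per book ID, one word per line), and the Supplier that stands in for "get the description and process it".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

// Hypothetical sketch; class name, file layout, and the Supplier hook are assumptions.
final class DictionaryCache {
    private final Path directory;

    DictionaryCache(Path directory) {
        this.directory = directory;
    }

    // Assumed naming scheme: one file per book ID, one word per line.
    Path fileFor(String bookId) {
        return directory.resolve("hangman_" + bookId + ".txt");
    }

    // If a dictionary file for this ID exists, load it.
    // Otherwise build the dictionary, save it to the file, and return it.
    Set<String> loadOrBuild(String bookId, Supplier<Set<String>> builder) throws IOException {
        Path file = fileFor(bookId);
        if (Files.exists(file)) {
            return new TreeSet<>(Files.readAllLines(file));
        }
        Set<String> words = new TreeSet<>(builder.get());
        Files.createDirectories(directory);
        Files.write(file, words);
        return words;
    }
}
```

With this shape, the second call for the same ID never invokes the builder, so the HTTP request is made at most once per book.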

pangiann commented 2 years ago

get_book_description()

This request has the form GET /works/ID, as described above. In return, we get the following JSON structure:

 {
     "description":
     {
         "type": "",
         "value": ""
     },
     "identifiers":
     {
          ""
     }
 }

From the description we keep the value field.

We use Java's HttpClient (from the java.net.http module) for the HTTP GET request. https://github.com/pangiann/hangman/issues/2

Specifically, this function does the following:
1. Builds an HTTP client configured with: a. the preferred protocol version (HTTP_2) b. a policy to follow redirects
2. Builds an HTTP request which sets: a. the request's URI: /works/book_id.json b. the type of the request: GET c. the headers: Content-Type = application/json
3. Sends the request synchronously, with a BodyHandler that interprets the HTTP response body as a String.
4. Gets the HTTP response, which provides methods for accessing the status code and body.
5. If the status code is not 200, throws an exception.
6. Otherwise, converts the response string to a JSONObject.
7. Gets the value field from the description.
@return description A string which is a book's description.
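Steps 1 to 5 can be sketched with the standard java.net.http API as follows. The class and method names (BookDescriptionClient, buildClient, buildRequest, fetchBookJson) are hypothetical, and throwing IOException on a non-200 status is an assumption, since the exception type is not specified above. Steps 6 and 7 are left out because they depend on the JSON library in use.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of steps 1-5; class and method names are illustrative.
final class BookDescriptionClient {
    private static final String BASE_URL = "https://openlibrary.org";

    // Step 1: a client that prefers HTTP/2 and follows redirects.
    static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Step 2: GET /works/book_id.json with the Content-Type header set.
    static HttpRequest buildRequest(String bookId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/works/" + bookId + ".json"))
                .header("Content-Type", "application/json")
                .GET()
                .build();
    }

    // Steps 3-5: send synchronously, read the body as a String,
    // and reject any status other than 200 (exception type is an assumption).
    static String fetchBookJson(String bookId) throws IOException, InterruptedException {
        HttpResponse<String> response =
                buildClient().send(buildRequest(bookId), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status code: " + response.statusCode());
        }
        return response.body();
    }
}
```

Keeping the client and request builders separate from the send call makes the configuration checkable without network access.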

pangiann commented 2 years ago

Process Description

In this section we're going to describe how we will process the book's description to create a list of valid words (from dictionary properties) and build our dictionary.

Split the description into words. The challenge here is how to split them. We don't want our delimiter to be just the ' ' character. Why? For example, "The coats drink milk, right?" would produce ["The", "coats", "drink", "milk,", "right?"], and the last two words are obviously not valid. So we choose our delimiter to be everything except word characters, using the regular expression \W+, where \W = [^a-zA-Z0-9_]. Note that \W+ still leaves digits and "_" inside tokens, so tokens that are not purely letters must be rejected when adding words.
Create a dictionary Set. There, we'll add the valid words. We use a Set because we accept no duplicates.
Add the valid words into the Set.
- If the word has at least 6 characters and contains only letters, add it.
Validate the dictionary:
- If the dictionary has fewer than 20 words, throw an exception.
- If fewer than 20% of the words have at least 9 characters, throw an exception.
- Otherwise, the dictionary is valid.
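The processing steps above can be sketched as follows. The names (DescriptionProcessor, buildDictionary, validate) and the use of IllegalArgumentException are hypothetical. Converting words to upper case so that "Kingdom" and "kingdom" count as one word is also an assumption, since case handling is not specified here.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; class, method, and exception choices are assumptions.
final class DescriptionProcessor {
    static final int MIN_WORD_LENGTH = 6;
    static final int LONG_WORD_LENGTH = 9;
    static final int MIN_DICTIONARY_SIZE = 20;

    static Set<String> buildDictionary(String description) {
        Set<String> words = new LinkedHashSet<>(); // a Set drops duplicates
        // \W+ splits on runs of characters outside [a-zA-Z0-9_].
        for (String token : description.split("\\W+")) {
            // Letters only: tokens containing digits or '_' are rejected here.
            if (token.length() >= MIN_WORD_LENGTH && token.matches("[a-zA-Z]+")) {
                // Assumption: normalize case so duplicates differing only in case collapse.
                words.add(token.toUpperCase(Locale.ROOT));
            }
        }
        validate(words);
        return words;
    }

    static void validate(Set<String> words) {
        if (words.size() < MIN_DICTIONARY_SIZE) {
            throw new IllegalArgumentException(
                    "Dictionary needs at least " + MIN_DICTIONARY_SIZE + " words, got " + words.size());
        }
        long longWords = words.stream().filter(w -> w.length() >= LONG_WORD_LENGTH).count();
        // Integer form of longWords / size >= 0.20, avoiding floating point.
        if (longWords * 5 < words.size()) {
            throw new IllegalArgumentException(
                    "At least 20% of the words need " + LONG_WORD_LENGTH + " or more characters");
        }
    }
}
```

A short description such as "The coats drink milk, right?" yields no valid words at all, so buildDictionary rejects it in the size check.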
