Details

language_plane_finder.py takes six CLI arguments:

1. the path to a sqlite database
2. a table name in that database
3. a dimensionality count
4. a subsample size (optional; defaults to None)
5. a random seed (used by the subsampling process)
6. a part of speech ('n' for nouns, 'v' for verbs, 'a' for adjectives; optional, defaults to None)
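A minimal sketch of how these six arguments could be parsed with `argparse`. The flag names follow the ones proposed in the Sweep plan later in this thread; the types, the `required` settings, and the `build_parser` helper name are assumptions for illustration, not the committed script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for language_plane_finder.py (names and types assumed)."""
    parser = argparse.ArgumentParser(
        description="Fit planes through word triples and summarise distances to them."
    )
    parser.add_argument("--sqlite-database", required=True,
                        help="path to the sqlite database holding the embeddings")
    parser.add_argument("--table", required=True,
                        help="table name inside the database")
    parser.add_argument("--dimensionality", type=int, required=True,
                        help="number of dimensions of each embedding vector")
    parser.add_argument("--subsample", type=int, default=None,
                        help="optional number of words to sample")
    parser.add_argument("--random-seed", type=int, default=None,
                        help="seed used by the subsampling step")
    parser.add_argument("--part-of-speech", choices=["n", "v", "a"], default=None,
                        help="optional part of speech: n, v or a")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--sqlite-database", "glove.sqlite", "--table", "glove", "--dimensionality", "50"]
)
print(args.sqlite_database, args.table, args.dimensionality,
      args.subsample, args.random_seed, args.part_of_speech)
```

Note that `argparse` converts the dashes in flag names to underscores, so `--part-of-speech` is read back as `args.part_of_speech`.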

First it creates a wordnet_vocab.GloveLookup object using the first three CLI arguments.

Then it calls wordnet_vocab.yield_word_senses with the part of speech argument. This returns a generator; from it, the script creates a list of [subsample] elements (chosen randomly, seeded by [random seed]).
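The subsampling step might look roughly like the sketch below. The helper name `subsample_vocab` is hypothetical, and a plain iterator of strings stands in for the generator returned by `wordnet_vocab.yield_word_senses`, whose exact element type is not shown in this thread.

```python
import random
from typing import Iterable, List, Optional


def subsample_vocab(words: Iterable[str],
                    subsample: Optional[int],
                    seed: Optional[int]) -> List[str]:
    """Materialise a generator of words and optionally draw a seeded random sample."""
    vocab = list(words)  # a generator can only be consumed once, so store it
    if subsample is None or subsample >= len(vocab):
        return vocab  # nothing to subsample
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    return rng.sample(vocab, subsample)


# Stand-in for the real generator (assumption: it yields hashable, comparable items).
fake_senses = (w for w in ["cat", "dog", "eel", "fox", "gnu", "hen"])
print(subsample_vocab(fake_senses, subsample=3, seed=42))
```

Using a dedicated `random.Random(seed)` instance keeps the sample reproducible for a given seed and input order without reseeding the module-level generator.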

It then has three nested for loops iterating over that list.

i.e.

for word1 in vocab:
  for word2 in vocab:
    for word3 in vocab:
      ...  # keep only triples with word1 < word2 < word3, then build the plane

It makes sure that word1 < word2 < word3, so each unordered triple of words is processed exactly once. It uses the GloveLookup object to get the numpy array (a point in embedding space) for each of these words, and creates a plane.Plane object from point1, point2 and point3.
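As an aside, the ordering constraint can also be expressed with `itertools.combinations` over a sorted vocabulary, which avoids visiting the triples that the nested loops would immediately discard. This is an alternative sketch, not what the issue asks for literally; both helper names are hypothetical, and the vocabulary is assumed to contain no duplicates.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Triple = Tuple[str, str, str]


def ordered_triples(vocab: List[str]) -> Iterator[Triple]:
    """Yield every (word1, word2, word3) with word1 < word2 < word3."""
    return combinations(sorted(vocab), 3)


def ordered_triples_nested(vocab: List[str]) -> Iterator[Triple]:
    """The same triples, written as the three nested loops described above."""
    for word1 in vocab:
        for word2 in vocab:
            for word3 in vocab:
                if word1 < word2 < word3:
                    yield (word1, word2, word3)


vocab = ["dog", "cat", "eel", "ant"]
triples = list(ordered_triples(vocab))
print(len(triples))  # 4 triples from 4 words
print(sorted(ordered_triples_nested(vocab)) == triples)  # True
```

The nested version examines n³ candidate triples, while `combinations` only generates the n(n-1)(n-2)/6 that pass the filter.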

It iterates over the rest of the vocabulary (i.e. every word that isn't word1, word2 or word3), turns each word into a point and calls the plane.Plane object's distance_to_plane method on it. Store the resulting distances in a list or numpy array.
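To make the distance step concrete, here is one way a point-to-plane distance can be computed in an embedding space of any dimension: orthonormalise the two in-plane directions and measure what is left of the point after projecting onto them. The repository's `plane.Plane` is not shown in this thread, so `ThreePointPlane` below is a hypothetical stand-in that only borrows the `distance_to_plane` method name, and a plain dict stands in for `GloveLookup`.

```python
import numpy as np


class ThreePointPlane:
    """Hypothetical stand-in for plane.Plane: the 2-D plane through three points."""

    def __init__(self, p1, p2, p3):
        self.origin = np.asarray(p1, dtype=float)
        u = np.asarray(p2, dtype=float) - self.origin
        v = np.asarray(p3, dtype=float) - self.origin
        norm_u = np.linalg.norm(u)
        if norm_u == 0.0:
            raise ValueError("the first two points coincide")
        e1 = u / norm_u
        v_perp = v - np.dot(v, e1) * e1  # Gram-Schmidt step
        norm_v = np.linalg.norm(v_perp)
        if norm_v <= 1e-12 * max(np.linalg.norm(v), 1.0):
            raise ValueError("the three points are collinear")
        self.basis = np.stack([e1, v_perp / norm_v])  # shape (2, d), orthonormal rows

    def distance_to_plane(self, point) -> float:
        w = np.asarray(point, dtype=float) - self.origin
        residual = w - self.basis.T @ (self.basis @ w)  # remove the in-plane component
        return float(np.linalg.norm(residual))


def distances_to_other_words(plane, embeddings, triple):
    """Distances from every word outside the triple to the plane."""
    return np.array([plane.distance_to_plane(vec)
                     for word, vec in embeddings.items() if word not in triple])


# Toy 4-D embeddings (made-up values, only for illustration).
embeddings = {"a": [0, 0, 0, 0], "b": [1, 0, 0, 0], "c": [0, 1, 0, 0],
              "d": [2, 3, 3, 4], "e": [5, 5, 0, 0]}
triple = ("a", "b", "c")
plane = ThreePointPlane(*(embeddings[w] for w in triple))
print(distances_to_other_words(plane, embeddings, triple))
```

In this toy data the plane spans the first two axes, so word "d" sits at distance 5 (from its remaining components 3 and 4) and word "e" lies in the plane.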

Calculate the summary statistics for that list of distances (inside the triple for loop): minimum, 0.1th percentile, 1st percentile, 25th percentile, median, mean, 75th percentile, 99th percentile, 99.9th percentile, max, standard deviation. Print out word1, word2, word3 and the summary statistics.
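The summary statistics listed above map directly onto numpy calls. The sketch below is one possible arrangement; the `summarise` helper and its dictionary keys are hypothetical, and it assumes the population standard deviation (numpy's default, `ddof=0`), since the issue does not say which variant is wanted.

```python
import numpy as np

# Percentiles requested in the issue; 50 is the median.
PERCENTILES = [0.1, 1, 25, 50, 75, 99, 99.9]


def summarise(distances) -> dict:
    """Return the requested summary statistics for a sequence of distances."""
    d = np.asarray(distances, dtype=float)
    p = np.percentile(d, PERCENTILES)  # one call computes all percentiles
    return {
        "min": float(d.min()),
        "p0.1": float(p[0]),
        "p1": float(p[1]),
        "p25": float(p[2]),
        "median": float(p[3]),
        "mean": float(d.mean()),
        "p75": float(p[4]),
        "p99": float(p[5]),
        "p99.9": float(p[6]),
        "max": float(d.max()),
        "std": float(d.std()),  # population standard deviation (assumption)
    }


stats = summarise(np.arange(101))
print("word1", "word2", "word3", *(f"{k}={v:.3f}" for k, v in stats.items()))
```

With only a few hundred distances the 0.1th and 99.9th percentiles are interpolated between the most extreme values, so they will sit very close to the minimum and maximum.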

Checklist

- [X] Create `language_plane_finder.py` ✓ https://github.com/solresol/wordplanes/commit/c99fad6257a7c8a32c7b441d2fec44e8ff68f940

🚀 Here's the PR! #67

GitHub Actions failed

The sandbox appears to be unavailable or down.

Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant, in decreasing order of relevance. If some file is missing from here, you can mention the path in the ticket description.

• https://github.com/solresol/wordplanes/blob/55eda2249731fe8b1a4fce01e3d00a7a983453a3/wordnet_vocab.py#L3-L44
• https://github.com/solresol/wordplanes/blob/55eda2249731fe8b1a4fce01e3d00a7a983453a3/plane.py#L3-L14
• https://github.com/solresol/wordplanes/blob/55eda2249731fe8b1a4fce01e3d00a7a983453a3/embeddings2sqlite.py#L4-L33

Step 2: ⌨️ Coding

[X] Create language_plane_finder.py ✓ https://github.com/solresol/wordplanes/commit/c99fad6257a7c8a32c7b441d2fec44e8ff68f940
Create language_plane_finder.py with contents:
• Start by importing the necessary modules and classes. This includes `argparse` for command-line argument parsing, `numpy` for numerical operations, `random` for subsampling, `sqlite3` for database operations, and the `GloveLookup` class from `wordnet_vocab.py` and the `Plane` class from `plane.py`.
• Implement command-line argument parsing using `argparse.ArgumentParser`. The script should take six arguments: `--sqlite-database`, `--table`, `--dimensionality`, `--subsample`, `--random-seed`, and `--part-of-speech`. The `--subsample` and `--part-of-speech` arguments should have default values of `None`.
• Create a `GloveLookup` object using the `--sqlite-database`, `--table`, and `--dimensionality` arguments.
• Call the `yield_word_senses` function from `wordnet_vocab.py` with the `--part-of-speech` argument to generate a list of word senses. If the `--subsample` argument is not `None`, randomly subsample the list to the specified size using the `random.sample` function with the `--random-seed` argument as the seed.
• Implement three nested `for` loops to iterate over the vocabulary list. For each combination of three words where word1 < word2 < word3, use the `GloveLookup` object to get the numpy arrays for each word and create a `Plane` object.
• For each other word in the vocabulary, use the `GloveLookup` object to get the numpy array for the word and call the `Plane` object's `distance_to_plane` method. Store the distances in a list or numpy array.
• Inside the triple `for` loop, calculate the summary statistics for the list of distances using `numpy` functions: minimum, 0.1th percentile, 1st percentile, 25th percentile, median, mean, 75th percentile, 99th percentile, 99.9th percentile, max, standard deviation. Print out word1, word2, word3, and the summary statistics.

Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/create_language_plane_finderpy.

solresol / wordplanes

Sweep: Create language_plane_finder.py #66