neocl / jamdict

Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
MIT License

How to search for part of speech #22

Closed nlovell1 closed 3 years ago

nlovell1 commented 3 years ago

Hi- how am I able to search by part of speech?

letuananh commented 3 years ago

Hi @thinkingbox12 The current version doesn't support searching by POS, but this is a great suggestion. I'll add POS-based filtering to the future enhancement list. Thank you.

For now you can search by word form and then filter the results by the POS list in each sense object:

from jamdict import Jamdict

jam = Jamdict()
result = jam.lookup('おかえし')
for entry in result.entries:
    for sense in entry.senses:
        print(sense.text(), sense.pos)  # sense.pos is a list of possible POS tags
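
For example, to keep only the entries that have at least one noun sense (a minimal sketch; the 'noun' substring test is illustrative, so check the exact strings that appear in sense.pos):

from jamdict import Jamdict

jam = Jamdict()
result = jam.lookup('おかえし')
for entry in result.entries:
    # keep only the senses whose POS list mentions 'noun'
    noun_senses = [s for s in entry.senses if any('noun' in p for p in s.pos)]
    if noun_senses:
        print(entry, [s.text() for s in noun_senses])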

Update notes

As of 25 May 2021, both iteration search and part-of-speech filtering are available in release 0.1a11, so the workaround provided below is no longer needed.

Documentation for the mentioned features can be found here: https://jamdict.readthedocs.io/en/latest/recipes.html#iteration-search
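
Based on the recipes page linked above, the new usage looks roughly like this (a sketch; argument names may differ slightly between releases, so treat the linked docs as authoritative):

from jamdict import Jamdict

jam = Jamdict()

# iteration search: entries are yielded one by one instead of being
# collected into a single large LookupResult
for entry in jam.lookup_iter('おかえし').entries:
    print(entry)

# part-of-speech filtering at lookup time
for entry in jam.lookup_iter('おかえし', pos=['noun (common) (futsuumeishi)']).entries:
    print(entry)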

nlovell1 commented 3 years ago

Thank you for the reply @letuananh . What, in your opinion, would be the fastest way to iterate through every entry with a given part of speech tag? Say I wanted to get all 'expressions' or all 'Kansai-dialect' phrases or even all 'godan verbs' for example.

letuananh commented 3 years ago

This must be designed carefully. For example, if we search for all available nouns we may end up pulling a big chunk of the database out, and performance may suffer. I'll look at the database stats and may add a limit option to prevent this from happening. I've just released version 0.1a8 and I'll work on this next. I'll get back to you really soon.

nlovell1 commented 3 years ago

Thank you for your reply. Two things:

  1. I would like to contribute where I can. I'm a pretty novice coder, but if it doesn't detract from your workflow, I'd love to help.
  2. What is a current workaround that I could experiment with in the meantime? Some POS have far fewer entries, for example archaic verbs. I don't know if that gets around the problem of pulling out a lot of the database, though. One of my current projects relies on this and I'd love to get something rolling, even if it has poor performance for the time being. Unless, are you saying that the current implementation can't accommodate searching by POS at all, even with a temporary workaround?

letuananh commented 3 years ago

hi @thinkingbox12. All contributions are welcome. I believe there are many ways you can contribute to jamdict. For instance, you can help find and fix bugs, or suggest and develop cool new features (like what you have done in this thread). It may be a little tricky when we first start working together, but I believe things will get easier over time.

I am aware that Jamdict lacks good documentation and I'm still working on this (the first docs release was yesterday -> https://jamdict.readthedocs.io/). I'll try my best to document it asap, but for now please bear with me.

It's possible to get the functionality you want with the current release by accessing the database directly using the lower-level APIs. These can be used for now, but they are prone to future changes, so please keep that in mind.

When you create a Jamdict object, you have direct access to the underlying databases via these properties:

from jamdict import Jamdict

jam = Jamdict()
jam.jmdict    # jamdict.JMDictSQLite object, for accessing the word dictionary
jam.kd2       # jamdict.KanjiDic2SQLite object, for accessing the kanji dictionary
jam.jmnedict  # jamdict.JMNEDictSQLite object, for accessing the named-entities dictionary

You can perform database queries on each of these databases by obtaining a database cursor with the ctx() function (i.e. a database query context). For example, the following code lists all part-of-speech tags that exist in the database.

# returns a list of sqlite3.Row objects
pos_rows = jam.jmdict.ctx().execute("SELECT DISTINCT text FROM pos")

# access the columns of each row by name
all_pos = [x['text'] for x in pos_rows]

# sort all POS
all_pos.sort()
for pos in all_pos:
    print(pos)

Words (and also named entities) can be retrieved directly using their idseq. Each word may have many Senses (meanings), and each Sense may have many pos tags.

# Entry (idseq) --(has many)--> Sense --(has many)--> pos

You may also look at the database schema here: https://raw.githubusercontent.com/wiki/neocl/jamdict/images/jamdict_db_schema.png
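
For instance, a single entry can be fetched directly by its idseq (a minimal sketch; the idseq value below is illustrative, so substitute any idseq from the database):

from jamdict import Jamdict

jam = Jamdict()
word = jam.jmdict.get_entry(idseq=1001710)  # illustrative idseq value
print(word)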

Say we want to get all irregular suru verbs. We can start by finding all Sense IDs with pos = 'suru verb - irregular', and then find all the Entry idseqs connected to those Senses.

Since we hit the database many times (to find the IDs, to retrieve each word, etc.), we should also reuse the database connection via the query context for better performance (note the with jam.jmdict.ctx() as ctx: and ctx=ctx in the code below).

Here is the sample code:

# find all idseqs of lexical entries (i.e. words) that have at least one sense with pos = 'suru verb - irregular'
with jam.jmdict.ctx() as ctx:
    # query all word's idseqs
    rows = ctx.execute(
        query="SELECT DISTINCT idseq FROM Sense WHERE ID IN (SELECT sid FROM pos WHERE text = ?) LIMIT 10000",
        params=("suru verb - irregular",))
    for row in rows:
        # reuse database connection with ctx=ctx for better performance
        word = jam.jmdict.get_entry(idseq=row['idseq'], ctx=ctx)
        print(word)

I hope this helps for now. Please feel free to contact me if you need further information. Have a nice day.

nlovell1 commented 3 years ago

Thank you very much @letuananh. Your examples and explanation are helpful. I'm new to this, but it will become more comfortable over time. What is the best way to start working on the documentation? I'd like to help where I can with that, perhaps with a few of the things I've already worked out.

letuananh commented 3 years ago

Thank you @thinkingbox12. Yes, we can start with improving the documentation. I use Sphinx and RST to write the docs, and everything is stored in the docs folder. Basically, you need to install Sphinx:

# install Sphinx
pip install -U Sphinx

# clone jamdict to your machine
git clone https://github.com/neocl/jamdict
cd jamdict/docs

# build the docs
make dirhtml

# serve the docs locally
python3 -m http.server 7000 --directory _build/dirhtml

Now the docs should be ready to view at http://localhost:7000

You can fork the repository to your GitHub account and work from there. Once you think it is OK, you can create a pull request and I will merge it into the main repo. Please feel free to let me know if you need more info. Thanks again :)

nlovell1 commented 3 years ago

Thank you for the help. I'd love to contribute as soon as I can. I'm nearing the end of my school term, so I think I will be more helpful in the next few weeks. I appreciate your interest in my questions. Have a great day.

nlovell1 commented 3 years ago

Hi- I know we discussed that the solution you presented isn't ideal and is just a workaround for the time being, but I wanted to mention that it gets slower the more you iterate. For example, when finding all ichidan verbs, the first 500 took about 6 seconds to traverse, with each 500 after taking about 4 seconds longer than the last. I noticed traversing all the nouns was especially slow; there were over 100k entries, lol. I thought I'd mention this just in case. Please disregard it if it's just a consequence of the current workaround, before the update rolls out.

letuananh commented 3 years ago

Hi. This is surprising, as it shouldn't take that long to query the ichidan verbs. I made this query benchmark on Replit to try it out, and it takes about 8 seconds on this tiny cloud machine to query all ichidan verbs.

https://replit.com/@tuananhle/jamdict-query-benchmark

Benchmark querying Ichidan verb ...
Start counting ...
Runtime: 9 seconds
Found 3637 Ichidan verb
Start counting ...
Runtime: 8 seconds
Found 3637 Ichidan verb
Start counting ...
...
It takes about 8 seconds to query all Ichidan verb

You can fork that machine to play around with the code :)

As for the common nouns, it will take time to pull a large number of words out of the dictionary anyway. There are several solutions for this (I'm still trying to determine which is the best direction). Here are a few off the top of my head at the moment:

  1. iter_lookup: go through each lexical entry, yield it, and discard it after that. This way users choose what to keep and what to discard instead of holding a large LookupResult object (a rough sketch follows this list). It won't help if the user chooses to keep 100k+ records in memory, though.
  2. partial_lookup: let the users choose what information to pull out of the database (i.e. only kana forms, only the senses, etc.)
  3. force a limit by default (e.g. n=5000-10000)
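
For illustration, option 1 could be approximated today with a generator built on the low-level API from earlier in this thread (a sketch, not the planned implementation):

from jamdict import Jamdict

def iter_entries_by_pos(jam, pos):
    # yield entries one at a time so the caller decides what to keep;
    # nothing accumulates inside this function
    with jam.jmdict.ctx() as ctx:
        rows = ctx.execute(
            query="SELECT DISTINCT idseq FROM Sense WHERE ID IN (SELECT sid FROM pos WHERE text = ?)",
            params=(pos,))
        for row in rows:
            yield jam.jmdict.get_entry(idseq=row['idseq'], ctx=ctx)

jam = Jamdict()
for word in iter_entries_by_pos(jam, 'Ichidan verb'):
    print(word)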

In your case, are you trying to analyse the whole dictionary or something like that? That will definitely take time. Maybe if I know what you are trying to achieve, I can come up with a better solution.

nlovell1 commented 3 years ago

Thank you for the quick benchmark. I don't know what's going on here- I did something similar.

with jam.jmdict.ctx() as ctx:
    # query all word idseqs
    counter = 0
    rows = ctx.select(
        query="SELECT DISTINCT idseq FROM Sense WHERE ID IN (SELECT sid FROM pos WHERE text = ?)",
        params=("Ichidan verb",))
    print("Starting ichidan verbs...")
    for row in rows:
        if counter % 500 == 0:
            print(counter)
        counter += 1
        # reuse the database connection with ctx=ctx for better performance
        word = jam.jmdict.get_entry(idseq=row['idseq'], ctx=ctx)
        ruler.add_patterns([{"label": "ICHIDANVERB", "pattern": x.text} for x in word.kanji_forms])
        ruler.add_patterns([{"label": "ICHIDANVERB", "pattern": x.text} for x in word.kana_forms])
    print("Finished with ichidan verbs...")

I am trying to query through all entries in the dictionary by their primary (maybe secondary) part-of-speech classification. In this task, as it currently stands, I am only interested in the kana/kanji forms and the POS, not anything else, which might save time. The second option you proposed seems reasonable.

Thanks again.

letuananh commented 3 years ago

I don't know what that ruler is, but I suspect ruler.add_patterns() is what is slowing you down. You can try adding all the words to a list first, and then loop through the list and add the forms to the ruler. This way you can benchmark the two parts separately :)
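
A sketch of that split, assuming the same jam and ruler objects as in your snippet above (the two-pass structure is the point; variable names are illustrative):

import time

from jamdict import Jamdict

jam = Jamdict()

# pass 1: pull all the entries out of the database
t0 = time.perf_counter()
words = []
with jam.jmdict.ctx() as ctx:
    rows = ctx.select(
        query="SELECT DISTINCT idseq FROM Sense WHERE ID IN (SELECT sid FROM pos WHERE text = ?)",
        params=("Ichidan verb",))
    for row in rows:
        words.append(jam.jmdict.get_entry(idseq=row['idseq'], ctx=ctx))
t1 = time.perf_counter()
print(f"database pass: {t1 - t0:.1f}s for {len(words)} entries")

# pass 2: feed the ruler in one go ('ruler' is the object from the snippet above)
patterns = [{"label": "ICHIDANVERB", "pattern": form.text}
            for word in words
            for form in list(word.kanji_forms) + list(word.kana_forms)]
ruler.add_patterns(patterns)
t2 = time.perf_counter()
print(f"ruler pass: {t2 - t1:.1f}s for {len(patterns)} patterns")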

letuananh commented 3 years ago

Hi again. I just want to share that both iteration search and part-of-speech filtering are available in release 0.1a11, so the workaround provided in this thread is no longer needed.

Documentation for the mentioned features can be found here: https://jamdict.readthedocs.io/en/latest/recipes.html#iteration-search