jmsv / ety-python

A Python module to discover the etymology of words
http://ety-python.rtfd.io
MIT License

Restructure data for performance #24

Closed · jmsv closed this 6 years ago

jmsv commented 6 years ago

At the moment, the JSON dataset is structured as follows:

[
  {
    "a_lang": "eng",
    "a_word": "potato",
    "b_lang": "tnq",
    "b_word": "batata"
  },
  { ...

This is loaded as a Python list of dicts and filtered using:

row = list(filter(
    lambda entry: entry['a_word'] == self.word
    and entry['a_lang'] == self.language.iso,
    etymwn_data))

If the data were restructured so that words acted as dict keys, referencing words would be much faster: dict lookups are constant time on average because dicts are implemented as hash tables, whereas the filter above scans every entry.

Data could instead be structured by language, then by word, as follows:

{
    "lang":{
        "word":[
            {
                "origin-word":"origin-lang"
            }
        ]
    }
}

For example:

{
    "eng":{
        "airport":[
            {"air":"eng"},
            {"port":"eng"}
        ],
        "banana":[
            {"banaana":"wol"}
        ]
    },
    "lat":{
        "fructus":[
            {"fruor":"lat"}
        ]
    }
}
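
For reference, restructuring the current flat list into this shape would be a one-pass transform (a sketch reusing the field names from the current dataset; nested is a hypothetical name):

# One pass over the current flat list, building the nested layout
nested = {}
for entry in etymwn_data:
    nested.setdefault(entry["a_lang"], {}).setdefault(
        entry["a_word"], []).append({entry["b_word"]: entry["b_lang"]})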

Origin words are kept in individual single-entry dicts to prevent key collisions, e.g. if the same origin word appeared with two different origin languages.
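
For illustration, a lookup under this layout replaces the linear filter with direct hash lookups (a sketch; data and the .get() fallbacks are assumptions, not existing code):

# Example data from above
data = {
    "eng": {
        "airport": [{"air": "eng"}, {"port": "eng"}],
        "banana": [{"banaana": "wol"}],
    },
}

# Direct hash lookups; .get() avoids a KeyError for unknown keys
origins = data.get("eng", {}).get("airport", [])

for origin in origins:
    # each origin is a single-entry dict: {origin_word: origin_lang}
    for word, lang in origin.items():
        print(word, lang)  # air eng, then port eng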

jmsv commented 6 years ago

Open to suggestions for better ways of structuring it

alxwrd commented 6 years ago

We could potentially expand out the origin dicts:

"eng":{
    "airport":[
        {"word": "air", "lang": "eng"},
        {"word": "port", "lang": "eng"}
    ],
    "banana":[
        {"word": "banaana", "lang": "wol"}
    ]
}

I think it could make the loading of origins slightly clearer:

source_origins = data[self.language.iso][self.word]

origins = [
    ety.Word(origin["word"], origin["lang"]) for origin in source_origins
]

vs

source_origins = data[self.language.iso][self.word]

origins = [
    ety.Word(*info) for origin in source_origins for info in origin.items()
]

The downside of expanding out the dicts is that it'll result in a larger file.

jmsv commented 6 years ago

@alxwrd I think it might be better to keep the smaller file and just comment the code to explain what's happening.

Rather than using *info, word and lang could be unpacked explicitly, which is probably more readable (each origin dict only ever holds one pair, so the inner loop yields a single item):

origins = [
    ety.Word(word, lang)
    for origin in source_origins
    for word, lang in origin.items()
]

alxwrd commented 6 years ago

Yeah, that's nice actually 😋

For creating the new data file, is it going to be rebuilt from the original source, or by transforming the current file?

I think it'd be good to start from the source .tsv and create a build_ety_data script that could live either in the repo root or in ety/wn/. The script would fetch the archived data, unpack it, and perform the transform. Then, if there are any updates to the source, the data can easily be regenerated.

jmsv commented 6 years ago

Yeah, I was thinking we'd start from the original source too. I was in touch with the maintainer of the dataset a couple of weeks ago, and apparently a new version will hopefully be released by August.

A script that stays in the repo is definitely a good idea - this would probably be best kept in ety/wn.

It's probably a good idea to only download the dataset if it's not available locally, but with the option to redownload; the original source is quite big, so downloading is time-consuming.
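
A rough sketch of what build_ety_data might look like, combining the conditional download with the transform (the URL, file names, and TSV column layout here are all assumptions, not the real ones):

# build_ety_data.py: rough sketch, not the real script.
# Assumes the source .tsv has rows like:
#     lang: word<TAB>rel:etymology<TAB>origin_lang: origin_word
# The actual column layout may differ. The real source is an archive
# that would need unpacking first; that step is omitted here.
import csv
import json
import os
import urllib.request

SOURCE_URL = "http://example.org/etymwn.tsv"  # placeholder URL
TSV_PATH = "etymwn.tsv"
JSON_PATH = "etymologies.json"


def fetch(redownload=False):
    # Only download if the file isn't already available locally,
    # unless a redownload is explicitly requested
    if redownload or not os.path.exists(TSV_PATH):
        urllib.request.urlretrieve(SOURCE_URL, TSV_PATH)


def transform():
    data = {}
    with open(TSV_PATH, encoding="utf-8") as f:
        for a, rel, b in csv.reader(f, delimiter="\t"):
            if rel != "rel:etymology":
                continue  # skip other relation types
            a_lang, a_word = a.split(": ", 1)
            b_lang, b_word = b.split(": ", 1)
            # nested layout: {lang: {word: [{origin_word: origin_lang}]}}
            data.setdefault(a_lang, {}).setdefault(a_word, []).append(
                {b_word: b_lang})
    with open(JSON_PATH, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)


if __name__ == "__main__":
    fetch()
    transform()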