juanshishido / okcupid

Analyzing online self-presentation
MIT License

passing objects and checksums #9

Closed juanshishido closed 8 years ago

juanshishido commented 8 years ago

These are some thoughts on how we pass data between the steps we're working on implementing. It was great that the precedent of using pickle files was established. It will help us move faster downstream by letting us sidestep recreating those objects.

What I'm imagining are "main" scripts for each step (e.g., pmi, pca, etc.) with lots of smaller functions for doing specific things. Each of these should both (potentially) create the pickle file and return the object. We can check if the pickle file already exists in a predefined location (some relative path, probably in data/). If it does, the script should just load and return that object instead of trying to recreate it.
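Roughly something like this (just a sketch; create_object stands in for whatever builds a given step's object):

import os
import pickle

def load_or_create(fname, create_object):
    # load the cached pickle if it's already in data/; otherwise build
    # the object, pickle it for next time, and return it
    fpath = os.path.join('data', fname)
    if os.path.isfile(fpath):
        with open(fpath, 'rb') as f:
            return pickle.load(f)
    obj = create_object()
    with open(fpath, 'wb') as f:
        pickle.dump(obj, f)
    return obj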

Because we're iterating and making improvements to our process, our objects might change (e.g., the NaNs in data_matrix). To check that we have the "right" version of those objects, we should compute and check hashes. This can be done using the hashlib module (probably hashlib.md5()). We can create a checksums file for each object (and update it when we make changes to the way the object is created). We'll then use this to check whether the version of the pickle file we have locally is what we should have. If not, the script should recreate the object and the pickle file.
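For the hashing piece, something like this could work (a rough sketch; the checksums file name and layout here are just placeholders):

import hashlib
import json
import os

def md5_of(fpath):
    # hash the pickle's raw bytes
    with open(fpath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def is_current(fname, checksums='data/checksums.json'):
    # True only if the local pickle exists and its hash matches the recorded one;
    # otherwise the object (and its pickle file) should be recreated
    with open(checksums) as f:
        expected = json.load(f)
    fpath = os.path.join('data', fname)
    return os.path.isfile(fpath) and md5_of(fpath) == expected.get(fname)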

Thoughts?

matarhaller commented 8 years ago

I really like the idea of having everything modular and saving out pickle files. It will make our lives a lot easier. The data is now also hopefully getting smaller (since we reduced dimensionality with PCA), so can we just have it checked into the repo?

I don't know anything about hashes, but am happy to learn and to use it. It sounds like a cool idea.

That being said, I think it's a lower priority than getting the algorithms working and having conclusions to present/talk about. Right now we have the cool PMI ngrams and some dimensionality reduction, but no conclusions yet. I was thinking of maybe reaching out to Marti to ask her about PCA (reasonable amount of variance accounted for) and clustering (which algorithms and the iterative clustering she mentioned).

matarhaller commented 8 years ago

So going off of your suggestion, I just wrote a .py script with functions for running PCA and kmeans. It's in my branch, and I'm not merging it because I'm afraid of breaking something. But take a look and let me know if it's sort of what you were going for (feel free to change it / edit it as you see fit).

juanshishido commented 8 years ago

@matarhaller Yes, let's check in the pickle files we can. I agree that it's a lower priority, but the sooner we implement it and get that workflow set up, the more benefit we'll get from it.

Yeah, I need to work on qualitatively describing the clusters. I think it's a good idea to reach out to Marti.

Awesome about the .py files! It looks like you've been pulling from master, so things appear to line up! Merging shouldn't be an issue. I can look into it more closely and pull your stuff into master.

juanshishido commented 8 years ago

Just took a look at cd03df2d15c480850398a6db17e7520afb69365b. It's great! It's short, easy to read, provides nice feedback, and gives the user some control.

I'll work on the hashing and then we can modify these functions slightly.

Let's merge this with master. I can do it later today, if you want.

jnaras commented 8 years ago

Hashing is a good idea. I'm worried about the objects getting corrupted somehow. GitHub does have a file size limit, so we might not be able to push all our objects up here. But maybe Google Drive?

juanshishido commented 8 years ago

@jnaras That's what the checksums would be for. If the hash is not what we expect, then we can assume it's different (a different object or corrupted somehow).

juanshishido commented 8 years ago

matar has been merged with master. I noticed there was a conflict with one file prior to merging so I did the following:

$ git checkout matar
$ git pull origin matar
$ git checkout master
$ git merge -Xtheirs matar
$ git push origin master

matarhaller commented 8 years ago

Cool, thanks!!

juanshishido commented 8 years ago

I'm close to pushing the script that checks the hashes. It works on the datamatrix.pkl file that I already have.

Before pushing, though, I decided to create two new data matrix files, using the code in Calculate PMI features.ipynb. Even though it's supposed to be the same data, the hash values I'm getting back are different. This is because the matrices are slightly different.

>>> Y # based on datamatrix2.pkl
matrix([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ..., 
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

>>> Z # based on datamatrix3.pkl
matrix([[ 0.,  0.        ,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.        ,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.00091158,  0., ...,  0.,  0.,  0.],
        ..., 
        [ 0.,  0.        ,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.        ,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.        ,  0., ...,  0.,  0.,  0.]])

When I compare the matrices in the same session (without loading the pickles), they are equal. So, the slight differences may have to do with the way the data is serialized, but I don't know how that's being done.

Anyway, my work-around is to create the hashes based on the rounded sum (to 8 digits) of the matrices. (I could also do it on the mean, to 16 digits.)
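Roughly (a sketch of the work-around, not necessarily the exact code I'll push):

import hashlib
import numpy as np

def matrix_hash(m, digits=8):
    # hash the matrix's rounded sum rather than its raw bytes, so small
    # serialization differences don't change the checksum
    summary = round(float(np.sum(m)), digits)
    return hashlib.md5(str(summary).encode('utf-8')).hexdigest()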

I'll let you both know when the script is pushed.

juanshishido commented 8 years ago

Three functions added with f2839c23bbd3fc7cee3920fd4cdc2445643c7769: hash_update, hash_get, and hash_same. We'll mostly just use the first and last of these. The code assumes that the pickled objects and the JSON file with the hashes are in data/.

I'm imagining we can use os.path.isfile(fname) and hash_same(fname) to determine whether or not to run the code that creates the pickled objects.

If this (or the code) isn't clear, I can add the function calls to the scripts.

matarhaller commented 8 years ago

Cool - I'm not sure how to call your functions, but I'm happy to incorporate it into my function. Should we incorporate all of this into a single function that is run at the beginning of each of our functions?

juanshishido commented 8 years ago

I'm thinking of wrapping the pickle-making files (e.g., run_pca()) into functions. Something like the following, which assumes the functions are being called from okcupid/:

import os
import pickle

def foo(fname):
    fpath = 'data/' + fname
    # pickle exists
    if os.path.isfile(fpath):
        # same hash
        if hash_same(fname):
            # the script has been updated, so recreate the object
            if script_updated(fpath, 'reduce_dimensions.py'):
                run_pca(fname, force_update=True)
            # the script hasn't been updated, so return the cached object
            else:
                with open(fpath, 'rb') as f:
                    return pickle.load(f)
        # different hash
        else:
            run_pca(fname)
    # pickle does not exist
    else:
        run_pca(fname)

It's not as straightforward—or easy to read—as I had hoped. Sorry.

Most of the logic occurs if the pickled object (identified by fname) exists. If it's the same hash there are two possibilities: (1) the script that creates the object has been updated since the pickled object was created, which implies the object needs to be recreated; (2) the script hasn't been updated, so the object is returned. For this, we need another function.

def script_updated(p, s):
    # check whether the script (s) was modified after the pickled object (p)
    pt = os.path.getmtime(p)
    st = os.path.getmtime(s)
    return st > pt

This gets the last modified time for the pickled object and the script that creates it.

We'll have to modify run_pca() a little bit. We'll have to include hash_update() at the end of it.
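That is, something like this (a sketch; assuming hash_update() takes the pickle's file name, like hash_same() does):

def run_pca(fname, force_update=False):
    # ... existing PCA + pickling code stays the same ...
    # after writing the new pickle, record its hash
    hash_update(fname)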

It might just be easier to make foo() take a function (e.g., run_pca()) as an argument.
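i.e., something like this (a condensed sketch, assuming the pickle-making function returns the object):

def foo(fname, builder):
    # builder is the step's pickle-making function, e.g. run_pca
    fpath = 'data/' + fname
    up_to_date = (os.path.isfile(fpath)
                  and hash_same(fname)
                  and not script_updated(fpath, 'reduce_dimensions.py'))
    if up_to_date:
        with open(fpath, 'rb') as f:
            return pickle.load(f)
    return builder(fname)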

For now, let's not worry about it. I will find some time to make these changes and update the files in master.

juanshishido commented 8 years ago

I'm actually thinking of using the functions to create several variables that get passed in as arguments to run_pca(). Seems like that'll make it cleaner. I'll try to implement it.

juanshishido commented 8 years ago

I added the wrapper function (called make for now) with 76cf6f6e64e7648ca8a9f9c07d2256597d3b7617. I will test using run_pca, which I'm adding a few lines to (for updating the hash). Then, I'll update hashes.json so that it reflects the latest scripts and .pkl files.

matarhaller commented 8 years ago

For some reason I got an error when loading in utils from checksum. It says that the break is outside of a loop (line 61), so I changed that to a return. Might not be the correct change to make, but I just needed to be able to load stuff in.

juanshishido commented 8 years ago

@matarhaller Good find. Sorry about that. I pushed a new version of checksums.py to master last night, which I renamed to hash.py, and didn't mention it since I still needed to make some updates. I think the return is fine. I am going to push a notebook tonight to show how I'm thinking things could work.

juanshishido commented 8 years ago

I added a notebook with a demonstration of how to use the hash checking code.

It might not work with run_kmeans because of how the results can be different each time.

Also, we'll have to modify calculate_pmi_features.py slightly to be able to work with make. Will close this issue once that's done.

juanshishido commented 8 years ago

Just a note for when we update the pickle-making scripts. If the change does not affect the output (e.g., I just had to comment a line in calculate_pmi_features.py), you can do: $ touch <filename>.pkl. This just updates the timestamp of the pickled object so that make doesn't actually try to recreate it. I wish make was smarter....

jnaras commented 8 years ago

Okay! Thanks for the clarification!