pyrevitlabs / pyRevit

Rapid Application Development (RAD) Environment for Autodesk Revit®
http://wiki.pyrevitlabs.io
GNU General Public License v3.0

Discussion: Ideas for optimizing loading > Cache Objects #123

Closed · gtalarico closed this issue 8 years ago

gtalarico commented 8 years ago

Background

Load time, even with verbose off, is long, considering some people like me restart Revit several times a day. Those seconds add up. A few people I have shared pyRevit with have mentioned this. Ideally, loading the scripts should be as seamless as for other addins.

Proposal:

Besides continuing to update the loading methods (reading files only once, improving code performance, etc.), I would like to propose the idea of creating a cache. In most reboots the contents of the pyRevit folder remain unchanged, so the results of processing the folders could be cached. If they are unchanged, the cache is loaded into memory and the ribbon is rebuilt without needing to re-scan directories, find docstrings, process icon file names, etc.

Implementation

After processing paths, files, icons, docstrings, versions, etc., all the objects are pickled/serialized and dumped into a cache folder, stored with some sort of hash that identifies the specific contents. Next time Revit starts, it first checks whether a hash derived from the current files matches the one stored with the pickled data. If nothing has changed, it just reloads those objects into memory without having to reprocess the file tree.
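
A rough sketch of that save/check/load flow (the cache file name and function names here are made up for illustration, not pyRevit's actual API):

import os
import pickle

CACHE_FILE = 'pyrevit_cache.pkl'  # hypothetical cache location

def save_cache(session, dir_hash):
    # Store the directory hash alongside the processed session objects.
    with open(CACHE_FILE, 'wb') as f:
        pickle.dump({'hash': dir_hash, 'session': session}, f)

def load_cache(current_hash):
    # Return the cached session if the directory hash still matches;
    # otherwise return None to signal that a full re-scan is needed.
    if not os.path.exists(CACHE_FILE):
        return None
    with open(CACHE_FILE, 'rb') as f:
        cached = pickle.load(f)
    if cached['hash'] == current_hash:
        return cached['session']
    return None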

Hash

It could be generated from folder sizes, modification dates, or perhaps just by zipping the entire folder and computing an MD5 hash of the zip file (probably the most effective and efficient option if the folder is small).

Thoughts?

eirannejad commented 8 years ago

Is there any way the __init__.py script could be compiled to an __init__.pyc binary at the first load and run from that the next time, to improve execution time?

Caching/hashing is a good idea, but I need to run the numbers to see if it's really more efficient than the current system. I've already modified the __init__ to open each script only once, so that's already taken care of.

But I agree with you. The loading should be like other addins...Let's keep this discussion open.

eirannejad commented 8 years ago

Another option is to completely rewrite the __init__ in C# as I mentioned before. What are your thoughts on that?

eirannejad commented 8 years ago

I take that back. Caching is better and we can get it done quicker.

eirannejad commented 8 years ago

Okay. I'm working on JSON serializing the tab, panel, group, and script objects I'm creating during the init process so the next runs can quickly unwrap the objects from the json file and create the UI.

I need to figure out a way to determine if anything has changed. I'm thinking of finding tabs and creating a (folder_size, script_count, icon_count, subfolder_count) tuple signature... and if it's the same as the recorded signature for the tab, it'll just load from JSON.

Individual tabs will get their own JSON file and will be cached independently.
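
A rough sketch of what that signature check could look like (illustrative only, not the actual implementation):

import os

def get_tab_signature(tab_dir):
    # Build a (folder_size, script_count, icon_count, subfolder_count)
    # tuple describing the current state of one tab directory.
    folder_size = script_count = icon_count = subfolder_count = 0
    for root, dirs, files in os.walk(tab_dir):
        subfolder_count += len(dirs)
        for filename in files:
            folder_size += os.path.getsize(os.path.join(root, filename))
            if filename.endswith('.py'):
                script_count += 1
            elif filename.endswith('.png'):
                icon_count += 1
    return (folder_size, script_count, icon_count, subfolder_count)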

What do you think?

gtalarico commented 8 years ago

@eirannejad Can you not pickle the entire PyRevitUISession object? That was my initial thought.

As for comparing changes: I think counting sizes, items, etc. would work, but my gut instinct is that zipping + hashing would be the fastest way.

eirannejad commented 8 years ago

Agreed. I'll start with serializing the whole session and using the MD5 function for the directory hash.

gtalarico commented 8 years ago

Just tested this. It's really fast (I have to ignore .git because that directory has hundreds of files, and it adds about 2 seconds). Problem is, the hash is coming out different every time; not sure why... :confused:

Done in: 0.032999515533447266 Hash: fd364c9d1465cc66b37a70177f89e0d3

# http://stackoverflow.com/questions/1855095/how-to-create-a-zip-archive-of-a-directory
import os
import zipfile
import hashlib
import time

def zipdir(path, ziph):
    # Walk the tree and add every file to the zip, skipping the .git directory.
    for root, dirs, files in os.walk(path):
        if '.git' not in root:
            for file in files:
                ziph.write(os.path.join(root, file))

t0 = time.time()

zipf = zipfile.ZipFile('test.zip', 'w', zipfile.ZIP_DEFLATED)
zipdir('.', zipf)
zipf.close()

with open('test.zip', 'rb') as f:
    file_hash = hashlib.md5(f.read()).hexdigest()

t1 = time.time()
total = t1 - t0

print('Done in:', total)
print('Hash:', file_hash)

eirannejad commented 8 years ago

How's the speed on the zipping?

eirannejad commented 8 years ago

My MD5 hash comes out the same every time... :/

gtalarico commented 8 years ago

Never mind, it's working now. I was hashing the whole folder, and the hash kept changing because the zip file itself was being added to it.

zipping + hashing takes 0.032999515533447266 sec


# http://stackoverflow.com/questions/1855095/how-to-create-a-zip-archive-of-a-directory
import os
import zipfile
import hashlib
import time

def zipdir(path, ziph):
    # Walk the tree and add every file to the zip, skipping the .git directory.
    for root, dirs, files in os.walk(path):
        if '.git' not in root:
            for file in files:
                ziph.write(os.path.join(root, file))

t0 = time.time()

# Zip only the tab folder this time, not the whole working directory.
zipf = zipfile.ZipFile('pyrevitplustab.zip', 'w', zipfile.ZIP_DEFLATED)
zipdir('pyrevitplus.tab', zipf)
zipf.close()

with open('pyrevitplustab.zip', 'rb') as f:
    file_hash = hashlib.md5(f.read()).hexdigest()

t1 = time.time()
total = t1 - t0

print('Done in:', total)
print('Hash:', file_hash)

eirannejad commented 8 years ago

Okay!...I'll use this. Thanks

gtalarico commented 8 years ago

Awesome. On pickling: I have run into some issues pickling complex objects before; hopefully it will work. Also, I was just reading the pickle docs, and it looks like cPickle is a lot faster and seems to work on IronPython.
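
For reference, the usual fallback idiom (on implementations where cPickle is available):

# Prefer the faster C implementation of pickle when it's available.
try:
    import cPickle as pickle
except ImportError:
    import pickle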

eirannejad commented 8 years ago

I'm thinking of using JSON since it's human-readable and easier to debug. I kinda like the idea that the serialized session is a uniform format that is readable by other processes as well.

Let's see how fast it is...
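
Something along these lines (the object shapes here are made up; the real tab/panel/script objects would be converted to plain dicts first):

import json

# Hypothetical shape of a serialized tab; illustration only.
session = {
    'tabs': [
        {'name': 'pyrevitplus.tab',
         'panels': [{'name': 'Tools', 'scripts': ['Script1.py']}]}
    ]
}

with open('tab_cache.json', 'w') as f:
    json.dump(session, f, indent=2)  # human-readable, easy to diff and debug

with open('tab_cache.json', 'r') as f:
    restored = json.load(f)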

gtalarico commented 8 years ago

sounds good. hope it works.

Switching gears to another optimization opportunity: report()

1

I just noticed that even with verbose off, the "|" character that indicates activity adds a lot of overhead. Removing that single print line alone decreased load time by ~60%, from 3.3 sec to 2.1 sec. I think that's a huge improvement for such a small change. If we can get load time under 2 sec, who needs a progress bar :)

2

Is there a reason you are using custom logic (report, reportv) instead of the built-in logging module? It would remove all the conditional statements (i.e. if verbose) that run every time a report is called, which is 100+ times. You would just use logger.info and logger.debug, and then logger.setLevel(INFO/DEBUG). It would be a pain to refactor, and if caching works, probably not worth it anyway!
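
A minimal sketch of that approach with the standard logging module (the logger name and messages are made up):

import logging

logging.basicConfig(format='%(levelname)s: %(message)s')
logger = logging.getLogger('pyrevit')

verbose = False  # would come from the user setting
logger.setLevel(logging.DEBUG if verbose else logging.INFO)

logger.info('Loading scripts...')        # always shown, like report()
logger.debug('Parsing icon: icon.png')   # shown only when verbose, like reportv()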

eirannejad commented 8 years ago

1

Honestly, I have a problem. If I disable the verbose reporting altogether, my Revit closes at startup with no message. I've been looking into the code but can't figure out how NOT printing a message causes Revit to crash.

2

Because I'm not as good as you, dude! :) I have learned Python by myself through trial and error and never had the time to dive into the more advanced modules as much. I'm open to all suggestions though, and appreciate any help I can get :) Thanks for telling me about the logging module btw. I'll look into it.

gtalarico commented 8 years ago

1. That's bizarre. Just setting verbose to off crashes it?
2. Thanks, but not true! Half of the stuff you have done I can't even figure out how you did it :) As for logging, I only picked it up because when you build web servers you have to have a good handle on logging, otherwise you can't track anything. The package manager has a simple implementation you can look at. But I don't think it's worth changing this now; I was just curious if you had a reason for it.

gtalarico commented 8 years ago

hey @eirannejad I have been toying with the loader, caching, etc. I played with a few methods for detecting directory changes. Although zip + hash is fairly efficient, I think I found a faster method: it just sums the modification times (in seconds) of all relevant files and folders.

import os
import re
import logging

logger = logging.getLogger(__name__)

def get_hash_from_dir(script_dir):
    "Creates a unique hash to represent the state of a directory."
    logger.info('Generating hash of directory')
    # Only paths mentioning .panel/.tab/.png/.py are relevant to the UI.
    pat = r'(\.panel)|(\.tab)|(\.png)|(\.py)'
    hash_sum = 0
    for root, dirs, files in os.walk(script_dir):
        if re.search(pat, root, flags=re.IGNORECASE):
            hash_sum += os.path.getmtime(root)
            for filename in files:
                modtime = os.path.getmtime(os.path.join(root, filename))
                hash_sum += modtime
    return hash_sum

eirannejad commented 8 years ago

Okay. A working cache system is implemented in the loaderCaching branch. It saves and reads JSON caches for each tab it can find. It cut load time down to 7 seconds with verbose reporting and 3.5 s without. Thanks for the hash algorithms. For consistency I'm making an MD5 from the hash_sum to create a more standard hash.
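
Presumably something along these lines (a sketch; the hash_sum value here is just an example):

import hashlib

# Convert the float mtime sum from get_hash_from_dir() into a
# standard 32-character MD5 hex digest.
hash_sum = 1473552000.123  # example value
dir_hash = hashlib.md5(str(hash_sum).encode('utf-8')).hexdigest()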

Switch branch in your __init__ folder to loaderCaching and test it out.