open-research / sumatra

http://neuralensemble.org/sumatra/
BSD 2-Clause "Simplified" License
127 stars 48 forks source link

Performance improvement for smt list #311

Closed babsey closed 8 years ago

babsey commented 8 years ago

cf discussion on #309 and #310

The concept of this PR is to improve performance for smt list. I would like to make this PR step by step. After merging I will continue writing codes for this PR.

The first step is to stop the constant writing in file. Next steps will be limiting amount of records for print or filtering by date, script file etc...

Which of these steps do you desire?

maxalbert commented 8 years ago

I hadn't been aware of smt list writing to a file so far. As far as I can tell the only purpose of the generated file .smt/labels is to allow tab completion of labels (see bin/smt-complete.sh).

I can't comment on the performance implications of writing to a file in smt list. My hunch is that the overhead of this is small compared to the time it takes to extract the records from the database (at least for Django record stores), but I haven't profiled this so I may be wrong.

Regardless, if the purpose of .smt/labels really is to allow tab completion then it seems wrong to have this file generated by smt list because this requires a manual step by the user and the file will easily get out of sync with existing records. Instead, it should be generated and automatically updated by smt run and smt delete commands. @apdavison, do you agree with this or am I missing something?

apdavison commented 8 years ago

As far as I can tell the only purpose of the generated file .smt/labels is to allow tab completion of labels (see bin/smt-complete.sh)

This is correct.

The reason this slows things down is not the writing to file; it is that the current implementation ends up calling project.find_records() twice.

it seems wrong to have this file generated by smt list because this requires a manual step by the user and the file will easily get out of sync with existing records. Instead, it should be generated and automatically updated by smt run and smt delete commands.

I agree with this. The smt run part is easy, we just append to the end of .smt/labels. Deletion is trickier: profiling will be needed to see whether it is more efficient to rewrite the file with the deleted labels removed or just regenerate the entire file.

apdavison commented 8 years ago

Concerning the structure of this PR, I would prefer to have a separate PR for each step.

I suggest renaming this pull request to something like "Make bash completion optional" and opening a new one for filtering records (cf #309).

maxalbert commented 8 years ago

Firstly, if this PR is about making bash completion optional then the command line switch should have a different name to reflect this. (Oh never mind, I just saw that @apdavison already suggested this in a line comment.)

However, making bash completion optional never was the intention of this PR - it was meant to improve performance. I'm not completely opposed to making it optional, but I think we should avoid introducing lots of options just because we can. If it's cheap in terms of performance to keep the bash completion functionality then why have a switch to disable it? (Again, I'm not sure if it really is cheap, but it sounds to me like the performance-critical parts are unrelated to actually writing to the file .smt/labels, which is the only part that affects bash completion).

babsey commented 8 years ago

maxalbert is right. Let's keep it enabled. There is a way for fast writing to a file. Can you tell me if you can confirm an improvement of the last commit?

Sorry for re-renaming the title.

maxalbert commented 8 years ago

I definitely see a speed improvement with @babsey's latest change. In a project with 432 records, smt list takes 60 seconds with project.format_records() and 45 seconds with project.record_store.labels(). I'd be happy to merge this as a first step, although I still think we should break up the writing of .smt/labels into incremental updates in smt run and smt delete.

Generally speaking, 45 seconds for a simple smt list is still pretty sluggish. If we are consistent with keeping .smt/labels up to date then we could consider simply returning its contents unless the --long or --table options are specified, which would at least make a simple smt list make much faster. However, at the end of the day this only fixing the symptoms of the recordstore database being slow so ideally we should tackle the root of the issue.

Btw, if project.record_store.labels() turns out something we intend to use in more places then it could be worth adding a new method project.get_labels() which calls this under the hood - just to have slightly better encapsulation and looser coupling between classes.

apdavison commented 8 years ago

@maxalbert: I agree on all points.

@babsey: please could you squash the commits? Once that's done I can merge this PR, or I could leave it open for further work. Let me know what you would prefer.

babsey commented 8 years ago

I also agree. I wrote a new method get_labels() for the project object.

Futhermore, I made the next step for improved speed of plain smt list command. This should be a very fast. The question is where I can implement these code lines. I am open-minded for your suggestions.

@apdavison: I never squashed commits but if you agree on these steps. I will try to squash these commits

maxalbert commented 8 years ago

@babsey I haven't looked at the latest changes yet, but in terms of how to squash commits: I personally really like using git's interactive rebase for this. If you run

git rebase -i master

This will open up an editor which will show you all the commits on your curent branch that are on top of master, and each of them is prepended by "pick". If you keep the first one as "pick" and change all the subsequent ones to "s" or"squash" then this will combine all the commits on top of master into a single one. (Maybe make a backup of your repo before you try it the first time just in case something goes wrong.)

Once you have squashed the commits you will need to push the branch to Github again:

git push --force origin per_impr_CLI

The --force is important, otherwise git will complain because you are trying to push some changes that don't build upon changes that are already present on Github.