minimalparts / PeARS

Archive repository for the PeARS project. Please head over to https://github.com/PeARSearch/PeARS-orchard for the latest version.
MIT License

Make the scripts in scorePages faster. #15

Open stultus opened 9 years ago

stultus commented 9 years ago

Right now 'scoreDocs' and 'runScript' take around 13 seconds and 24 seconds, respectively.

nandajavarma commented 9 years ago

I think this has to be prioritized before other enhancements.

minimalparts commented 9 years ago

Absolutely! For a start, the wikiwoods.dm file should just be loaded once. At the moment, it gets loaded every time findBestPear is called -- and even worse, every time a pear is looked at in scorePages (so 3 more times). On my machine, it takes around 2s to load, so that's already 8s gone... :(
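A minimal sketch of the "load it once" idea, assuming wikiwoods.dm is a plain-text file with one word per line followed by whitespace-separated floats (the thread does not specify the format). The name `load_dm` is hypothetical and not taken from the PeARS code; memoising the loader means repeated calls return the already-parsed dictionary instead of re-reading the file.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_dm(path):
    """Parse a .dm file into {word: [floats]}.

    Hypothetical helper: assumes each line is 'word v1 v2 ... vn'.
    The result is cached per path, so the file is read only once.
    """
    space = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            space[fields[0]] = [float(x) for x in fields[1:]]
    return space
```

Callers that previously reloaded the file can call `load_dm(path)` freely; only the first call pays the parsing cost. Because the cached dictionary is shared, callers should treat it as read-only.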

stultus commented 9 years ago

@minimalparts the wikiwoods.dm file is created manually (using some tool), right? What is your opinion about converting it into an SQLite table and querying it?
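One way the SQLite idea could look, as a sketch only: it is not the implementation from any PeARS pull request. It assumes each line of the .dm file is a word followed by whitespace-separated floats, and the table name `vectors`, its columns, and the helpers `build_db` and `lookup` are all hypothetical. The vector is stored as a text column and parsed on lookup, so only the requested word is read rather than the whole file.

```python
import sqlite3


def build_db(dm_lines, db_path=":memory:"):
    """Create a word -> vector table from .dm-style lines (assumed format)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vectors "
        "(word TEXT PRIMARY KEY, vector TEXT NOT NULL)"
    )
    rows = []
    for line in dm_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append((fields[0], " ".join(fields[1:])))
    conn.executemany("INSERT OR REPLACE INTO vectors VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, word):
    """Return the vector for one word as a list of floats, or None."""
    row = conn.execute(
        "SELECT vector FROM vectors WHERE word = ?", (word,)
    ).fetchone()
    return None if row is None else [float(x) for x in row[0].split()]
```

The primary key on `word` gives an indexed lookup, so a query touching a handful of words avoids the roughly two-second full load mentioned above.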

minimalparts commented 9 years ago

Yes, absolutely!

minimalparts commented 9 years ago

Same issue with the doc.dists files. See for example http://aurelieherbelot.net/pears-demo/pearone/doc.dists.txt. But I have no idea... can we also convert those to sqlite and have them downloadable from a website?

minimalparts commented 9 years ago

Actually, I'm talking rubbish, wikiwoods.dm is only called once in scorePages, but that's also totally unnecessary, because it recalculates the distribution of the query, which has already been done in findBestPears. Who wrote this thing? ;-)

I guess what we want is: load wikiwoods.dm when launching the application. Calculate the query's distribution (mkQueryDist) once, in findBestPears, and load the doc.dists files in scorePages.
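The flow described above can be sketched as follows. This is an illustration under stated assumptions, not the PeARS code: the function names mirror the ones mentioned in the thread, but their signatures are invented here, the query distribution is assumed to be the sum of the query words' vectors, and cosine similarity is assumed as the scoring function. `load_doc_dists` is a hypothetical callable standing in for reading a pear's doc.dists file.

```python
import math


def mk_query_dist(query, space):
    """Sum the vectors of the query words found in the semantic space."""
    vecs = [space[w] for w in query.lower().split() if w in space]
    if not vecs:
        return []
    return [sum(column) for column in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is empty or zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_best_pears(query_dist, pear_dists, n=1):
    """Rank pears by similarity to the precomputed query distribution."""
    ranked = sorted(
        pear_dists,
        key=lambda pear: cosine(query_dist, pear_dists[pear]),
        reverse=True,
    )
    return ranked[:n]


def score_pages(query_dist, doc_dists):
    """Score one pear's documents, reusing the same query distribution."""
    return sorted(
        ((cosine(query_dist, vec), url) for url, vec in doc_dists.items()),
        reverse=True,
    )


def search(query, space, pear_dists, load_doc_dists):
    """space is loaded once at startup and passed in, never reloaded here."""
    query_dist = mk_query_dist(query, space)  # computed exactly once
    results = []
    for pear in find_best_pears(query_dist, pear_dists):
        results.extend(score_pages(query_dist, load_doc_dists(pear)))
    return sorted(results, reverse=True)
```

The point of the design is that `space` and `query_dist` are passed down as arguments, so neither the semantic space nor the query distribution is recomputed inside the per-pear scoring step.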

stultus commented 9 years ago

PR #22 introduces an SQLite database for wikiwoods. Let's see how this goes.