mischkew / jwpl

Automatically exported from code.google.com/p/jwpl
0 stars 0 forks source link

PageIterator uses slow limit a,b statements in queries #57

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
See report by Samy Ateia in the Google Group:
http://groups.google.com/group/jwpl/browse_thread/thread/79393bdd9fb84de9
(somewhere in the middle of the thread)

Hi Oliver thanks for your quick replies and changes!
I found a new Issue that is costing a lot of time while iterating. The
Hibernate statement in PageIterator.PageBuffer.fillBuffer():

returnValues =
session.createCriteria(de.tudarmstadt.ukp.wikipedia.api.hibernate.Page.class)
                    .add(Restrictions.eq("isDisambiguation", false))
                    .setFirstResult(dataOffset)
                    .setMaxResults(maxBufferSize)
                    .list();

gets really slow when the dataOffset gets bigger. At arround 100000
pages it takes more than 2 minutes!!! to retrieve the next
#maxBufferSize Pages!

It creates the following mysql statements (:
select this_.id as id5_0_, this_.pageId as pageId5_0_, this_.name as
name5_0_, this_.text as text5_0_, this_.isDisambiguation as
isDisamb5_5_0_ from Page this_ limit 40000
select this_.id as id5_0_, this_.pageId as pageId5_0_, this_.name as
name5_0_, this_.text as text5_0_, this_.isDisambiguation as
isDisamb5_5_0_ from Page this_ limit 40000, 40000
select this_.id as id5_0_, this_.pageId as pageId5_0_, this_.name as
name5_0_, this_.text as text5_0_, this_.isDisambiguation as
isDisamb5_5_0_ from Page this_ limit 80000, 40000
select this_.id as id5_0_, this_.pageId as pageId5_0_, this_.name as
name5_0_, this_.text as text5_0_, this_.isDisambiguation as
isDisamb5_5_0_ from Page this_ limit 120000, 40000
and so on

there seems to be a common performance issue with statements that
involve "limit a, b":
http://www.facebook.com/note.php?note_id=206034210932
http://forums.mysql.com/read.php?24,112440,112440
http://stackoverflow.com/questions/2826319/using-hibernates-scrollabl...

The simplest solution seems to be, to avoid limit a,b and just use
limit b and set the offset with a where clause. Which translates in
hibernate to avoid .setFirstResult(a) and only use .setMaxResults(b).

I tried the following Statement and the time for the database
transaction doesn't seem to increase. It rather seems to decrease:
Starts with ~22 seconds for the first 40000 pages (same as the old
query) and takes around 5 seconds for the 40000 pages after 1 million:

returnValues =
session.createCriteria(de.tudarmstadt.ukp.wikipedia.api.hibernate.Page.class)
                    .add(Restrictions.eq("isDisambiguation",
false)).add(Restrictions.gt("id", lastPage))
                    .setMaxResults(maxBufferSize)
                    .list();

this creates the following statements:
select this_.id as id5_0_, this_.pageId as pageId5_0_, this_.name as
name5_0_, this_.text as text5_0_, this_.isDisambiguation as
isDisamb5_5_0_ from Page this_ where this_.id>0 limit 40000
select this_.id as id5_0_, this_.pageId as pageId5_0_, this_.name as
name5_0_, this_.text as text5_0_, this_.isDisambiguation as
isDisamb5_5_0_ from Page this_ where this_.id>82296 limit 40000
select this_.id as id5_0_, this_.pageId as pageId5_0_, this_.name as
name5_0_, this_.text as text5_0_, this_.isDisambiguation as
isDisamb5_5_0_ from Page this_ where this_.id>136907 limit 40000
and so on.

I will test those changes some more tomorrow and try to supply a
patch, I think they look very promising. 

Original issue reported on code.google.com by oliver.ferschke on 8 Sep 2011 at 9:38

GoogleCodeExporter commented 9 years ago
Fixed in r455 
Thanks to the patch provided by Samy Ateia

Original comment by oliver.ferschke on 8 Sep 2011 at 9:39

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 16 Feb 2012 at 1:19