ukwa / w3act

w3act is an annotation and curation tool for building web archive collections
Apache License 2.0

Auto-populate Instances for a given Target based on what's in Wayback #231

Closed anjackson closed 9 years ago

anjackson commented 9 years ago

The original intention was that W3ACT would take the 'lead' URL for a site and auto-populate the instances based on what's in Wayback. e.g.

http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.bl.uk/bibliographic/ukmarc.html

Similarly, our internal Wayback instance has an XML query endpoint that allows you to look up the dates we have instances for.

The idea would be to automatically scan for new instances when a Target is visited, but to also have a special URL we can call every night that checks for new instances of all Targets.
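
As a rough illustration of the lookup described above, the query URL for a Target's lead URL could be built like this. This is only a sketch: the function name `wayback_query_url` is hypothetical, and the endpoint is simply copied from the example URL earlier in this comment.

```shell
#!/bin/bash
# Base of the public Wayback XML query endpoint, copied from the example URL above.
WAYBACK_XMLQUERY="http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp"

# Hypothetical helper: print the XML query URL for one Target's lead URL.
# The lead URL is appended as-is, matching the unencoded form used in the example.
wayback_query_url() {
    printf '%s?url=%s\n' "$WAYBACK_XMLQUERY" "$1"
}

wayback_query_url "http://www.bl.uk/bibliographic/ukmarc.html"
```

A nightly job would presumably call something like this once per Target and compare the returned capture dates with the Instances already stored.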

kinmanli commented 9 years ago

@GilHoggarth as @anjackson stated, putting this in the script works. Checked in.

-Dconfig.file=conf/prod.conf

Don't know why I even set a $PLAY_HOME in the script. It was never used. My bad.

GilHoggarth commented 9 years ago

@kinmanli I changed this to -Dconfig.file=/opt/w3act/conf/prod.conf, as the full path is needed, and ran it. It looked like it was running (i.e., it took ~20 seconds), then:

[root@act05 act]# /root/bin/waybackinstance_import.sh
Working Directory = /root/git/act
Exception in thread "main" javax.persistence.PersistenceException: ERROR executing DML bindLog[] error[ERROR: duplicate key value violates unique constraint "uq_instance_url"\n   Detail: Key (url)=(act-12668) already exists.]
        at com.avaje.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:97)
        at com.avaje.ebeaninternal.server.persist.dml.DmlBeanPersister.update(DmlBeanPersister.java:66)
        at com.avaje.ebeaninternal.server.persist.DefaultPersistExecute.executeUpdateBean(DefaultPersistExecute.java:82)
        at com.avaje.ebeaninternal.server.core.PersistRequestBean.executeNow(PersistRequestBean.java:452)
        at com.avaje.ebeaninternal.server.core.PersistRequestBean.executeOrQueue(PersistRequestBean.java:478)
        at com.avaje.ebeaninternal.server.persist.DefaultPersister.update(DefaultPersister.java:365)
        at com.avaje.ebeaninternal.server.persist.DefaultPersister.saveEnhanced(DefaultPersister.java:308)
        at com.avaje.ebeaninternal.server.persist.DefaultPersister.saveRecurse(DefaultPersister.java:280)
        at com.avaje.ebeaninternal.server.persist.DefaultPersister.save(DefaultPersister.java:248)
        at com.avaje.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1568)
        at com.avaje.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1558)
        at com.avaje.ebean.Ebean.save(Ebean.java:453)
        at play.db.ebean.Model.save(Model.java:91)
        at models.ActModel.save(ActModel.java:47)
        at models.Instance.save(Instance.java:35)
        at uk.bl.export.WaybackExport.bulkTargetImport(WaybackExport.java:71)
        at uk.bl.export.WaybackExport.main(WaybackExport.java:85)
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "uq_instance_url"
  Detail: Key (url)=(act-12668) already exists.
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:334)
        at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:205)
        at com.avaje.ebeaninternal.server.type.DataBind.executeUpdate(DataBind.java:55)
        at com.avaje.ebeaninternal.server.persist.dml.UpdateHandler.execute(UpdateHandler.java:80)
        at com.avaje.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:86)
        ... 16 more

PS I haven't checked this full path into github. Please do so if this eventually works.

kinmanli commented 9 years ago

@GilHoggarth I don't get that on here.

What happens after that?

GilHoggarth commented 9 years ago

Not sure what you mean - the script died at this point.

kinmanli commented 9 years ago

@GilHoggarth An existing Instance already has that "act id" (i.e. act-12668). New ones are now generated for the Wayback import. Please check out and redeploy.

kinmanli commented 9 years ago

@GilHoggarth did you manage to re-run this?

GilHoggarth commented 9 years ago

I tried it on a redeployed, data-imported /actdev, but that reported: Could not find or load main class uk.bl.export.WaybackExport.

I then ran it on w3act, which hasn't been redeployed yet and is of course still using the original data imported a week ago, and it reported the same error as the log dump above. I'm currently waiting for @peterwebster to tell me when I can redeploy w3act, but that won't include a data import.

So, given you mention that "A current Instance already has an "act id" ...", is there another way of testing this that doesn't require a data import?

anjackson commented 9 years ago

I'm concerned there is some confusion here. Are we all clear that this Instance import function is entirely separate from the Target import function?

Target import has been done, and should not be repeated in the future, because instead of going to Andy's ACT we should be deploying test W3ACT versions on top of Postgres DB dumps from production W3ACT.

Instance import should run frequently on production W3ACT, and talks only to Wayback in order to pull in instances of existing Targets.

kinmanli commented 9 years ago

@gilhoggarth @anjackson yes, as mentioned above, it only imports from Wayback and creates instances.

The issue you are getting, @GilHoggarth, is the classpath in the Wayback sh script.

kinmanli commented 9 years ago

Hi @GilHoggarth, did you manage to sort the classpath out? My settings are like this:

#!/bin/bash

JAVA_HOME=/usr

# Check envars
if [ ! -d "$JAVA_HOME" ]; then
    echo "JAVA_HOME not defined or not directory. Exiting."
    exit 1
fi

${JAVA_HOME}/bin/java -cp "lib/*" -Dconfig.file=conf/prod.conf uk.bl.export.WaybackExport
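
The relative `lib/*` and `conf/prod.conf` paths above only resolve when the script is started from the install directory, which is consistent with the full path being needed earlier in the thread. A minimal sketch of one way round that, assuming `lib/` and `conf/` sit directly under a single install directory (as on a "play dist" deployment); the function name `run_import` and the `JAVA_BIN` override are hypothetical:

```shell
#!/bin/bash
# Hypothetical wrapper: run the Wayback import using absolute paths
# derived from the install directory passed as the first argument.
run_import() {
    local app_home="$1"
    # JAVA_BIN is an assumed override, handy for a dry run (e.g. JAVA_BIN=echo).
    # Otherwise fall back to $JAVA_HOME/bin/java, defaulting JAVA_HOME to /usr as above.
    "${JAVA_BIN:-${JAVA_HOME:-/usr}/bin/java}" \
        -cp "${app_home}/lib/*" \
        "-Dconfig.file=${app_home}/conf/prod.conf" \
        uk.bl.export.WaybackExport
}

# Example invocation (not executed here):
#   run_import /opt/w3act
```

The classpath wildcard stays quoted so that Java, not the shell, expands it.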

peterwebster commented 9 years ago

For me @gilhoggarth, you can redeploy the prod version of W3ACT (without data import) whenever you like, if it's ok with @anjacks0n.

GilHoggarth commented 9 years ago

@kinmanli It occurred to me that /actdev, where I'm trying to test waybackinstance_import.sh, wasn't built into a 'play dist', so I've added the classpath details back into the /actdev version. The script has now been running for ~5 minutes on /actdev and hasn't reported an issue.

@peterwebster w3act restarted just now, 4.56pm.

peterwebster commented 9 years ago

@GilHoggarth @kinmanli there seem to be many more instance records today, but I won't be able to test this until the Null Pointer error at #263 is fixed.

GilHoggarth commented 9 years ago

@peterwebster @kinmanli That would be me, or at least my action. Ran the waybackinstance_import.sh script on w3act without a failure! It doesn't report anything so I don't know what I did but "many more instance records", assuming relating to wayback, sounds good!

peterwebster commented 9 years ago

So, the most recent instances I'm seeing in /actdev this morning are dated 13 Feb; does that sound about right @PsypherPunk ? I had thought there were some daily crawl schedules, but perhaps they won't have made it into the QA Wayback.

http://www.webarchive.org.uk/actdev/qa/list?f=www.guardian.co.uk

anjackson commented 9 years ago

It should be easy to check this by going to the QA Wayback and checking by hand if the Guardian is up to date.

GilHoggarth commented 9 years ago

Might this be because The Guardian changed their domain? http://www.webarchive.org.uk/actdev/wayback/*/http://www.theguardian.com lists to Feb 22 2015.

peterwebster commented 9 years ago

It may be as @GilHoggarth says that this target is a bad example, having multiple seeds, but it would be worth clarifying (at least) why there are more recent instances at: http://www.webarchive.org.uk/actdev/wayback/*/http://www.theguardian.com/

than listed at http://www.webarchive.org.uk/actdev/instances/listbytarget/?t=790 (the first one in the list)

Any ideas @PsypherPunk @kinmanli @anjackson ?

peterwebster commented 9 years ago

And indeed, all three URLs have instances in Wayback up to 23 Feb, but ACT thinks the most recent is 13 Feb.

kinmanli commented 9 years ago

@GilHoggarth when was the last time the instance import ran?

GilHoggarth commented 9 years ago

Guessing you mean /actdev. Looking into this, I wonder whether it succeeded: the script states prod.conf, but of course on /actdev this should be application.conf.

If you mean w3act (https://act) then I think it was a few days ago. The script in github doesn't state the directory for the prod.conf and evidently has been failing. I thought I'd fixed this, but have now fixed it outside of the script.

Would you like the /opt/w3act/waybackinstance_import.sh script to be run now? On /actdev and/or /act?

kinmanli commented 9 years ago

yes please run on actdev @GilHoggarth

kinmanli commented 9 years ago

@peterwebster Does this look correct now http://www.webarchive.org.uk/actdev/instances/listbytarget/?t=790 ? Latest 23/02/2015 after @GilHoggarth did a wayback import to actdev.

screen shot 2015-02-25 at 17 42 00

peterwebster commented 9 years ago

OK, this looks good to me. If @anjackson is happy, could he close the ticket ?

anjackson commented 9 years ago

@GilHoggarth could you add /opt/w3act/waybackinstance_import.sh to the test deployment script? Then this can be closed.

kinmanli commented 9 years ago

@GilHoggarth fixed (needed to run import)

GilHoggarth commented 9 years ago

/actdev is a "play target" service, so its libraries are scattered, whereas w3act is a "play dist" service with all the libraries in one directory. So the triggering script (that runs the actual script) is slightly different on /actdev compared to w3act. Anyhoo, this is scheduled on both servers as a late night crontab, and both are being run at the moment to test that they work as expected.

GilHoggarth commented 9 years ago

/actdev script was referencing the prod.conf not application.conf, so changed this and rerunning. If run completes as expected, I'll close this ticket.
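
A guard along these lines might have surfaced the wrong config file straight away instead of letting the run fail quietly. It is only a sketch: `require_config` is a hypothetical helper, not something in the w3act scripts.

```shell
#!/bin/bash
# Hypothetical helper: fail fast with a clear message if the config file
# handed to -Dconfig.file does not exist on this server.
require_config() {
    if [ ! -f "$1" ]; then
        echo "Config file not found: $1" >&2
        return 1
    fi
}

# Example use in the triggering script (not executed here):
#   require_config /opt/w3act/conf/prod.conf || exit 1
```

On /actdev the same check would be pointed at application.conf rather than prod.conf.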

GilHoggarth commented 9 years ago

Import on /actdev worked as expected. Closing ticket.

GilHoggarth commented 9 years ago

Reopening so the same test is performed on w3act.

GilHoggarth commented 9 years ago

Ran the waybackinstance_import.sh script on w3act - it logged its progress and claimed "finished". @kinmanli As you know what this does, can you check via the UI that it has done what it should, please?

kinmanli commented 9 years ago

@GilHoggarth if you can make my user an archivist then I'll be able to see the results.

I've just checked the old listing code for instances and the results are only viewable by sysadm and archivist. Is this correct or should it be viewable by all roles @peterwebster?

peterwebster commented 9 years ago

@kinmanli at which point in the UI is the view you're talking about ?

kinmanli commented 9 years ago

@peterwebster

https://www.webarchive.org.uk/act/instances/listbytarget/?t=790 https://www.webarchive.org.uk/act/instances/list/

GilHoggarth commented 9 years ago

@kinmanli Done, you're now a w3act archivist.

kinmanli commented 9 years ago

@GilHoggarth this looks good with the latest ones being 24/02/2015

screen shot 2015-02-27 at 11 00 47

There are 3 from 25/02/2015 on http://www.webarchive.org.uk/actdev/wayback/*/http://www.theguardian.com/ but I guess they will get imported on the next run

anjackson commented 9 years ago

I think instances should be visible to everyone, surely?

nicolabingham commented 9 years ago

Yes, as users at all levels will want to see what has been crawled and when.

GilHoggarth commented 9 years ago

@kinmanli Following Nicola's comment above, I've tested whether https://www.webarchive.org.uk/act/instances/listbytarget/?t=790 is accessible by the admin user and by me (gil.hoggarth, just a user!). I couldn't see the results as me.

kinmanli commented 9 years ago

@GilHoggarth @nicolabingham my user with role 'viewer' can now view and search instances.

checked in

screen shot 2015-02-27 at 13 00 24 screen shot 2015-02-27 at 12 58 54

Also, I removed all trailing slashes '/' (after 'listbytarget'). So:

https://www.webarchive.org.uk/act/instances/listbytarget/?t=790

Is now:

https://www.webarchive.org.uk/act/instances/listbytarget?t=790

Same for:

https://www.webarchive.org.uk/act/instances/list?f=http%3A%2F%2Fwww.islamic-relief.or

GilHoggarth commented 9 years ago

Note: to be tested after the new w3act deployment.

GilHoggarth commented 9 years ago

https://www.webarchive.org.uk/act/instances/listbytarget?t=790 can now be viewed by a 'user' account and by the admin account, but not by a not-logged-in visitor. Running the import script again to check new data still comes in...

GilHoggarth commented 9 years ago

Script run completed and new imports seen - closing ticket.