ukwa / w3act

w3act is an annotation and curation tool for building web archive collections
Apache License 2.0

Auto-populate Instances for a given Target based on what's in Wayback #231

Closed anjackson closed 9 years ago

anjackson commented 9 years ago

The original intention was that W3ACT would take the 'lead' URL for a site and auto-populate the instances based on what's in Wayback. e.g.

http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.bl.uk/bibliographic/ukmarc.html

Similarly, our internal Wayback instance has an XML query endpoint that allows you to look up the dates we have instances for.

The idea would be to automatically scan for new instances when a Target is visited, but to also have a special URL we can call every night that checks for new instances of all Targets.

anjackson commented 9 years ago

How's that, Peter?

peterwebster commented 9 years ago

@anjackson I wonder whether dynamically checking each target record whenever a user opens it might slow things down. Perhaps check only on creation of a new record? So:

(i) User creates new target record -> ACT checks for instances of that target in Wayback -> creates a new instance record for each

AND

(ii) nightly, ACT checks the Wayback XML endpoint for instances of any targets for which there are not existing instance records (i.e. checking for instances with timestamps since the last check) -> creates a new instance record for each.

@kinmanli how much work do you think this might be?
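For what it's worth, the "timestamps since the last check" part of step (ii) could be as small as the filter below. This is only a sketch: the function name new_since is made up for illustration, and it assumes the capture timestamps have already been pulled out of the Wayback response as 14-digit YYYYMMDDhhmmss values, one per line.

```shell
#!/bin/bash
# Hypothetical helper: read capture timestamps (one per line) on stdin and
# print only those later than the last check. Wayback timestamps are
# 14-digit YYYYMMDDhhmmss values, so numeric order is chronological order.
new_since() {
    local last_check="$1"
    awk -v last="$last_check" '$1 + 0 > last + 0'
}

# Example: only the captures after 5 Feb 2015 are printed.
printf '%s\n' 20150201120000 20150210093000 20150212163239 | new_since 20150205000000
```

Each timestamp that survives the filter would then become one new instance record.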

kinmanli commented 9 years ago

@anjackson if I were to deal with a Target with the URL "https://www.gov.uk/government/publications", is the Wayback URL

"http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=https://www.gov.uk/government/publications"?

I ask because it gives me "Resource Not In Archive".

Or is it the domain?

"http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=https://www.gov.uk/"

Cheers

anjackson commented 9 years ago

Hello, yes, that was the right URL for the public instance, which you can use for testing. Unfortunately, there was a problem with it, but it's fixed now.

However, note that in production, this should be pointing to our internal QA Wayback (which gets to see much more content much earlier than the public Wayback) so the location of this endpoint should be configurable.
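As a rough illustration of a configurable endpoint, something along these lines would do. The variable name WAYBACK_XMLQUERY_URL and the function name are assumptions, not anything W3ACT defines; the default is the public endpoint quoted earlier in this thread, and the Target URL is appended unencoded, as in the examples above.

```shell
#!/bin/bash
# Hypothetical setting: base of the Wayback XML query endpoint.
# Defaults to the public instance; production would override it with the QA one.
WAYBACK_XMLQUERY_URL="${WAYBACK_XMLQUERY_URL:-http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp}"

# Print the lookup URL for a single Target URL.
wayback_query_url() {
    local target_url="$1"
    printf '%s?url=%s\n' "$WAYBACK_XMLQUERY_URL" "$target_url"
}

wayback_query_url "https://www.gov.uk/government/publications"
```

Switching between the public and internal Wayback is then a matter of changing one value rather than editing code.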

peterwebster commented 9 years ago

@kinmanli I created a new target, saved the record; and then hit Get New Instances.

I know this site, and doubt it is in the public Wayback, so this may have returned no new instances. Is that why I get the following?

Execution exception[[RuntimeException: uk.bl.exception.ActException: javax.xml.bind.UnmarshalException - with linked exception: [java.io.FileNotFoundException: http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.fishbournepreschool.org.uk/]]]

There should be better messaging if the check returns no instances.
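To make the request concrete, here is a sketch of the kind of mapping I mean, written as a shell check rather than the real Java code. It assumes the endpoint answers HTTP 404 when it has nothing for the URL (the FileNotFoundException above suggests so, but that is a guess); the function name and the message wording are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: turn the HTTP status of a Wayback lookup into a
# message a curator can act on, instead of a raw exception.
check_instances() {
    local query_url="$1" status
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    status=$(curl -s -o /dev/null -w '%{http_code}' "$query_url")
    case "$status" in
        200) echo "Instances found" ;;
        404) echo "No instances in Wayback for this Target" ;;
        000) echo "Could not reach Wayback" ;;
        *)   echo "Wayback lookup failed (HTTP $status)" ;;
    esac
}
```

Calling check_instances with the xmlquery.jsp URL for a Target would print one of the four messages; curl reports 000 when it gets no HTTP response at all.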

peterwebster commented 9 years ago

@anjackson @kinmanli just to confirm that it would be good if this endpoint was the QA Wayback soonish, so I can test the workflows

peterwebster commented 9 years ago

@anjackson @kinmanli is it going to be possible to see the daily check in action before we go into production? I imagine it might not be straightforward, but I don't think it is running so far: I would expect lots of instances to show up at http://www.webarchive.org.uk/actdev/qa/list?f=bbc.co.uk if it were.

kinmanli commented 9 years ago

QA - http://crawler03.bl.uk:8080/wayback/xmlquery.jsp?url=http://www.bbc.co.uk/

kinmanli commented 9 years ago

Friendlier message when there are no new instances or bad url. Also, set to QA wayback url

anjackson commented 9 years ago

@kinmanli to answer @peterwebster's question, is there a hook URL that will look for new instances for all targets?

kinmanli commented 9 years ago

@anjackson @peterwebster I've created a script to do it (was thinking CRON) https://github.com/ukwa/w3act/blob/schemarefactor/waybackinstance_import.sh which calls a class

Can easily attach it to a URL?

anjackson commented 9 years ago

Oh, that's fine then. I was suggesting a web hook because I thought that might be easier, but you've already written a script so that's simpler to cron. Let's ask @GilHoggarth if he'll add an act-dev hook that will run your script nightly...

GilHoggarth commented 9 years ago

So, I should configure this 'waybackinstance_import.sh' script to run nightly. @kinmanli Given /actdev redeploys/restarts daily at 17:00, when should I schedule this script to run?

kinmanli commented 9 years ago

@GilHoggarth depends when @peterwebster will want to start testing this functionality?

peterwebster commented 9 years ago

@GilHoggarth @kinmanli well, I guess that out of normal hours is the best time, so this evening, after the service has redeployed ? No preference as to exact time

GilHoggarth commented 9 years ago

@kinmanli Looking at the actual 'waybackinstance_import.sh' script, I'm afraid that won't work. The PLAY_HOME envar won't resolve, as this script lives at /opt/w3act/, so 3 levels up is actually /, and I don't have a /tools/ directory for play 2.2.1 (even though, to clarify, we are using v2.2.1). I'm assuming therefore that your dev environment is different from the server environment.

kinmanli commented 9 years ago

@GilHoggarth Yes, just remove my "tools" directory setting in the script, leaving PLAY_HOME= empty, and it will tell you to set your play home directory:

"PLAY_HOME NEEDS TO BE SET"

GilHoggarth commented 9 years ago

@kinmanli But this should be done in the repo, not on the server, so that it is included in future deployments. I'd suggest this should be set as PLAY_HOME=/opt/play, but I don't know if that's what you want it to be. Please test and deploy as appropriate.

kinmanli commented 9 years ago

@GilHoggarth I've tested it and it works locally. I've never tested it on your servers.

But I will set it to /opt/play

peterwebster commented 9 years ago

This also needs documenting in the system installation documentation @kinmanli (if it isn't already)

GilHoggarth commented 9 years ago

Sorry, ticket lost in the mud. Can I test this script now? (Or wait until after the current conference call?)

peterwebster commented 9 years ago

Go ahead @GilHoggarth

GilHoggarth commented 9 years ago

@kinmanli Fraid it immediately dumps a Java exception

[root@actdev01 w3act]# /opt/w3act/waybackinstance_import.sh
:./target/scala-2.10/classes:./conf:./target/scala-2.10/classes_managed:./target/scala-2.10/src_managed:.
Exception in thread "main" java.lang.NoClassDefFoundError: play/db/ebean/Model
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: play.db.ebean.Model
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 6 more
kinmanli commented 9 years ago

@GilHoggarth You are missing the play libraries from your classpath. Are they not somewhere like this?

/opt/play/framework/sbt/boot/scala-2.10.2/com.typesafe.play/console/2.2.1/console_2.10-2.2.1.jar:

GilHoggarth commented 9 years ago

This jar exists, and the directory has five others.

[root@actdev01 w3act]# ll /opt/play/framework/sbt/boot/scala-2.10.2/com.typesafe.play/console/2.2.1/console_2.10-2.2.1.jar
-rw-r--r-- 1 root root 34643 Oct 30  2013 /opt/play/framework/sbt/boot/scala-2.10.2/com.typesafe.play/console/2.2.1/console_2.10-2.2.1.jar

Ah, so the script envar REPO is set to /opt/play, but this is a symlink. Setting PLAY_HOME=/opt/play/ (with a trailing slash) instead makes it progress, at least to the next issue, which reports:

Error: Could not find or load main class uk.bl.export.WaybackExport
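
For the record, the trailing-slash effect can be reproduced in a scratch directory. By default find does not follow a symlink given as its starting point, so the jar search comes back empty; a trailing slash resolves the link to the directory, and find -H is the explicit way to ask for the same thing. The directory names below are throwaway stand-ins, not the real /opt/play layout.

```shell
#!/bin/bash
# Reproduce the problem in a scratch directory; /opt/play itself is not touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/play-2.2.1/lib"
touch "$scratch/play-2.2.1/lib/example.jar"
ln -s "$scratch/play-2.2.1" "$scratch/play"

# Build a colon-separated classpath from every jar under a directory,
# in the same style as the script's CLASSPATH line.
jars_under() {
    find "$1" -name '*.jar' | xargs echo | tr ' ' ':'
}

no_slash=$(jars_under "$scratch/play")           # symlink itself: find does not descend
with_slash=$(jars_under "$scratch/play/")        # trailing slash resolves to the directory
follow=$(find -H "$scratch/play" -name '*.jar')  # -H follows symlinks named on the command line

echo "without slash: '$no_slash'"
echo "with slash:    '$with_slash'"
echo "with -H:       '$follow'"
rm -rf "$scratch"
```

An empty classpath from the first form would explain the earlier NoClassDefFoundError for play/db/ebean/Model.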
kinmanli commented 9 years ago

@GilHoggarth this line should pick up the class "./target/scala-2.10/classes"

CLASSPATH=${CLASSPATH}:./target/scala-2.10/classes:./conf:./target/scala-2.10/classes_managed:./target/scala-2.10/src_managed:.
GilHoggarth commented 9 years ago

The file exists:

[root@actdev01 ~]# find /opt/w3act/ -name "WaybackExport.*" -ls
  3788   28 -rw-r--r--   1 root     root        27690 Feb  3 17:03 /opt/w3act/target/scala-2.10/api/uk/bl/export/WaybackExport.html
  2375    8 -rw-r--r--   1 root     root         5862 Feb  3 17:02 /opt/w3act/target/scala-2.10/classes/uk/bl/export/WaybackExport.class
  3867   28 -rw-r--r--   1 root     root        27690 Feb  3 17:03 /opt/w3act/target/universal/stage/share/doc/api/uk/bl/export/WaybackExport.html
396127    4 -rw-r--r--   1 root     root         2676 Feb  3 17:00 /opt/w3act/app/uk/bl/export/WaybackExport.java

But still, the waybackinstance_import.sh doesn't execute:

/opt/w3act/waybackinstance_import.sh
:./target/scala-2.10/classes:./conf:./target/scala-2.10/classes_managed:./target/scala-2.10/src_managed:.
Error: Could not find or load main class uk.bl.export.WaybackExport
kinmanli commented 9 years ago

@GilHoggarth Try changing the classpath to this.

CLASSPATH=${CLASSPATH}:/opt/w3act/target/scala-2.10/classes:/opt/w3act/conf:/opt/w3act/target/scala-2.10/classes_managed:/opt/w3act/target/scala-2.10/src_managed:.

GilHoggarth commented 9 years ago

Now produces

:/opt/w3act/target/scala-2.10/classes:/opt/w3act/conf:/opt/w3act/target/scala-2.10/classes_managed:/opt/w3act/target/scala-2.10/src_managed:.
Exception in thread "main" java.lang.NoClassDefFoundError: play/db/ebean/Model
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: play.db.ebean.Model
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 6 more
kinmanli commented 9 years ago

@GilHoggarth isn't that the same exception you got yesterday with the var REPO?

GilHoggarth commented 9 years ago

Yep, looks very similar...

kinmanli commented 9 years ago

@GilHoggarth what do you get when you

echo $CLASSPATH
GilHoggarth commented 9 years ago

You mean just in an xterm? Nothing, blank, no previously set values.

GilHoggarth commented 9 years ago

Ah, if you mean from the script when run - :/opt/w3act/target/scala-2.10/classes:/opt/w3act/conf:/opt/w3act/target/scala-2.10/classes_managed:/opt/w3act/target/scala-2.10/src_managed:.

kinmanli commented 9 years ago

@GilHoggarth this line is pulling in the play libs on your server

    CLASSPATH=$(find "${REPO}" -name '*.jar' | xargs echo | tr ' ' ':')
GilHoggarth commented 9 years ago

Added back in the ending slash on PLAY_HOME, plus your full path directories in CLASSPATH above. Script now running (after opening 8080 on crawler03). Will schedule to run every night at 10pm.

GilHoggarth commented 9 years ago

@peterwebster Can you test this import wayback instances script tomorrow please (Thursday, or at least after tonight when it's scheduled to run via cron)?

peterwebster commented 9 years ago

Hi @GilHoggarth @kinmanli : I think I would expect to see some instance records showing up here

http://www.webarchive.org.uk/actdev/qa/list?f=bbc.co.uk

and then here:

http://www.webarchive.org.uk/actdev/instances/listbytarget/?s=title&t=2973

I'm not clear whether this means that (a) there are no instance records to show, or (b) there are, but the UI is not looking in the right place for them.

kinmanli commented 9 years ago

When I query my local DB there aren't any Instances for Target 2973. I assume it'd be the same for prod? @GilHoggarth

select * from instance where target_id=2973
GilHoggarth commented 9 years ago

Likewise, no instances for this target. And the instance table has 1233 records fyi.

anjackson commented 9 years ago

@GilHoggarth are there any crashes or logging when the cron job runs?

GilHoggarth commented 9 years ago

We're talking about waybackinstance_import.sh, I believe. It generates ~/logs/application.log, which does not contain any [ERROR] or [WARN] entries. The cron job starts at 11pm and the log was written at 11.46pm.

peterwebster commented 9 years ago

@GilHoggarth @kinmanli is it possible to get some numbers of how many new instances it detected for the last few nights? @PsypherPunk will correct me, but I think we have a number of targets on daily schedules, so there should be at least some new instances each night.

GilHoggarth commented 9 years ago

Checking this on the new w3act potential production server, which builds the play application as a standalone application, I see that waybackinstance_import.sh fails to run because it can't find the necessary java class uk.bl.export.WaybackExport. @kinmanli Another class needing identifying in the play dist version (and in this case, inclusion as well as mapping as this class really doesn't exist on the new server.)

kinmanli commented 9 years ago

@GilHoggarth

change the script to:

#!/bin/bash

PLAY_HOME=/opt/play/
JAVA_HOME=/opt/java

# Check envars
if [ -z "$PLAY_HOME" ]; then
    echo "PLAY_HOME NEEDS TO BE SET"
    exit 1
fi

if [ ! -d "$JAVA_HOME" ]; then
    echo "JAVA_HOME not defined or not directory. Exiting."
    exit 1
fi

# Quote the wildcard so that the JVM, not the shell, expands lib/* to the jars in lib/
"${JAVA_HOME}/bin/java" -cp "lib/*" uk.bl.export.WaybackExport
GilHoggarth commented 9 years ago

Currently this still isn't working:

[root@act05 act]# /root/bin/waybackinstance_import.sh
2015-02-12 16:32:39,824 [error] c.j.b.h.AbstractConnectionHook - Failed to obtain initial connection Sleeping for 0ms and trying again. Attempts left: 0. Exception: null.Message:FATAL: database "w3act2" does not exist
Exception in thread "main" Configuration error: Configuration error[Cannot connect to database [default]]
        at play.api.Configuration$.play$api$Configuration$$configError(Configuration.scala:92)
        at play.api.Configuration.reportError(Configuration.scala:570)
        at play.api.db.BoneCPPlugin$$anonfun$onStart$1.apply(DB.scala:252)
        at play.api.db.BoneCPPlugin$$anonfun$onStart$1.apply(DB.scala:243)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at play.api.db.BoneCPPlugin.onStart(DB.scala:243)
        at play.api.Play$$anonfun$start$1$$anonfun$apply$mcV$sp$1.apply(Play.scala:88)
        at play.api.Play$$anonfun$start$1$$anonfun$apply$mcV$sp$1.apply(Play.scala:88)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at play.api.Play$$anonfun$start$1.apply$mcV$sp(Play.scala:88)
        at play.api.Play$$anonfun$start$1.apply(Play.scala:88)
        at play.api.Play$$anonfun$start$1.apply(Play.scala:88)
        at play.utils.Threads$.withContextClassLoader(Threads.scala:18)
        at play.api.Play$.start(Play.scala:87)
        at play.core.StaticApplication.<init>(ApplicationProvider.scala:52)
        at uk.bl.export.WaybackExport.main(WaybackExport.java:84)
Caused by: org.postgresql.util.PSQLException: FATAL: database "w3act2" does not exist
        at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:471)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:112)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:66)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:125)
        at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
        at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:22)
        at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:32)
        at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:24)
        at org.postgresql.Driver.makeConnection(Driver.java:393)
        at org.postgresql.Driver.connect(Driver.java:267)
        at java.sql.DriverManager.getConnection(DriverManager.java:571)
        at java.sql.DriverManager.getConnection(DriverManager.java:215)
        at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:363)
        at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
        at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
        at play.api.db.BoneCPPlugin$$anonfun$onStart$1.apply(DB.scala:245)
        ... 17 more
kinmanli commented 9 years ago

@GilHoggarth this is because it is reading in application.conf (which points at the w3act2 database). We need to figure out a way to pass in prod.conf to:

new play.core.StaticApplication(new java.io.File("."));

anjackson commented 9 years ago

Setting

-Dconfig.file=/path/to/file.conf

should always work, I think.

kinmanli commented 9 years ago

@anjackson that only works with play files. This is called within code.

Off topic: it usually works, but I tried that setting with a play distribution and it kept using application.conf when I pointed it to prod.conf.

kinmanli commented 9 years ago

From Play

application.conf is read from the classpath. As long as you ensure that the right application.conf is in the root of the classpath, that will work. You can specify a different file to use by setting the config.file and config.resource system properties. This can be done either when invoking the JVM, or programmatically.

Other than that, StaticApplication is as it is purposefully. It creates a DefaultApplication, which has very little logic, but mostly just uses mixins to bring in different default functionality, including the functionality to load the configuration. There is nothing stopping you from creating your own ApplicationProvider that instantiated your own default application mixed in with whatever logic you wanted to use to lookup the configuration.
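
Going by that, setting the property "when invoking the JVM" would look roughly like the sketch below for the import script. The W3ACT_CONF variable and the prod.conf location are assumptions for illustration, and whether StaticApplication picks the property up in the dist build has not been confirmed in this thread.

```shell
#!/bin/bash
JAVA_HOME="${JAVA_HOME:-/opt/java}"
# Hypothetical variable: the config file to load instead of application.conf.
# The path is a placeholder; the thread does not say where prod.conf lives.
W3ACT_CONF="${W3ACT_CONF:-/opt/w3act/conf/prod.conf}"

run_import() {
    # -Dconfig.file must come before the main class; anything after the
    # class name is passed to WaybackExport as an argument instead.
    "${JAVA_HOME}/bin/java" "-Dconfig.file=${W3ACT_CONF}" \
        -cp "lib/*" uk.bl.export.WaybackExport
}
```

Calling run_import from the directory that contains lib/ would start the import with the chosen config file.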