ukwa / w3act

w3act is an annotation and curation tool for building web archive collections
Apache License 2.0
19 stars 6 forks source link

+ / - signals in target record #241

Closed peterwebster closed 9 years ago

peterwebster commented 9 years ago

Both the NLPD and UKWA licensing tabs have a + / - in the tab.

These indicate when:

(i) at least one NPLD criterion is met;

and

(ii) at least one licence is available ie. License field is not null.

Neither of these is working at the minute: http://www.webarchive.org.uk/actdev/targets/16117

kinmanli commented 9 years ago

License tab is highlighted based on having a "granted" license. Same as previous w3act https://github.com/ukwa/w3act/blob/master/app/views/targets/view.scala.html#L96-L100

2015-01-26 10-45-03_w3act

kinmanli commented 9 years ago

NPLD tab highlighted - based on the previous rules

2015-01-26 10-59-26_w3act

peterwebster commented 9 years ago

Hi @kinmanli : in the licensing tab, it needs to check the License field (not the Open UKWA License Requests field), because some of the data in the License field is either inherited from legacy systems or input manually. The Requests fields will only have data for targets where the license process native to W3ACT has been used. So: green plus sign when the License field is not null.

See http://www.webarchive.org.uk/actdev/targets/15462#licensing

kinmanli commented 9 years ago

@peterwebster it was highlighting it based on the license fields not being null and status being "Granted". I've removed the status check and it works now.

2015-01-27 16-20-33_w3act

kinmanli commented 9 years ago

fix should be in next deploy

peterwebster commented 9 years ago

Hi @kinmanli : I think this got lost along the way, and so /actbeta wasn't the right model to follow in this case.

There are:

(i) targets with UKWA Selective Archive licences (data imported from SPT, the legacy system.) There won't be any more of these. (ii) OpenUKWA Licences - obtained by the process native to W3ACT (iii) licenses like the HLF WWI, which are obtained by a agreement outside ACT and applied manually by the Archivist to a set of targets to which it applies. There may be other such agreements like this one.

Yes, it is in principle possible to have a target to which more than one licence applies - probably where we either have (i) or obtain (ii) individually, and then subsequently strike an agreement that covers it. We wouldn't pursue (ii) if we already had (i).

It should only the Archivist who can edit the field, BTW (it may already be, I haven't checked)

peterwebster commented 9 years ago

It relates to (closed ) #176

peterwebster commented 9 years ago

Hi @kinmanli this one looks fine now: http://www.webarchive.org.uk/actdev/targets/15462#licensing

but there's still an issue here, with the NPLD + / i - it should be negative. http://www.webarchive.org.uk/actdev/targets/16117#crawlpermission

kinmanli commented 9 years ago

@peterwebster please can you post the conditions required for the NPLD +/-? I am just going by previous code and don't know if it is correct?

Thanks

anjackson commented 9 years ago

Which is a bit terrifying, as implementing the NPLD criteria is critical logic. Basically, there are a list of distinct conditions of which any one must be true. Hopefully @peterwebster can verify, but of the top of my head, I believe they are:

The crucial point is that it is a long set of OR operations. i.e. if it's *.uk then it's in scope, OR if it's .london, it's in scope. For multiple URLs, all of them must meet one of the conditions.

peterwebster commented 9 years ago

@anjackson @kinmanli

The four above, plus: (5) Postal Address (set manually to Yes by the user) (6) By Correspondence (also set manually)

In the UI, if any of the three manual tests (these two or professional judgement) is set to Yes, then the following field (which provide the evidence) are then required.

ie. if Postal Address is Yes, then Postal Address URL is required (doesn't need validating) etc

kinmanli commented 9 years ago

Fixed and checked in.

1) Domain name ends in known UK territory (.uk, .london, and .scot I think?) - target.isTopLevelDomain() 2) Is hosted in UK (automatic via GeoIP DB lookup) - target.isUkHosting() 3) Is registered in UK (all that whois stuff) - target.isUkRegistration() these 3 under target.checkManualScope() 4) Professional Judgement. 5) Postal Address (set manually to Yes by the user) 6) By Correspondence (also set manually)

I've also written a few tests to test valid/invalid multiple URLs (UkHostingTest.java, TopLevelDomainTest

peterwebster commented 9 years ago

@kinmanli to implement the validation on the three text field for the three manual test

@peterwebster to test the logic

peterwebster commented 9 years ago

OK: the + / - indicators working well for the three manual tests. http://www.webarchive.org.uk/actdev/targets/1

peterwebster commented 9 years ago

@kinmanli could you double-check that the WHois check for UK registration is working correctly?

I would expect this target to be positive, as Whois.net has it registered to a UK company and address.

http://www.fishbournepreschool.org.uk/

Unless the whois check is a cron job, and so hasn't happened yet (I only just created the target.)

http://www.webarchive.org.uk/actdev/targets/2#crawlpermission

peterwebster commented 9 years ago

Otherwise, the TLD and hosting check seems to be working well.

peterwebster commented 9 years ago

A question for @anjackson .

@kinmanli discussed the logic here in relation to multiple URLs - that all must pass a test, but not necessarily the same one. (so, a co.uk URL passes one test, but a .com might pass the registration or hosting test.)

This is fine for the three automated tests, but to which of the URLs do the three manual tests refer ? If a user is adding multiple URLs to capture the same "site", then I think we are assuming that they are certifying that the Professional Judgement and the Correspondence tests are applicable to the whole site (all the seed URLs), and probably the Postal Address one as well.

I'm OK with this, but we should be sure that we agree.

peterwebster commented 9 years ago

@anjackson it will affect the logic in the code if this is really what we mean, ie.

If PostalAddress, ProfJudgement and Correspondence are all NO, then all URLs must pass one of TLD/Reg/Hosting.

If one of PostalAddress, ProfJudgment, Correspondence is Yes, then all URLs are clear.

kinmanli commented 9 years ago

@peterwebster I just created the Target http://www.fishbournepreschool.org.uk/ manually and get the following:

2015-02-02 10-12-42_w3act

kinmanli commented 9 years ago

@peterwebster @anjackson

These are the steps to get a +/-. If the previous step is already a "+" then it doesn't need to check the rest of the steps.

1) If (all) URL(s) is "TopLevelDomain" (uk/london/scot) - "+" 2) If (all) URL(s) is "UkHosting" (geoip) - "+" 3) If (all) URL(s) is "UkRegistration (whois) - "+" 4) if professionalJudgement is True OR ukPostalAddress is True OR viaCorrespondence is True - "+"

peterwebster commented 9 years ago

@kinmanli I'm not sure why this one, with the same seed URL, shows a different result:

http://www.webarchive.org.uk/actdev/targets/2

peterwebster commented 9 years ago

Will discuss the + / - logic tomorrow with @anjackson

anjackson commented 9 years ago

In pseudo-code:

# Count the number of the URLs for this Target that meet one of the automatic acceptance criteria:
int inScopeUrls = 0;
for url in urls:
    if( TopLevelDomain(url) OR 
         UkHosting(url) OR 
         TopLevelDomain(url) ) inScopeUrls += 1;
# Do all the URLs meet at least on of the automated criteria?
if( inScopeUrls == length(urls) ) {
     return TRUE;
} else if ( professionalJudgement OR ukPostalAddress OR viaCorrespondence ) {
     return TRUE;
} else {
     return FALSE;
}

@PsypherPunk does that look right?

PsypherPunk commented 9 years ago

I think the logic is correct. Although if any of the manual (professionalJudgement OR ukPostalAddress OR viaCorrespondence) flags are set you might as well skip the checks and return True.

Similary, if any of the URLs don't meet automated acceptance criteria (UKTopLevelDomain(url) OR UkHosting(url) OR TopLevelDomain(url)), the whole Target is ruled out of scope so there's no need to check the rest.

Although on this last point, do we feed this back to the Target owner?

kinmanli commented 9 years ago

The code is something like this at present

https://github.com/ukwa/w3act/blob/schemarefactor/app/models/Target.java#L1841-L1888

        boolean res = false;

        if (!res) {
            res = this.isTopLevelDomain;
        }
        if (!res) {
            res = this.isUkHosting;
        }
        if (!res) {
            res = this.isUkRegistration;
        }
        if (!res) {

            if (BooleanUtils.isTrue(this.professionalJudgement) || BooleanUtils.isTrue(this.ukPostalAddress) || BooleanUtils.isTrue(this.viaCorrespondence)) {
                res = true;
        }
     }
PsypherPunk commented 9 years ago

The code as it stands could (I think) be replaced by:

if (this.checkManualScope())
    return true;

return (this.isTopLevelDomain || this.isUkHosting || this.isUkRegistration);

However, it doesn't seem to check on a per-URL basis but instead at the Target level (i.e.this). While the checkManualScope() can essentially override the rest at theTarget level, those latter methods (isTopLevelDomain, isUkHosting and isUkRegistration) need to take every URL for that Target into account.

kinmanli commented 9 years ago

@PsypherPunk The per URL check is done on the save.

Give this a once over

https://github.com/ukwa/w3act/blob/schemarefactor/app/uk/bl/scope/Scope.java#L759-L787

    public boolean isUkHosting(Target target) {
        for (FieldUrl fieldUrl : target.fieldUrls) {
            if (!this.checkGeoIp(fieldUrl.url)) return false;
        }
        return true;
    }

    public boolean isTopLevelDomain(Target target) throws WhoisException, MalformedURLException, URISyntaxException {
        for (FieldUrl fieldUrl : target.fieldUrls) {
            URL uri = new URI(fieldUrl.url).normalize().toURL();
            String url = uri.toExternalForm();
            if (!url.contains(UK_DOMAIN) && !url.contains(LONDON_DOMAIN) && !url.contains(SCOT_DOMAIN)) return false;
        }
        return true;
    }

    public boolean isUkRegistration(Target target) throws WhoisException {
        for (FieldUrl fieldUrl : target.fieldUrls) {
            if (!checkWhois(fieldUrl.url, target)) return false;
        }
        return true;
    }
peterwebster commented 9 years ago

To @PsypherPunk 's point about informing the target owner if the target they have just created fails the three automated tests: it wasn't identified as a requirement, but it perhaps should have been. We'd need to think about how best to do so, in order to prompt the user to look into the three manual tests. @anjackson

PsypherPunk commented 9 years ago

If they're updated on save then yes, I think that should suffice.

The only one that's questionable is isUkHosting which might change whenever we update the geo-IP database. Is there a means to trigger a recheck if that happens?

kinmanli commented 9 years ago

@PsypherPunk So as it stands does that mean besides a bit of code moving. Everything is good?

PsypherPunk commented 9 years ago

Yes, I think it's fine.

I'm going to address the hosting/IP issue in another ticket (might not necessarily be a solution intenral to w3act).

peterwebster commented 9 years ago

The three manual tests are reflected correctly at: http://www.webarchive.org.uk/actdev/targets/16936

peterwebster commented 9 years ago

UK TLD test being correctly applied to all URLs, and returning false if all URLs are not UK TLD.

peterwebster commented 9 years ago

UK Host Test and UK TLD both correctly returning false in this target, where one URL passes, and the second does not: http://www.webarchive.org.uk/actdev/targets/16937#crawlpermission

peterwebster commented 9 years ago

Just one question for me @kinmanli : when does the UK Registration check take place ? I ask because I would expect this target to pass, but it fails now, immediately after creating the record http://www.webarchive.org.uk/actdev/targets/16939

If you look up above, you yourself once created a target for this (now lost) which passed, and I know that it hasn't moved in the interim.

kinmanli commented 9 years ago

recreate this

kinmanli commented 9 years ago

@peterwebster. I can't seem to recreate it with the same data.

local

screen shot 2015-02-19 at 23 11 17

actdev

screen shot 2015-02-19 at 23 12 00

peterwebster commented 9 years ago

@kinmanli OK, try this one: this record has been imported from Andy's ACT, and should definitely pass the Reg check: http://www.webarchive.org.uk/actdev/targets/1814#crawlpermission

I opened up the record and edited and saved, and the check still fails. Something going on in /actdev ?

kinmanli commented 9 years ago

@peterwebster @anjackson

This is what originally happens when it cannot find a whois server for `http://www.westsussex.gov.uk/' and returns a false/no

Ran a test with:

JRubyWhois whoIs = new JRubyWhois();
WhoisResult whoIsRes = whoIs.lookup("http://www.westsussex.gov.uk");

Whois::ServerNotFound: Unable to find a WHOIS server for `http://www.westsussex.gov.uk/'
    guess at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/server.rb:244
   lookup at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/client.rb:92
  timeout at org/jruby/ext/timeout/Timeout.java:127
   lookup at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/client.rb:91
   (root) at jruby-whois.rb:10
anjackson commented 9 years ago

Hm, I would have expected the call to be more like:

JRubyWhois whoIs = new JRubyWhois();
WhoisResult whoIsRes = whoIs.lookup("www.westsussex.gov.uk");

i.e. just passing in the host, not the URL.

kinmanli commented 9 years ago

Passing in the host gives:

checkWhois: uk.bl.wa.whois.JRubyWhois@157fd320 www.westsussex.gov.uk
Whois::AttributeNotImplemented: Unable to find a parser for property `registrant_contacts'
  delegate_property_to_parsers at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:316
           registrant_contacts at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:282
                method_missing at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:299
                        (root) at jruby-whois.rb:15
anjackson commented 9 years ago

That looks fairly serious. Do we know if this has ever worked?

kinmanli commented 9 years ago

@anjackson yes it used to work and worked for other 'URLs' but right now the Test I wrote is failing.

anjackson commented 9 years ago

@kinmanli And the test is running with the full, normal classpath, right?

What about just passing in "westsussex.gov.uk"? I notice the W3ACT code attempts to strip down to the domain name by removing any leading www.

kinmanli commented 9 years ago

@anjackson normal classpath within eclipse

Two URL tests

checkWhois: uk.bl.wa.whois.JRubyWhois@6b480318 islamic-relief.org.uk
isUKRegistrant: true
checkWhois: uk.bl.wa.whois.JRubyWhois@3e5e1a04 westsussex.gov.uk
Whois::AttributeNotImplemented: Unable to find a parser for property `registrant_contacts'
  delegate_property_to_parsers at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:316
           registrant_contacts at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:282
                        (root) at jruby-whois.rb:15

Also, using "www.westsussex.gov.uk"

checkWhois: uk.bl.wa.whois.JRubyWhois@5b6a0720 www.westsussex.gov.uk
Whois::AttributeNotImplemented: Unable to find a parser for property `registrant_contacts'
  delegate_property_to_parsers at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:316
           registrant_contacts at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:282
                        (root) at jruby-whois.rb:15
anjackson commented 9 years ago

Ah, okay, so it can't cope with the gov.uk registrar, I guess. It can't be expected to cope with every registrar, I'm afraid. Unless this specific URL used to work and has not somehow changed registrar?

kinmanli commented 9 years ago

@peterwebster did http://www.webarchive.org.uk/actdev/targets/1814 's uk registration used to be 'true'?

peterwebster commented 9 years ago

@kinmanli I don't know, haven't used this example before

kinmanli commented 9 years ago

@peterwebster @anjackson so how do you want to progress this if it cannot handle .gov sites?

peterwebster commented 9 years ago

@kinmanli @anjackson we can discuss this at 2; but perhaps a fail which should have passed is acceptable - what we can't have is a pass that should fail.