Closed peterwebster closed 9 years ago
License tab is highlighted based on having a "granted" license. Same as previous w3act https://github.com/ukwa/w3act/blob/master/app/views/targets/view.scala.html#L96-L100
NPLD tab highlighted - based on the previous rules
Hi @kinmanli : in the licensing tab, it needs to check the License field (not the Open UKWA License Requests field), because some of the data in the License field is either inherited from legacy systems or input manually. The Requests fields will only have data for targets where the license process native to W3ACT has been used. So: green plus sign when the License field is not null.
See http://www.webarchive.org.uk/actdev/targets/15462#licensing
@peterwebster it was highlighting it based on the license fields not being null and status being "Granted". I've removed the status check and it works now.
fix should be in next deploy
Hi @kinmanli : I think this got lost along the way, and so /actbeta wasn't the right model to follow in this case.
There are:
(i) targets with UKWA Selective Archive licences (data imported from SPT, the legacy system.) There won't be any more of these. (ii) OpenUKWA Licences - obtained by the process native to W3ACT (iii) licenses like the HLF WWI, which are obtained by a agreement outside ACT and applied manually by the Archivist to a set of targets to which it applies. There may be other such agreements like this one.
Yes, it is in principle possible to have a target to which more than one licence applies - probably where we either have (i) or obtain (ii) individually, and then subsequently strike an agreement that covers it. We wouldn't pursue (ii) if we already had (i).
It should only the Archivist who can edit the field, BTW (it may already be, I haven't checked)
It relates to (closed ) #176
Hi @kinmanli this one looks fine now: http://www.webarchive.org.uk/actdev/targets/15462#licensing
but there's still an issue here, with the NPLD + / i - it should be negative. http://www.webarchive.org.uk/actdev/targets/16117#crawlpermission
@peterwebster please can you post the conditions required for the NPLD +/-? I am just going by previous code and don't know if it is correct?
Thanks
Which is a bit terrifying, as implementing the NPLD criteria is critical logic. Basically, there are a list of distinct conditions of which any one must be true. Hopefully @peterwebster can verify, but of the top of my head, I believe they are:
The crucial point is that it is a long set of OR operations. i.e. if it's *.uk then it's in scope, OR if it's .london, it's in scope. For multiple URLs, all of them must meet one of the conditions.
@anjackson @kinmanli
The four above, plus: (5) Postal Address (set manually to Yes by the user) (6) By Correspondence (also set manually)
In the UI, if any of the three manual tests (these two or professional judgement) is set to Yes, then the following field (which provide the evidence) are then required.
ie. if Postal Address is Yes, then Postal Address URL is required (doesn't need validating) etc
Fixed and checked in.
1) Domain name ends in known UK territory (.uk, .london, and .scot I think?) - target.isTopLevelDomain() 2) Is hosted in UK (automatic via GeoIP DB lookup) - target.isUkHosting() 3) Is registered in UK (all that whois stuff) - target.isUkRegistration() these 3 under target.checkManualScope() 4) Professional Judgement. 5) Postal Address (set manually to Yes by the user) 6) By Correspondence (also set manually)
I've also written a few tests to test valid/invalid multiple URLs (UkHostingTest.java, TopLevelDomainTest
@kinmanli to implement the validation on the three text field for the three manual test
@peterwebster to test the logic
OK: the + / - indicators working well for the three manual tests. http://www.webarchive.org.uk/actdev/targets/1
@kinmanli could you double-check that the WHois check for UK registration is working correctly?
I would expect this target to be positive, as Whois.net has it registered to a UK company and address.
http://www.fishbournepreschool.org.uk/
Unless the whois check is a cron job, and so hasn't happened yet (I only just created the target.)
http://www.webarchive.org.uk/actdev/targets/2#crawlpermission
Otherwise, the TLD and hosting check seems to be working well.
A question for @anjackson .
@kinmanli discussed the logic here in relation to multiple URLs - that all must pass a test, but not necessarily the same one. (so, a co.uk URL passes one test, but a .com might pass the registration or hosting test.)
This is fine for the three automated tests, but to which of the URLs do the three manual tests refer ? If a user is adding multiple URLs to capture the same "site", then I think we are assuming that they are certifying that the Professional Judgement and the Correspondence tests are applicable to the whole site (all the seed URLs), and probably the Postal Address one as well.
I'm OK with this, but we should be sure that we agree.
@anjackson it will affect the logic in the code if this is really what we mean, ie.
If PostalAddress, ProfJudgement and Correspondence are all NO, then all URLs must pass one of TLD/Reg/Hosting.
If one of PostalAddress, ProfJudgment, Correspondence is Yes, then all URLs are clear.
@peterwebster I just created the Target http://www.fishbournepreschool.org.uk/ manually and get the following:
@peterwebster @anjackson
These are the steps to get a +/-. If the previous step is already a "+" then it doesn't need to check the rest of the steps.
1) If (all) URL(s) is "TopLevelDomain" (uk/london/scot) - "+" 2) If (all) URL(s) is "UkHosting" (geoip) - "+" 3) If (all) URL(s) is "UkRegistration (whois) - "+" 4) if professionalJudgement is True OR ukPostalAddress is True OR viaCorrespondence is True - "+"
@kinmanli I'm not sure why this one, with the same seed URL, shows a different result:
Will discuss the + / - logic tomorrow with @anjackson
In pseudo-code:
# Count the number of the URLs for this Target that meet one of the automatic acceptance criteria:
int inScopeUrls = 0;
for url in urls:
if( TopLevelDomain(url) OR
UkHosting(url) OR
TopLevelDomain(url) ) inScopeUrls += 1;
# Do all the URLs meet at least on of the automated criteria?
if( inScopeUrls == length(urls) ) {
return TRUE;
} else if ( professionalJudgement OR ukPostalAddress OR viaCorrespondence ) {
return TRUE;
} else {
return FALSE;
}
@PsypherPunk does that look right?
I think the logic is correct. Although if any of the manual (professionalJudgement OR ukPostalAddress OR viaCorrespondence
) flags are set you might as well skip the checks and return True
.
Similary, if any of the URLs don't meet automated acceptance criteria (UKTopLevelDomain(url) OR UkHosting(url) OR TopLevelDomain(url)
), the whole Target is ruled out of scope so there's no need to check the rest.
Although on this last point, do we feed this back to the Target owner?
The code is something like this at present
https://github.com/ukwa/w3act/blob/schemarefactor/app/models/Target.java#L1841-L1888
boolean res = false;
if (!res) {
res = this.isTopLevelDomain;
}
if (!res) {
res = this.isUkHosting;
}
if (!res) {
res = this.isUkRegistration;
}
if (!res) {
if (BooleanUtils.isTrue(this.professionalJudgement) || BooleanUtils.isTrue(this.ukPostalAddress) || BooleanUtils.isTrue(this.viaCorrespondence)) {
res = true;
}
}
The code as it stands could (I think) be replaced by:
if (this.checkManualScope())
return true;
return (this.isTopLevelDomain || this.isUkHosting || this.isUkRegistration);
However, it doesn't seem to check on a per-URL basis but instead at the Target level (i.e.this
). While the checkManualScope()
can essentially override the rest at theTarget level, those latter methods (isTopLevelDomain
, isUkHosting
and isUkRegistration
) need to take every URL for that Target into account.
@PsypherPunk The per URL check is done on the save.
Give this a once over
https://github.com/ukwa/w3act/blob/schemarefactor/app/uk/bl/scope/Scope.java#L759-L787
public boolean isUkHosting(Target target) {
for (FieldUrl fieldUrl : target.fieldUrls) {
if (!this.checkGeoIp(fieldUrl.url)) return false;
}
return true;
}
public boolean isTopLevelDomain(Target target) throws WhoisException, MalformedURLException, URISyntaxException {
for (FieldUrl fieldUrl : target.fieldUrls) {
URL uri = new URI(fieldUrl.url).normalize().toURL();
String url = uri.toExternalForm();
if (!url.contains(UK_DOMAIN) && !url.contains(LONDON_DOMAIN) && !url.contains(SCOT_DOMAIN)) return false;
}
return true;
}
public boolean isUkRegistration(Target target) throws WhoisException {
for (FieldUrl fieldUrl : target.fieldUrls) {
if (!checkWhois(fieldUrl.url, target)) return false;
}
return true;
}
To @PsypherPunk 's point about informing the target owner if the target they have just created fails the three automated tests: it wasn't identified as a requirement, but it perhaps should have been. We'd need to think about how best to do so, in order to prompt the user to look into the three manual tests. @anjackson
If they're updated on save then yes, I think that should suffice.
The only one that's questionable is isUkHosting
which might change whenever we update the geo-IP database. Is there a means to trigger a recheck if that happens?
@PsypherPunk So as it stands does that mean besides a bit of code moving. Everything is good?
Yes, I think it's fine.
I'm going to address the hosting/IP issue in another ticket (might not necessarily be a solution intenral to w3act).
The three manual tests are reflected correctly at: http://www.webarchive.org.uk/actdev/targets/16936
UK TLD test being correctly applied to all URLs, and returning false if all URLs are not UK TLD.
UK Host Test and UK TLD both correctly returning false in this target, where one URL passes, and the second does not: http://www.webarchive.org.uk/actdev/targets/16937#crawlpermission
Just one question for me @kinmanli : when does the UK Registration check take place ? I ask because I would expect this target to pass, but it fails now, immediately after creating the record http://www.webarchive.org.uk/actdev/targets/16939
If you look up above, you yourself once created a target for this (now lost) which passed, and I know that it hasn't moved in the interim.
recreate this
@peterwebster. I can't seem to recreate it with the same data.
local
actdev
@kinmanli OK, try this one: this record has been imported from Andy's ACT, and should definitely pass the Reg check: http://www.webarchive.org.uk/actdev/targets/1814#crawlpermission
I opened up the record and edited and saved, and the check still fails. Something going on in /actdev ?
@peterwebster @anjackson
This is what originally happens when it cannot find a whois server for `http://www.westsussex.gov.uk/' and returns a false/no
Ran a test with:
JRubyWhois whoIs = new JRubyWhois();
WhoisResult whoIsRes = whoIs.lookup("http://www.westsussex.gov.uk");
Whois::ServerNotFound: Unable to find a WHOIS server for `http://www.westsussex.gov.uk/'
guess at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/server.rb:244
lookup at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/client.rb:92
timeout at org/jruby/ext/timeout/Timeout.java:127
lookup at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/client.rb:91
(root) at jruby-whois.rb:10
Hm, I would have expected the call to be more like:
JRubyWhois whoIs = new JRubyWhois();
WhoisResult whoIsRes = whoIs.lookup("www.westsussex.gov.uk");
i.e. just passing in the host, not the URL.
Passing in the host gives:
checkWhois: uk.bl.wa.whois.JRubyWhois@157fd320 www.westsussex.gov.uk
Whois::AttributeNotImplemented: Unable to find a parser for property `registrant_contacts'
delegate_property_to_parsers at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:316
registrant_contacts at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:282
method_missing at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:299
(root) at jruby-whois.rb:15
That looks fairly serious. Do we know if this has ever worked?
@anjackson yes it used to work and worked for other 'URLs' but right now the Test I wrote is failing.
@kinmanli And the test is running with the full, normal classpath, right?
What about just passing in "westsussex.gov.uk"? I notice the W3ACT code attempts to strip down to the domain name by removing any leading www.
@anjackson normal classpath within eclipse
Two URL tests
checkWhois: uk.bl.wa.whois.JRubyWhois@6b480318 islamic-relief.org.uk
isUKRegistrant: true
checkWhois: uk.bl.wa.whois.JRubyWhois@3e5e1a04 westsussex.gov.uk
Whois::AttributeNotImplemented: Unable to find a parser for property `registrant_contacts'
delegate_property_to_parsers at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:316
registrant_contacts at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:282
(root) at jruby-whois.rb:15
Also, using "www.westsussex.gov.uk"
checkWhois: uk.bl.wa.whois.JRubyWhois@5b6a0720 www.westsussex.gov.uk
Whois::AttributeNotImplemented: Unable to find a parser for property `registrant_contacts'
delegate_property_to_parsers at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:316
registrant_contacts at jar:file:C:/projects/ukwa/w3actSchemaRefactor/lib/jruby-whois-3.4.2.2-SNAPSHOT.jar!/gems/whois-3.4.2/lib/whois/record/parser.rb:282
(root) at jruby-whois.rb:15
Ah, okay, so it can't cope with the gov.uk registrar, I guess. It can't be expected to cope with every registrar, I'm afraid. Unless this specific URL used to work and has not somehow changed registrar?
@peterwebster did http://www.webarchive.org.uk/actdev/targets/1814 's uk registration used to be 'true'?
@kinmanli I don't know, haven't used this example before
@peterwebster @anjackson so how do you want to progress this if it cannot handle .gov sites?
@kinmanli @anjackson we can discuss this at 2; but perhaps a fail which should have passed is acceptable - what we can't have is a pass that should fail.
Both the NLPD and UKWA licensing tabs have a + / - in the tab.
These indicate when:
(i) at least one NPLD criterion is met;
and
(ii) at least one licence is available ie. License field is not null.
Neither of these is working at the minute: http://www.webarchive.org.uk/actdev/targets/16117