ukwa / w3act

w3act is an annotation and curation tool for building web archive collections
Apache License 2.0

Bugs in data import from 'Andy's ACT' #74

Closed peterwebster closed 10 years ago

peterwebster commented 10 years ago

As raised by Rav, there are some issues with data being successfully imported from Andy's ACT to /actdev on each deployment. Gil, you and I discussed some sort of visual check table-to-table to see how widespread the issue is, and whether it affects other fields that we haven't yet spotted. @anjackson

The details are:

(i) Relating to NPLD scope: LD criteria notes, Postal Address URL and Notes (under Via Correspondence) are not being migrated.
E.g. in the NPLD scope tab, the Postal Address URL and the Notes (under Via Correspondence) have not been migrated from Andy’s ACT record http://www.webarchive.org.uk/act/node/8880 to http://www.webarchive.org.uk/actdev/targets/act-8880

(ii) In the Crawl Policy and Schedule tab, all values for the Scope field seem to be imported as ‘Just this URL’. They should be the same as in Andy’s ACT, where most records have the value ‘All URLs that start like this’.

The Depth setting also doesn’t look like it’s being carried over.

(iii) Crawl Start Date seems to show a 10-digit number, e.g. Academia Rossica: http://www.webarchive.org.uk/actdev/targets/act-10508/edit
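
A 10-digit number in a date field is usually a Unix timestamp in seconds that has not been formatted for display. A minimal Python sketch of the conversion, assuming that is what is happening here; the function name and the sample value are made up for illustration and are not taken from the record above:

```python
from datetime import datetime, timezone


def format_crawl_date(raw):
    """Format a Unix timestamp in seconds (an assumption) as YYYY-MM-DD in UTC."""
    return datetime.fromtimestamp(int(raw), tz=timezone.utc).strftime("%Y-%m-%d")


# Hypothetical 10-digit value, for illustration only.
print(format_crawl_date("1388534400"))  # prints 2014-01-01
```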

rgraf commented 10 years ago

This issue is partially solved, but some questions remain.

I have examined this issue for this record: http://www.webarchive.org.uk/act/node/8880.json

I have fixed the issue of Postal Address URL (point (i)) and the display of start and end dates (point (iii)).

Regarding the other aspects of point (i) and point (ii), it appears that the presentation of the object in the old version of ACT does not match the JSON export object. For example:

I have therefore linked the presentation of ‘field_notes’ to the field ‘value’ in the database, since that seems to be the correct field according to its content.

Regarding these values, should we understand that there is a mapping between values stored in the database and those displayed? In this case we need some documentation of these mappings.

anjackson commented 10 years ago

The mappings are:

resource|Just this URL.
plus1|This URL plus any directly linked resources.
root|All URLs that start like this.
subdomains|All URLs that match this host or any subdomains.
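
For illustration, the scope mapping above could be held as a simple lookup table. This is only a sketch in Python; the names `SCOPE_LABELS` and `scope_label` are hypothetical and not taken from the w3act code:

```python
# Stored value -> label shown in the UI, copied from the list above.
SCOPE_LABELS = {
    "resource": "Just this URL.",
    "plus1": "This URL plus any directly linked resources.",
    "root": "All URLs that start like this.",
    "subdomains": "All URLs that match this host or any subdomains.",
}


def scope_label(stored_value):
    """Return the display label, or the raw value if it is not in the mapping."""
    return SCOPE_LABELS.get(stored_value, stored_value)


print(scope_label("root"))  # prints: All URLs that start like this.
```

Falling back to the raw value keeps an unmapped entry visible rather than silently showing a default such as ‘Just this URL’.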

rgraf commented 10 years ago

I have implemented and tested scope mapping. In order to complete this issue I would need also similar mappings for depth.

anjackson commented 10 years ago

Depth:

capped|Capped (small - 500MB)
capped_large|Capped (large - 2GB)
deep|Uncapped
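
Since the options are given in `key|label` form, they can also be parsed directly rather than retyped. A rough Python sketch, with `parse_options` and `DEPTH_LABELS` as hypothetical names:

```python
# The depth options exactly as listed above, one "key|label" pair per line.
DEPTH_OPTIONS = """\
capped|Capped (small - 500MB)
capped_large|Capped (large - 2GB)
deep|Uncapped
"""


def parse_options(text):
    """Parse 'key|label' lines into a dict, skipping blank lines."""
    options = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first pipe only, so a label may itself contain '|'.
        key, label = line.split("|", 1)
        options[key.strip()] = label.strip()
    return options


DEPTH_LABELS = parse_options(DEPTH_OPTIONS)
print(DEPTH_LABELS["capped_large"])  # prints: Capped (large - 2GB)
```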

rgraf commented 10 years ago

I have implemented all fields and mappings for target view and also similarly for instance view. I believe the issue is solved. Please let me know if there are any additional problems associated with this issue.

peterwebster commented 10 years ago

Thanks Roman; closing ticket.

GilHoggarth commented 10 years ago

This github ticket had two significant parts: there were some data import details that were missing, and there was the request to examine the original /act data against the new /actdev data.

As far as I understand, Roman has implemented the necessary changes for the first part (though I personally haven't checked this). Separately I've been looking into the data comparison. This has been difficult, and I will attempt to explain next.

GilHoggarth commented 10 years ago

The data held in "Andy's ACT", that is the currently live /act service, is stored in a database managed by a service layer. The database structure was defined by the service layer, not by Andy or anyone else. Plus, the ACT data is stored along with the data for the service layer (i.e., data which has nothing to do with ACT). Consequently, the original ACT data and the data imported into w3act are very difficult to compare at a database level.

If this comparison is still needed, a 'like-for-like' mapping between the original ACT data and the w3act data would be needed. As I have been unable to find such a mapping, I will leave this ticket closed. Any data discrepancies found will be added as individual field-level github tickets.

GilHoggarth commented 10 years ago

Ticket re-opened as I can do a comparison of data exported from "Andy's ACT" and /actdev. Roger is gathering the data exports for me; I'll then do the comparison.
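
A comparison of this kind could be sketched roughly as follows in Python, assuming both exports have already been loaded into dictionaries keyed by record ID. The function name, field names and sample records are invented for illustration and do not reflect the actual export format:

```python
def diff_exports(old_records, new_records, fields):
    """Return (record_id, field, old_value, new_value) for every mismatch.

    A record missing from the new export is reported with field None.
    """
    mismatches = []
    for record_id, old in old_records.items():
        new = new_records.get(record_id)
        if new is None:
            mismatches.append((record_id, None, "present", "missing"))
            continue
        for field in fields:
            if old.get(field) != new.get(field):
                mismatches.append((record_id, field, old.get(field), new.get(field)))
    return mismatches


# Invented sample data: one record whose scope was imported incorrectly.
old = {"example-1": {"scope": "root", "depth": "capped"}}
new = {"example-1": {"scope": "resource", "depth": "capped"}}
print(diff_exports(old, new, ["scope", "depth"]))
# prints: [('example-1', 'scope', 'root', 'resource')]
```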

anjackson commented 10 years ago

Now testing content via the exports, results being noted in #31.

peterwebster commented 10 years ago

OK, will close this if it is now being covered at #31