ufal / lindat-repository-obsolete

LINDAT/CLARIN repository for linguistics (http://lindat.cz)

Checklinks curation fails for legitimate URLs #6

Closed loganathanspr closed 10 years ago

loganathanspr commented 10 years ago

For example, this item has 2 links and both are working: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1323

https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1120
https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1124
https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1149
https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1164

mjos commented 10 years ago

This behavior was caused by a missing User-Agent HTTP header. A User-Agent header will be added with its value set to "DSpace Link Checker".
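
For illustration only, here is a minimal Python sketch of a link check that sends an explicit User-Agent. It is not the DSpace code (which is Java); the function names `build_check_request` and `check_link` and the 5-second default timeout are assumptions made for this example. Only the header value comes from the comment above.

```python
import urllib.error
import urllib.request

# Header value quoted in the comment above.
USER_AGENT = "DSpace Link Checker"


def build_check_request(url, user_agent=USER_AGENT):
    """Build a GET request that identifies the client via User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def check_link(url, timeout=5.0):
    """Return the HTTP status code for url, or None on a network-level failure.

    The timeout default is an assumption for this sketch.
    """
    request = build_check_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 403 or 404).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None
```

Some servers reject requests that carry no User-Agent or a generic library default, so a legitimate URL can look broken to a checker even though it opens fine in a browser.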

mjos commented 10 years ago

Fixed in commit b9061602598a6c47f398f115e7685bcb9521eb93.

loganathanspr commented 10 years ago

I compared the production curation results with the results from the current curation code. There are some issues with it.

For the following item I think the old code in production is right: the URL's page is not found, so the status should be FAILED instead of WARNING.

Result: 'Item: 11372/LRT-1055 [https://ufal-point-dev.ms.mff.cuni.cz/mj_test/xmlui/admin/item?itemID=2435] has 2 urls to check...

Item: 11372/LRT-1055 has 2 urls to check...

I think both versions of the curation code get it wrong. The URL exists and is active.

Result: 'Item: 11372/LRT-1124 [https://ufal-point-dev.ms.mff.cuni.cz/mj_test/xmlui/admin/item?itemID=2504] has 2 urls to check...

Item: 11372/LRT-1124 has 2 urls to check...

I think both versions of the curation code get it wrong. The URL exists and is active.

Result: 'Item: 11372/LRT-1149 [https://ufal-point-dev.ms.mff.cuni.cz/mj_test/xmlui/admin/item?itemID=2529] has 2 urls to check...

Item: 11372/LRT-1149 has 2 urls to check...

mjos commented 10 years ago

Quick analysis is as follows:

mjos commented 10 years ago

Fixed in 0ddc50ce690d3d8212d49d8800ba7ca99d7d7c9c.

The last problem was caused by the server de.thefreedictionary.com, which blocks clients whose User-Agent header contains the word "Checker". The word was replaced with "Validator", so the User-Agent is now "DSpace Link Validator".

Timeouts were increased to 1s for connecting and 3s for reading.
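
As a rough sketch of what separate connect and read timeouts mean, the Python snippet below opens a TCP connection with one limit and then applies a different limit to later reads. The helper name `open_with_timeouts` is hypothetical and this is not the DSpace implementation; only the 1s and 3s values come from the comment above.

```python
import socket

CONNECT_TIMEOUT = 1.0  # seconds allowed for establishing the connection
READ_TIMEOUT = 3.0     # seconds allowed for each blocking read afterwards


def open_with_timeouts(host, port):
    """Connect with CONNECT_TIMEOUT, then switch the socket to READ_TIMEOUT."""
    sock = socket.create_connection((host, port), timeout=CONNECT_TIMEOUT)
    # From here on, recv() raises a timeout error if no data arrives in time.
    sock.settimeout(READ_TIMEOUT)
    return sock
```

Keeping the two limits separate lets a checker give up quickly on unreachable hosts while still tolerating servers that accept the connection but respond slowly.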

Redirection loops are now detected. The WARNING status is no longer displayed for redirected URLs; the status of the last redirection is reported instead. The information about the redirection is still preserved for non-handle links.
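
The redirect handling described here could look roughly like the Python sketch below. It is an illustration under assumptions, not the committed code: `resolve`, the injected `fetch` callable (returning a status code and an optional Location value), the `max_hops` limit, and the "LOOP" / "TOO_MANY_REDIRECTS" markers are all hypothetical names.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve(url, fetch, max_hops=10):
    """Follow redirects manually and report the status of the last hop.

    fetch(url) must return (status_code, location_or_None).
    Returns (status, chain), where chain lists every URL visited in order.
    status is "LOOP" if a URL repeats, or "TOO_MANY_REDIRECTS" if
    max_hops is exhausted; otherwise it is the final HTTP status code.
    """
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return status, chain
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        if url in seen:
            return "LOOP", chain + [url]
        seen.add(url)
        chain.append(url)
    return "TOO_MANY_REDIRECTS", chain
```

With a stubbed `fetch` this can be exercised without any network access:

```python
pages = {
    "http://a.example/": (302, "/moved"),
    "http://a.example/moved": (200, None),
}
status, chain = resolve("http://a.example/", lambda u: pages[u])
# status is the final hop's code; chain keeps the redirect trail for reporting
```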

loganathanspr commented 10 years ago

Thanks. It now works for those problematic entries.

loganathanspr commented 10 years ago

I just spotted some more problematic links:

Item: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-4914-D

Item: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-B098-5

Item: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-4904-2

Item: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-4900-A

Item: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0015-8DAF-4

Item: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-9551-4

mjos commented 10 years ago

I'm not able to reproduce the problem on the production environment. If I log in and use Edit Item -> Curate -> Fast Check Metadata Links, the given licenses are correctly verified with an HTTP status of 200 or 302. Does the problem persist?

loganathanspr commented 10 years ago

It works now. I don't know why it didn't work earlier. Most of the exceptions have been handled. If we see further exceptions in the URL checks, this issue can be reopened.