EG-tech opened this issue 2 years ago
Seeing/learning that error code 429 has to do with too many requests hitting the server from the client: is there a way to rate limit the requests from roy? Or another way to get a Wikidata signature file to start with?
That's exactly it @EG-tech. Tyler had the same issue a while back (via an email request, so not on GitHub). We might need to put it into the FAQ.
Some notes on what I wrote to Tyler:
The Wikidata documentation on the query service (WDQS) is below, but it's not very clear, i.e. it talks about processing time, not how that translates to large queries:
https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits
I have known there's a risk of this happening, though I can't quantify for how many users or when. There are times, for example when I have been testing, where I have run the query upwards of 30 times in a day.
We set a custom User-Agent header on the request, which should be recognized by WDQS and mitigate this issue somewhat; WDQS is friendlier to known user-agents than to unknown ones, for example.
Long-term, something approaching rate limiting may work. Right now it's just a single request asking for a lot of data.
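To illustrate the idea (a sketch only, not something roy does today), the load could in principle be spread by paging the results over a stable ordering with LIMIT/OFFSET, so each request asks for much less data:

```sparql
# Hypothetical paged harvest: repeat the same query per request,
# stepping OFFSET forward by the page size each time.
SELECT DISTINCT ?uri ?puid ?extension ?mimetype
WHERE
{
  ?uri wdt:P31/wdt:P279* wd:Q235557 .     # instances of (subclasses of) file format
  OPTIONAL { ?uri wdt:P2748 ?puid. }      # PRONOM PUID
  OPTIONAL { ?uri wdt:P1195 ?extension. } # file extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype. }  # IANA media type
}
ORDER BY ?uri # deterministic ordering so pages don't overlap
LIMIT 5000    # page size
OFFSET 0      # 0, 5000, 10000, ... on successive requests
```

The trade-off is several round trips instead of one, plus the usual caveats about OFFSET paging against a live endpoint.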
In the short-to-medium term, this pull request should mean you can grab an identifier from Richard's itforarchivists server, which will let you get up and running: https://github.com/richardlehane/siegfried/pull/178 (the PR just needs review (and fixes) and merging).
EDIT: NB. In Tyler's case, he just tried it again later in the day or the next morning and it worked.
thanks @ross-spencer!! that all makes sense, thanks for confirming and I'll play with your suggestion when I get the chance. amazing work!
ah, thanks @EG-tech :slightly_smiling_face:
Update: the `-update` flag now supports Wikidata, which should provide a workaround for most people facing this issue; there's an underlying reliability issue that might still be solved here, as per the above.
NB. Just to report, we are still seeing this issue in places. I haven't been able to determine when a harvest call is likely to be successful, other than that it seems to work better in Europe than on the US West Coast.
cc. @thorsted
Someone reached out at the last talk I gave about the Wikidata integration, specifically about long-running queries. I discovered this was because they run a mirror without timeouts, for a cost per query. Their service and another example are linked below:
(I don't think this is the way to go but it's useful to know about)
@anjackson just updated our SPARQL query on the digipres format explorer. The magic is in the `FILTER` expression, which cuts results from approximately 70,000 to 17,000. Worth a try to see if it improves performance?
SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
WHERE
{
  # Return records of type File Format or File Format Family (via instance or subclass chain):
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
  # Only return records that have at least one useful format identifier
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.
  OPTIONAL { ?uri wdt:P2748 ?puid. }      # PUID is used to map to PRONOM signatures
  OPTIONAL { ?uri wdt:P1195 ?extension. } # File extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype. }  # IANA Media Type
  OPTIONAL { ?uri p:P4152 ?object.               # Format identification pattern statement
    OPTIONAL { ?object pq:P3294 ?encoding. }     # We don't always have an encoding
    OPTIONAL { ?object ps:P4152 ?sig. }          # We always have a signature
    OPTIONAL { ?object pq:P2210 ?relativity. }   # Relativity to beginning or end of file
    OPTIONAL { ?object pq:P4153 ?offset. }       # Offset relative to the relativity
    OPTIONAL { ?object prov:wasDerivedFrom ?provenance.
      OPTIONAL { ?provenance pr:P248 ?reference;
                             pr:P813 ?date.
      }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY ?uri
@thorsted have you tried the custom SPARQL technique? https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#using-the-custom-wikibase-functionality-for-wikidata <-- any chance you could try the SPARQL above to see if it returns more reliably?
(I can create a test binary too)
via https://github.com/digipres/digipres.github.io/pull/48#issuecomment-2222350174
NB. Although this query needs a PUID, or MIME type, or extension, there might be Wikidata records without these, so maybe we need to add the signature in too... e.g. `FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|ps:P4152 [] }.`
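One wrinkle there: in the WDQS RDF mapping, `ps:` properties hang off statement nodes rather than items, so matching on `?uri` directly would need the `p:` form instead. Something like:

```sparql
# Also keep records whose only identifier is a signature statement.
# p:P4152 walks from the item to the statement node; ps:P4152 only
# appears on statement nodes, so it wouldn't match ?uri itself.
FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|p:P4152 [] }.
```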
Thanks @ross-spencer - FWIW I'm in the midst of writing things up and I'm not at all sure I'm quite there yet. The `wdt:P31/wdt:P279* wd:Q235557` pattern seems to be missing out some records (e.g. no `*.psd`!), and I'm seeing different variations in different places (`wdt:P31*/wdt:P279*`, `p:P31/ps:P31/wdt:P279*`) which I can't say I fully understand at this point. But the `FILTER` thing seems to help with the overall size/performance.
@anjackson there was some explanation of these patterns here https://github.com/ffdev-info/wikidp-issues/issues/24#issuecomment-1346662832 via @BertrandCaron that may be helpful?
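My rough reading of the three patterns, for what it's worth (treat with care):

```sparql
# 1. Items that are an instance of (a subclass of)* file format:
{ ?uri wdt:P31/wdt:P279* wd:Q235557 }

# 2. P31* allows zero or more instance-of hops, so this also matches the
#    classes themselves (anything P279* below file format), as well as
#    chains like instance-of/instance-of:
{ ?uri wdt:P31*/wdt:P279* wd:Q235557 }

# 3. p:P31/ps:P31 goes via the full statement node rather than the
#    "truthy" wdt:P31 triples, so it also sees statements that wdt:
#    hides (e.g. deprecated or non-best-rank ones):
{ ?uri p:P31/ps:P31/wdt:P279* wd:Q235557 }
```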
Re: the PSD issue, is this why you included the UNION with file format family? Did it work?
@ross-spencer Yes, adding that UNION brought in PSD, which is declared as an instance of `File Format Family` but not of `File Format` (other families appear to be explicitly declared as instances of both). But so did using `P31*` instead of `UNION`, as a `File Format Family` is an instance of a `File Format`. At the time of writing, `UNION` matches 69,961 (un-`FILTER`ed) records and `P31*` matches 70,363, so something else is going on too. This is what I'm attempting to write up.
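For reference, the two shapes being compared look roughly like this (with `wd:Q_FILE_FORMAT_FAMILY` as a placeholder for the actual file format family item):

```sparql
# Variant A: explicit UNION over the two classes.
{ ?uri wdt:P31/wdt:P279* wd:Q235557 }
UNION
{ ?uri wdt:P31/wdt:P279* wd:Q_FILE_FORMAT_FAMILY }

# Variant B: zero-or-more instance-of hops; this also catches items whose
# class (e.g. File Format Family) is itself declared an instance of
# file format.
{ ?uri wdt:P31*/wdt:P279* wd:Q235557 }
```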
FWIW, here's what I've written up so far: https://anjackson.net/2024/07/12/finding-formats-in-wikidata/
I'm trying out the instructions here and am getting the following error/output when trying to run

$ roy harvest -wikidata

to start off. I'm on Ubuntu 20.04 with the latest siegfried release (1.9.2); is there something obvious I'm doing wrong? (@ross-spencer?)