richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

error running `roy harvest -wikidata` #183

Open EG-tech opened 2 years ago

EG-tech commented 2 years ago

I'm trying out the instructions here and am getting the following error/output when trying to run `roy harvest -wikidata` to start off:

2022/04/27 09:23:21 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2022/04/27 09:24:55 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

I'm on Ubuntu 20.04 with the latest siegfried release (1.9.2), is there something obvious I'm doing wrong? (@ross-spencer?)

EG-tech commented 2 years ago

seeing/learning that error code 429 means too many requests are hitting the server from the client. is there a way to rate limit the requests from roy? or another way to get a Wikidata signature file to start with?

ross-spencer commented 2 years ago

that's exactly it @EG-tech. Tyler had the same a while back. (Via email request so not on Github). We might need to put it into the FAQ.

Some notes on what I wrote to Tyler:

The Wikidata documentation on the query service (WDQS) is here but it's not very clear, i.e. it talks about processing time, not how that translates to some large queries:

https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits

I have known that there is a risk this might happen, though I can't quantify how many requests trigger it or when. There are times, for example, when I have been testing where I have run the query upwards of 30 times in a day.

We set a custom header for the request which should be recognized by WDQS and prevent this issue somewhat - it is more friendly to known user-agents for example than unknown ones.
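To illustrate the custom-header idea, here is a minimal Python sketch (roy itself is written in Go, and this is not its code). It builds, but does not send, a request to the query endpoint shown in the log above with a descriptive User-Agent. The agent string and the helper name `build_wdqs_request` are made up for the example.

```python
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # endpoint from the harvest log

# Hypothetical agent string for illustration; not the header roy actually sends.
EXAMPLE_USER_AGENT = "example-harvester/0.1 (https://example.org/contact)"


def build_wdqs_request(query: str, user_agent: str = EXAMPLE_USER_AGENT) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "User-Agent": user_agent,  # identifies the client to the service
        "Accept": "application/sparql-results+json",  # ask for JSON results
    }
    return urllib.request.Request(url, headers=headers)


req = build_wdqs_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch does not add load to the service.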

Long-term something approaching rate limiting may work. Right now it's just a single request asking for a lot of data.
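A rough sketch of what client-side handling of 429 could look like, in Python rather than roy's Go and not taken from roy: retry with exponential backoff, preferring a server-supplied wait time if there is one. The `fetch` callable, its return shape, and the default delays are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical response shape: (HTTP status, seconds to wait if the server said so, body).
Response = Tuple[int, Optional[float], bytes]


def fetch_with_backoff(
    fetch: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it succeeds, backing off while the server answers 429."""
    for attempt in range(max_attempts):
        status, retry_after, body = fetch()
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"unexpected response from server: {status}")
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Prefer the server's hint; otherwise double the delay on each attempt.
        delay = retry_after if retry_after is not None else base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the behaviour can be checked without actually waiting.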

In the short-to-medium term, this pull request should mean you can grab an identifier from Richard's itforarchivists server and it will let you get up and running: https://github.com/richardlehane/siegfried/pull/178 (the PR just needs review (and fixes) and merging).

EDIT: NB. For Tyler, he just tried it later in the day or next morning and it worked.

EG-tech commented 2 years ago

thanks @ross-spencer!! that all makes sense, thanks for confirming and I'll play with your suggestion when I get the chance. amazing work!

ross-spencer commented 2 years ago

ah, thanks @EG-tech :slightly_smiling_face:

ross-spencer commented 2 years ago

The -update flag now supports Wikidata, which should provide a workaround for most people facing this issue. There's an underlying reliability issue that might still be solved here, as per above.

ross-spencer commented 1 year ago

NB. Just to report, we are still seeing this issue in places. I haven't been able to determine when a harvest call is likely to be successful, other than that it seems to work better in Europe than on the US West Coast.

ross-spencer commented 1 year ago

cc. @thorsted

Someone reached out at the last talk I gave about the Wikidata integration - specifically about long-running queries. I discovered this was because they run a mirror without timeouts for a cost per query. Their service and another example are linked to below:

(I don't think this is the way to go but it's useful to know about)

ross-spencer commented 1 week ago

@anjackson just updated our SPARQL query on the digipres format explorer, the magic is in the FILTER expression, and cuts results from 70,000 to 17,000 (approx.) worth a try to see if it improves performance?

SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
WHERE
{
  # Return records of type File Format or File Format Family (via instance or subclass chain):
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.

  # Only return records that have at least one useful format identifier
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.       

  OPTIONAL { ?uri wdt:P2748 ?puid.      }          # PUID is used to map to PRONOM signatures
  OPTIONAL { ?uri wdt:P1195 ?extension. }          # File extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype.  }          # IANA Media Type
  OPTIONAL { ?uri p:P4152 ?object;                 # Format identification pattern statement
    OPTIONAL { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding
    OPTIONAL { ?object ps:P4152 ?sig.        }     # We always have a signature
    OPTIONAL { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file
    OPTIONAL { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity
    OPTIONAL { ?object prov:wasDerivedFrom ?provenance;
       OPTIONAL { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY ?uri

@thorsted have you tried the custom sparql technique? https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#using-the-custom-wikibase-functionality-for-wikidata <-- any chance you could try this sparql above to see if it returns more reliably?

(I can create a test binary too)

via https://github.com/digipres/digipres.github.io/pull/48#issuecomment-2222350174

ross-spencer commented 1 week ago

nb. this query needs a PUID, MIME type, or extension, though, and there might be Wikidata records without these, so maybe we need to add in sig, e.g. FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|p:P4152 [] }.

anjackson commented 1 week ago

Thanks @ross-spencer - FWIW I'm in the midst of writing things up and I'm not at all sure I'm quite there yet. The wdt:P31/wdt:P279* wd:Q235557 pattern seems to be missing out some records (e.g. no *.psd!), and I'm seeing different variations in different places (wdt:P31*/wdt:P279*, p:P31/ps:P31/wdt:P279*) which I can't say I fully understand at this point.

But the FILTER thing seems to help with the overall size/performance.

ross-spencer commented 1 week ago

@anjackson there was some explanation of these patterns here https://github.com/ffdev-info/wikidp-issues/issues/24#issuecomment-1346662832 via @BertrandCaron that may be helpful?

re: the PSD issue, this is why you included the UNION of file format family? did it work?

anjackson commented 1 week ago

@ross-spencer Yes, adding that UNION brought in PSD, which is declared as an instance of File Format Family but not of File Format (other families appear to be explicitly declared as instances of both). But so did using P31* instead of UNION, as a File Format Family is an instance of a File Format. At the time of writing, UNION matches 69,961 (unFILTERed) records and P31* matches 70,363, so something else is going on too. This is what I'm attempting to write up.
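One way to dig into a gap like this might be to diff the ?uri values returned by the two queries, rather than comparing counts alone. A minimal Python sketch, assuming the two result sets have already been exported as lists of URIs; the helper name `diff_results` and the sample identifiers are made-up placeholders, not real Wikidata results.

```python
from typing import Iterable, Set, Tuple


def diff_results(union_uris: Iterable[str], p31_star_uris: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (URIs only in the UNION results, URIs only in the P31* results)."""
    union_set, star_set = set(union_uris), set(p31_star_uris)
    return union_set - star_set, star_set - union_set


# Placeholder identifiers for illustration only.
union_sample = ["item-a", "item-b", "item-c"]
star_sample = ["item-b", "item-c", "item-d", "item-e"]

only_union, only_star = diff_results(union_sample, star_sample)
```

Inspecting a handful of the records that appear on only one side would show which extra paths each pattern is following.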

anjackson commented 1 week ago

FWIW, here's what I've written up so far: https://anjackson.net/2024/07/12/finding-formats-in-wikidata/