richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Implement digipres.org changes and min sig length #253

Closed by ross-spencer 4 months ago

ross-spencer commented 4 months ago

A filter has been added to the default query so that it returns only the results relevant to us. This reduces the Wikidata Query Service query time as well as the time required to generate provenance.

Additionally, because of TrID issues, we implement a minimum signature length, which reduces the time even further.
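
To illustrate the idea of a minimum signature length, here is a minimal sketch in Go. It assumes signatures are hex-encoded strings, so a minimum length of 6 characters corresponds to 3 bytes. `filterByMinLength` is a hypothetical helper written for this example and is not taken from roy; where roy actually applies the limit (in the SPARQL query or afterwards) is not shown here.

```go
package main

import "fmt"

// filterByMinLength keeps only hex-encoded signatures that have at least
// minHexLen hex characters (two hex characters encode one byte).
// Hypothetical helper for illustration; not part of siegfried or roy.
func filterByMinLength(sigs []string, minHexLen int) []string {
	kept := make([]string, 0, len(sigs))
	for _, sig := range sigs {
		if len(sig) >= minHexLen {
			kept = append(kept, sig)
		}
	}
	return kept
}

func main() {
	// Illustrative sample values only, not harvested from Wikidata.
	sigs := []string{"FFD8", "25504446", "89504E470D0A1A0A", "4D5A"}
	for _, sig := range filterByMinLength(sigs, 6) {
		fmt.Printf("%s (%d bytes)\n", sig, len(sig)/2)
	}
}
```

Very short signatures are dropped before they reach the identifier, which is the trade-off being made: fewer records to process, at the cost of ignoring formats that only have a one- or two-byte signature.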

Connected to: https://github.com/ffdev-info/wikidp-issues/issues/32
Connected to: https://github.com/richardlehane/siegfried/issues/183
Connected to: https://github.com/ffdev-info/wikidp-issues/issues/38
Co-authored-by: @anjackson

NB: default roy can no longer harvest from Wikidata on my computer, at least not within 10 minutes. It might complete if left longer.

With filter:

time ./roy harvest --wikidata
2024/07/13 14:10:48 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2024/07/13 14:10:48 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2024/07/13 14:10:48 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2024/07/13 14:15:37 Roy (Wikidata): Harvesting Wikidata definitions '/home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0' complete

real    4m48.907s
user    0m10.730s
sys     0m2.906s

./roy build -wikidata
2024/07/13 15:21:55 Roy (Wikidata): Congratulations: doing something with the Wikidata identifier package!
2024/07/13 15:21:55 Roy (Wikidata): Opening Wikidata definitions: /home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0
2024/07/13 15:22:40 {
  "AllSparqlResults": 16749,
  "CondensedSparqlResults": 13720,
  "SparqlRowsWithSigs": 10609,
  "RecordsWithPotentialSignatures": 9133,
  "FormatsWithBadHeuristics": 60,
  "RecordsWithSignatures": 9073,
  "MultipleSequences": 12,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 253,
  "RecordCountWithLintingMessages": 199
}
2024/07/13 15:22:40 Roy (Wikidata): Building identifiers set from PRONOM
2024/07/13 15:22:50 Roy (Wikidata): In Infos()... length formats: '13720' no-pronom: 'false'
2024/07/13 15:22:50 Roy (Wikidata): Adding Glob signatures to identifier...
2024/07/13 15:22:50 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:22:51 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:22:52 Roy (Wikidata): Adding Wikidata Byte signatures to identifier...

With a minimum signature length (siglen min) of 6 (3 bytes):

time ./roy harvest -wikidata
2024/07/13 15:54:37 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2024/07/13 15:54:37 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2024/07/13 15:54:37 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2024/07/13 15:57:21 Roy (Wikidata): Harvesting Wikidata definitions '/home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0' complete

real    2m43.534s
user    0m6.724s
sys     0m1.961s

./roy build -wikidata
2024/07/13 15:57:54 Roy (Wikidata): Congratulations: doing something with the Wikidata identifier package!
2024/07/13 15:57:54 Roy (Wikidata): Opening Wikidata definitions: /home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0
2024/07/13 15:58:09 {
  "AllSparqlResults": 9284,
  "CondensedSparqlResults": 8118,
  "SparqlRowsWithSigs": 9284,
  "RecordsWithPotentialSignatures": 8118,
  "FormatsWithBadHeuristics": 46,
  "RecordsWithSignatures": 8072,
  "MultipleSequences": 12,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 195,
  "RecordCountWithLintingMessages": 156
}
2024/07/13 15:58:09 Roy (Wikidata): Building identifiers set from PRONOM
2024/07/13 15:58:19 Roy (Wikidata): In Infos()... length formats: '8118' no-pronom: 'false'
2024/07/13 15:58:19 Roy (Wikidata): Adding Glob signatures to identifier...
2024/07/13 15:58:19 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:58:19 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:58:20 Roy (Wikidata): Adding Wikidata Byte signatures to identifier...
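
For a quick comparison of the two runs, the relative reductions can be worked out from the figures in the logs above. This is only a sketch: `percentReduction` is a helper made up for this example, and the inputs are the `real` harvest times and the `AllSparqlResults` and `RecordsWithSignatures` counts copied from the two outputs.

```go
package main

import (
	"fmt"
	"time"
)

// percentReduction returns how much smaller after is than before, in percent.
// Hypothetical helper for illustration; not part of siegfried or roy.
func percentReduction(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Wall-clock ("real") harvest times copied from the two runs.
	filterOnly, err := time.ParseDuration("4m48.907s")
	if err != nil {
		panic(err)
	}
	withMinLen, err := time.ParseDuration("2m43.534s")
	if err != nil {
		panic(err)
	}

	fmt.Printf("harvest time: %.1f%% shorter\n",
		percentReduction(filterOnly.Seconds(), withMinLen.Seconds()))
	// Counts copied from the JSON summaries of the two builds.
	fmt.Printf("AllSparqlResults: %.1f%% fewer\n", percentReduction(16749, 9284))
	fmt.Printf("RecordsWithSignatures: %.1f%% fewer\n", percentReduction(9073, 8072))
}
```

These are single runs on one machine, so the timing figure in particular should be read as indicative rather than as a benchmark.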

ross-spencer commented 4 months ago

@richardlehane cc. @thorsted -- is there any chance we can build a release candidate with these changes to trial them?

richardlehane commented 4 months ago

@ross-spencer @thorsted these changes are now on the develop branch and have been built as a release candidate (version 1.11.2-rc0).