pyvandenbussche / lov

Linked Open Vocabularies (LOV) - FrontEnd
http://lov.okfn.org/dataset/lov/

LOV data dump hosted as a git repo #80

Open hoijui opened 5 months ago

hoijui commented 5 months ago

Available here: https://codeberg.org/elevont/lov-dump

I did this because, for software relying on the dump, it comes in handy to be able to include it as a git submodule instead of having to download it before or during the build process. It prevents needless re-downloads, security policy bells ringing, and many other similar issues.

I did it on codeberg.org because both GitHub and GitLab.com have 100MB blob size limits, and this is 208MB.

VladimirAlexiev commented 2 months ago

@hoijui thanks! I listed your dump on Wikidata: https://www.wikidata.org/wiki/Q39392701#P4945 .

Please state your update policy. Your dump is 3 months old, but this query at https://lov.linkeddata.es/dataset/lov/sparql

prefix dct:  <http://purl.org/dc/terms/>
select * { # (max(?upd) as ?updated) {
  ?x dct:modified ?upd
} order by desc(?upd) limit 20

shows newer entries.

Is that because the LOV dump itself is 3 months old, or because you don't track it regularly?
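The staleness check asked about here could be automated roughly as follows. This is a minimal sketch, assuming the endpoint returns the standard SPARQL JSON results format and that `dct:modified` values are ISO 8601 strings; the query is a trimmed variant of the one above, the helper names `build_query_url`, `latest_modified` and `dump_is_stale` are hypothetical, and the HTTP request itself is left out.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://lov.linkeddata.es/dataset/lov/sparql"

# Variant of the query above that returns only the newest modification date.
QUERY = """prefix dct: <http://purl.org/dc/terms/>
select ?upd { ?x dct:modified ?upd } order by desc(?upd) limit 1"""


def build_query_url(endpoint=ENDPOINT, query=QUERY):
    """Build a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def latest_modified(results_json):
    """Return the largest ?upd value from a SPARQL JSON results document.

    Assumption: all values share one ISO 8601 layout, so plain string
    comparison orders them chronologically.
    """
    bindings = json.loads(results_json)["results"]["bindings"]
    values = [b["upd"]["value"] for b in bindings if "upd" in b]
    return max(values) if values else None


def dump_is_stale(dump_date, latest):
    """Compare only the YYYY-MM-DD prefix of both timestamps."""
    return latest[:10] > dump_date[:10]
```

Fetching `build_query_url()` with an `Accept: application/sparql-results+json` header and passing the response body to `latest_modified` would give the date to compare against the last commit date of the dump repository.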

hoijui commented 2 months ago

Thank you for that. Indeed, I completely neglected it! I think that happened because I initially planned to do this with GitHub Actions, but moving to Codeberg made that more cumbersome, and it got lost. Of course, the dump is of little use without regular updates, so thank you! How would you do it? As Codeberg has limited resources, I think it would be good to use a scheduled (e.g. once a day) GitHub Action and push any changes over to Codeberg. It should be relatively straightforward, as long as I don't run into any size or access limitations...

hoijui commented 2 months ago

It should now be updated daily (if there are changes) from this repo's CI: https://github.com/elevont/lov-dump-updater
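For readers wondering what "updated daily (if there are changes)" can look like in practice, here is a minimal sketch of the change-detection step. It is not taken from the updater repository; the function names `file_digest` and `update_if_changed` are hypothetical, and the download, commit and push steps are assumed to happen elsewhere in the CI job.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so a large dump is never held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_if_changed(target: Path, downloaded: Path) -> bool:
    """Copy the fresh download over the tracked file only when the content differs.

    Returns True when the tracked file was created or replaced, which is the
    signal for the (assumed) later commit-and-push step.
    """
    if target.exists() and file_digest(target) == file_digest(downloaded):
        return False
    shutil.copyfile(downloaded, target)
    return True
```

A scheduled job would call `update_if_changed` once per run and skip the commit entirely when it returns `False`.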

... but ... it looks like there is an issue with the blank nodes. :/ On each data dump, they get assigned different (random) IDs, and this shows up in the diff, of course, so about 1/3 of all lines show up as changed. That is neither meaningful nor maintainable over time. Any idea how to solve this? The best way would be to have fixed IDs for blank nodes (as in, they don't change between data dumps). Are you from the LOV team, by any chance?
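One way to make the diff stable on the consumer side is to relabel blank nodes deterministically before committing. The following is a minimal sketch, assuming the dump is available in a line-based serialization such as N-Triples or N-Quads and that no literal contains text that looks like a blank node label; the function name `stable_blank_nodes` is hypothetical. It is a simplified one-hop hash, not a full canonicalization algorithm such as W3C RDF Dataset Canonicalization.

```python
import hashlib
import re

# Simplified blank node label pattern (assumption: labels use only these characters).
BNODE = re.compile(r"_:[A-Za-z0-9_-]+")


def stable_blank_nodes(lines):
    """Relabel blank nodes by hashing the statements they occur in, then sort."""
    signatures = {}
    for line in lines:
        for label in set(BNODE.findall(line)):
            # Mask every label so the random IDs do not influence the hash.
            masked = BNODE.sub(
                lambda m: "_:SELF" if m.group(0) == label else "_:OTHER", line
            )
            signatures.setdefault(label, []).append(masked)

    mapping = {}
    for label, masked_lines in signatures.items():
        text = "\n".join(sorted(masked_lines))
        mapping[label] = "_:h" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

    # Sorting also removes spurious diffs caused by statement order.
    return sorted(BNODE.sub(lambda m: mapping[m.group(0)], line) for line in lines)
```

Caveat: two blank nodes whose surrounding statements are identical after masking receive the same label and are therefore merged, so this sketch is only safe where that cannot change the meaning of the data. A dedicated canonicalization tool avoids that limitation.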

VladimirAlexiev commented 2 months ago

https://github.com/atextor/turtle-formatter/issues/8 : there is active development on this tool, and stability of blank nodes is one of the issues being addressed.

You use it as described at https://atextor.de/owl-cli/main/snapshot/usage.html#write-command

I'm not from the LOV team, if indeed there is such.