Open fkleedorfer opened 3 weeks ago
I think this is the first thing we need if we are to get some automation going. I'll make a draft PR soonish.
hi @fkleedorfer ! Good idea, but could you elaborate a bit on what you want to automate? Let's gather a list of requirements here (cc @steveraysteveray @ralphtq). Florian, can you undertake to collect requirements and put them in the issue description, or if you prefer in a separate file (I guess that's what the PR you mentioned will be about)?
I would like the work to be done in reasonably small chunks (because I don't have enormous amounts of time for it), so I'd like to first not add new functionality, just automate what already exists.
We are looking at a lot of things that can be added once the build automation is in place.
The first problem is choosing the build system itself. I did not get a lot of input on that question in the discussion; however, the current favorite is Maven. That's what my PR will be about. At the moment I am looking at how to do TTL formatting in that setting (probably Jena's pretty-printer, but we'll see; there is also https://github.com/atextor/turtle-formatter ). Weirdly, there is no Maven integration for either. (Side glance at Spotless.)
@fkleedorfer But is there a problem with the turtle formatting of QUDT? I think it comes from TQ, and I think it's just fine?
(Accidentally deleted my post, so I'm rewriting it here.) Yes: contributors cannot reproduce it. When you contribute triples, you'll add them wherever, and at some point Steve pulls the code, reformats it and pushes it. That's not a great workflow.
If formatting was part of the build, our life would be easier.
That is not to say that TQ formatting is bad. If we can use it in a build then maybe we should.
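To make "formatting as part of the build" a bit more concrete, here is a minimal, tool-agnostic sketch of a check that fails when a file is not in formatter output form. It assumes nothing about which formatter we end up with: `find_unformatted` and `placeholder_formatter` are hypothetical names, and the placeholder only normalizes whitespace so the example runs on its own. A real build would plug in whatever Turtle formatter gets chosen.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def find_unformatted(paths: Iterable[Path], formatter: Callable[[str], str]) -> List[Path]:
    """Return the files whose content changes when run through `formatter`."""
    offenders = []
    for path in paths:
        original = path.read_text(encoding="utf-8")
        if formatter(original) != original:
            offenders.append(path)
    return offenders


def placeholder_formatter(text: str) -> str:
    """Hypothetical stand-in for a real Turtle formatter.

    It only strips trailing whitespace and enforces a single final newline.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines).rstrip("\n") + "\n"
```

A build target could call `find_unformatted` on all `.ttl` files and fail if the returned list is non-empty, so contributors see the same result locally as in CI.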
I think the serialization we use in TopBraid is fairly common - alphabetical by grouped subject - isn't it? I assume that same serialization is available via the TQ API if we use that for inferencing and validation in the build, although I haven't checked. I'm not sure what the PySHACL library does, but my understanding is that it is slower and not complete.
OWL-API is also common.
@ashleysommer @nicholascar can you comment on completeness of pySHACL?
Else go for RDF Canonicalization https://www.w3.org/TR/rdf-canon/ JS Implementation here: https://github.com/digitalbazaar/rdf-canonize RDFlib here?: https://github.com/eyusupov/rdflib-canon
(is this in the TQ Suite?)
Canonicalization is relevant for consistent ordering of blank nodes across multiple serializations. That's the one thing most formatters will fail to do.
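A small illustration of the blank node problem, as a sketch only: the two N-Triples snippets below are made-up examples (the `example.org` IRIs are hypothetical) that describe the same graph with different blank node labels. Sorting the lines is not enough to make them compare equal. Masking the labels shows that the labels are the only difference, but masking is lossy and is not canonicalization; it can make genuinely different graphs look the same, which is why a real algorithm such as the one in the RDF canonicalization spec linked above is needed.

```python
import re

# Matches blank node labels such as _:b0 or _:genid7.
BLANK_NODE = re.compile(r"_:[A-Za-z0-9_]+")


def sorted_lines(ntriples: str) -> list:
    """Sort the non-empty lines of an N-Triples document."""
    return sorted(line.strip() for line in ntriples.splitlines() if line.strip())


def masked_lines(ntriples: str) -> list:
    """Replace every blank node label with the same token, then sort (lossy!)."""
    return sorted(BLANK_NODE.sub("_:x", line) for line in sorted_lines(ntriples))


DOC_A = """
<http://example.org/unit> <http://example.org/hasFactor> _:b0 .
_:b0 <http://example.org/value> "2" .
"""

DOC_B = """
_:genid7 <http://example.org/value> "2" .
<http://example.org/unit> <http://example.org/hasFactor> _:genid7 .
"""

print(sorted_lines(DOC_A) == sorted_lines(DOC_B))  # False: labels differ
print(masked_lines(DOC_A) == masked_lines(DOC_B))  # True: only the labels differ
```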
Don't most contributors submit relatively small PRs, typically new units, where they can follow the existing formatting even by hand?
In addition to the question of formatting, let's collect other needs for a build workflow, like checking data consistency using SPARQL. See my two bullets above.
Would you be ok wrapping the SPARQL queries in a SHACL shape or would you prefer another way, such as a folder with files containing sparql queries, and some convention for how their results should be interpreted?
I vote for a SHACL shape, since we already do other validations that way (not yet part of the build).
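For comparison, the folder-of-queries alternative could look roughly like the sketch below. The convention is an assumption made up for illustration: each `*.rq` file holds a SELECT query that returns one row per violation, so an empty result means the check passed. `run_checks` and the `run_query` callable are hypothetical; the callable stands in for whatever SPARQL engine the build would use.

```python
from pathlib import Path
from typing import Callable, Dict, List, Sequence

# Assumed convention: one row per violation, empty result = check passed.
QueryRunner = Callable[[str], Sequence[dict]]


def run_checks(query_dir: Path, run_query: QueryRunner) -> Dict[str, List[dict]]:
    """Run every *.rq file in `query_dir` and collect the failing ones.

    Returns a mapping from query file name to the violation rows it produced.
    """
    failures: Dict[str, List[dict]] = {}
    for query_file in sorted(query_dir.glob("*.rq")):
        rows = list(run_query(query_file.read_text(encoding="utf-8")))
        if rows:
            failures[query_file.name] = rows
    return failures
```

A SHACL shape carries the message and severity alongside the query, whereas here that information would have to live in the convention itself (file names, comments, or fixed result columns).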
@steveraysteveray and @fkleedorfer
SHACL vs SPARQL:
message, value, etc.
the serialization we use in TopBraid is fairly common - alphabetical by grouped subject
I like it. If classes and props follow naming conventions, then that sorts them in the proper order. I'd just move individuals last: but most ontologies have terms or individuals, not both, so that's ok.
But I see Florian contributing to https://github.com/atextor/turtle-formatter: Can you share impressions and should we use it instead of TQ TB?
My point would be that formatting should be accessible to any developer who wants to contribute. I don't think that will be the case with TopBraid. I was hoping to be able to do it with jena, but it's not so simple. turtle-formatter is a decent solution for us (if it works, which is what I'm working on).
As there is more to formatting your codebase than just formatting one file, I've prepared a contribution to spotless - a spotless RDF plugin, if you like, that will use whatever we manage on the file-formatting side (turtle-formatter for TTL, jena for everything else, or just not support anything else), to format the whole codebase. The spotless RDF plugin is more or less done, except for tests, and we'll need a published turtle-formatter jar with our changes.
EDIT: My impression of turtle-formatter is that its default output is ok, it is highly configurable, and the codebase is small and I'm confident we can contribute any formatting options that we need, for example, individuals last.
turtle-formatter
@atextor is actively engaged and responsive: a big plus
turtle-formatter for some large-scale electrical ontologies (CIM/CGMES)
@nicholascar is this the formatter you use? (I think you all had a standard Turtle formatter to help with diffs)
Use a build tool?
Problem: All Issues brought up so far require or aim at some kind of build automation. There currently is none.
Why is that a problem: Anything that needs to be done manually will cause errors, bottlenecks and dependency on individuals
Cause: Most programming languages/frameworks come with a variety of build tools, and most projects use one. However, this is an ontology project, inherently independent from programming languages, and therefore, it is not obvious what should be used. That is probably the reason why none is in use.
Fix: Choose one build tool that the community can live with and refactor the project so it uses that tool. Bonus: GitHub Actions workflows become easier to write and maintain because they might only need to run some build targets.
So, question: What would be your criteria for choosing a build tool, and which one, if any, should it be?
Originally posted by @fkleedorfer in https://github.com/qudt/qudt-public-repo/discussions/942#discussioncomment-9902240
Edit: collecting requirements/ideas/aspects from the comments here (and my own)
This issue is not about adding new functionality, just about automating what is currently done manually or semi-automatically.
Incomplete list of future functionality to be implemented in the build