qudt / qudt-public-repo

QUDT - Quantities, Units, Dimensions and dataTypes - public repository

Use a build tool #959

Open fkleedorfer opened 3 weeks ago

fkleedorfer commented 3 weeks ago

Use a build tool?

Problem: All Issues brought up so far require or aim at some kind of build automation. There currently is none.

Why is that a problem: Anything that needs to be done manually will cause errors, bottlenecks, and dependency on individuals.

Cause: Most programming languages/frameworks come with a variety of build tools, and most projects use one. However, this is an ontology project, inherently independent from programming languages, and therefore, it is not obvious what should be used. That is probably the reason why none is in use.

Fix: Choose one build tool that the community can live with and refactor the project so it uses that tool. Bonus: GitHub Actions become easier to make and maintain because they might only need to run some build targets.

So, question: What would be your criteria for choosing a build tool, and which one, if any, should it be?

Originally posted by @fkleedorfer in https://github.com/qudt/qudt-public-repo/discussions/942#discussioncomment-9902240

Edit: collecting requirements/ideas/aspects from the comments here (and my own)

This issue is not about adding new functionality, just about automating what is currently done manually or semi automatically

Incomplete list of future functionality to be implemented in the build

fkleedorfer commented 3 weeks ago

I think this is the first thing we need if we are to get some automation going. I'll make a draft PR soonish.

VladimirAlexiev commented 2 weeks ago

Hi @fkleedorfer! Good idea, but could you elaborate a bit on what you want to automate? Let's gather a list of requirements here (cc @steveraysteveray @ralphtq). Florian, can you undertake to collect requirements and put them in the issue description, or if you prefer in a separate file (guess that's what the PR you mentioned will be about?)

fkleedorfer commented 2 weeks ago

I would like the work to be done in reasonably small chunks (because I don't have enormous amounts of time for it), so I'd like to first not add new functionality, just automate what exists.

We are looking at a lot of things that can be added once the build automation is in place.

The first problem is choosing the build system itself. I did not get a lot of input on the question in the discussion; however, the current favorite is Maven. That's what my PR will be about. At the moment I am looking at how to do TTL formatting in that setting. (Probably Jena prettyprint, but we'll see; there is also https://github.com/atextor/turtle-formatter ). Weirdly, there is no Maven integration for either. (Side glance at Spotless.)

VladimirAlexiev commented 2 weeks ago

@fkleedorfer But is there a problem with the turtle formatting of QUDT? I think it comes from TQ, and I think it's just fine?

fkleedorfer commented 2 weeks ago

@fkleedorfer But is there a problem with the turtle formatting of QUDT?

(Accidentally deleted my post, so I'm rewriting it here.) Yes: contributors cannot reproduce it. When you contribute triples, you'll add them wherever, and at some point Steve pulls the code, reformats it and pushes it. That's not a great workflow.

If formatting was part of the build, our life would be easier.

That is not to say that TQ formatting is bad. If we can use it in a build, then maybe we should.

steveraysteveray commented 2 weeks ago

I think the serialization we use in TopBraid is fairly common - alphabetical by grouped subject - isn't it? I assume that same serialization is available via the TQ API if we use that for inferencing and validation in the build, although I haven't checked. I'm not sure what the PySHACL library does, but my understanding is that it is slower and not complete.
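To make "alphabetical by grouped subject" concrete, here is a minimal Python sketch of that ordering over plain (subject, predicate, object) tuples. It illustrates the idea only and is not TopBraid's actual serializer; the function name `group_by_subject` and the sample triples are made up for the example.

```python
from collections import defaultdict

def group_by_subject(triples):
    """Return [(subject, [(predicate, object), ...]), ...] in sorted order.

    Hypothetical helper: subjects are sorted alphabetically, and each
    subject's predicate/object pairs are sorted within its group.
    """
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((p, o))
    return [(s, sorted(groups[s])) for s in sorted(groups)]

# Illustrative triples, deliberately out of order (not copied from QUDT).
triples = [
    ("unit:M", "rdfs:label", '"Metre"'),
    ("unit:KiloGM", "rdfs:label", '"Kilogram"'),
    ("unit:M", "rdf:type", "qudt:Unit"),
    ("unit:KiloGM", "rdf:type", "qudt:Unit"),
]

grouped = group_by_subject(triples)
for subject, pairs in grouped:
    print(subject)
    for predicate, obj in pairs:
        print(f"    {predicate} {obj}")
```

Because the order depends only on the triples themselves, any contributor running the same step would get the same file layout, regardless of where in the file the new triples were added.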

dr-shorthair commented 2 weeks ago

OWL-API is also common.

@ashleysommer @nicholascar can you comment on completeness of pySHACL?

dr-shorthair commented 2 weeks ago

Else go for RDF Canonicalization: https://www.w3.org/TR/rdf-canon/

JS implementation here: https://github.com/digitalbazaar/rdf-canonize

RDFLib here?: https://github.com/eyusupov/rdflib-canon

(is this in the TQ Suite?)

fkleedorfer commented 2 weeks ago

Canonicalization is relevant for consistent ordering of blank nodes across multiple serializations. That's the one thing most formatters will fail to do.
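A small Python sketch of why blank nodes defeat purely sort-based formatting: the two lists below describe the same graph and differ only in which arbitrary label each blank node received, yet sorting them yields different output. The triples and the `sorted_lines` helper are invented for illustration; real canonicalization assigns blank node labels deterministically and is considerably more involved than anything shown here.

```python
# Same graph twice; only the arbitrary blank node labels are swapped.
graph_a = [
    ("ex:s", "ex:p", "_:b0"),
    ("_:b0", "ex:value", '"1"'),
    ("_:b1", "ex:value", '"2"'),
]
graph_b = [
    ("ex:s", "ex:p", "_:b1"),
    ("_:b1", "ex:value", '"1"'),
    ("_:b0", "ex:value", '"2"'),
]

def sorted_lines(triples):
    """Naive 'formatter': one line per triple, sorted lexicographically."""
    return sorted(" ".join(t) + " ." for t in triples)

lines_a = sorted_lines(graph_a)
lines_b = sorted_lines(graph_b)

# The two outputs differ even though the graphs are isomorphic, so a diff
# between them would show changes where nothing actually changed.
print(lines_a != lines_b)
```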

VladimirAlexiev commented 1 week ago

Don't most contributors submit relatively small PRs, typically new units, where they can follow the existing formatting even by hand?

In addition to the question of formatting, let's collect other needs for a build workflow, like checking data consistency using SPARQL. See my two bullets above.

fkleedorfer commented 1 week ago

Like checking data consistency using SPARQL

Would you be ok wrapping the SPARQL queries in a SHACL shape or would you prefer another way, such as a folder with files containing sparql queries, and some convention for how their results should be interpreted?
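For comparison, the second option (a folder of query files plus an interpretation convention) could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the function name `run_checks`, the `*.rq` file extension, and the convention that a query returning any row counts as a violation. The query engine is passed in as a callable so the sketch does not commit to a particular library.

```python
from pathlib import Path

def run_checks(query_dir, execute):
    """Run every *.rq file in query_dir and collect violations.

    Assumed convention: each file holds one SELECT query, and any row it
    returns is a violation. `execute` is a caller-supplied function that
    takes the query text and returns an iterable of result rows.
    """
    violations = {}
    for query_file in sorted(Path(query_dir).glob("*.rq")):
        rows = list(execute(query_file.read_text(encoding="utf-8")))
        if rows:
            violations[query_file.name] = rows
    return violations
```

With rdflib, for instance, `execute` could be `lambda q: graph.query(q)` for an already loaded `rdflib.Graph`; a build step would then fail whenever the returned dictionary is non-empty. The SHACL alternative moves the same queries into shapes, so the "what counts as a violation" convention comes from the SHACL validation report rather than from a project-specific rule.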

steveraysteveray commented 6 days ago

I vote for a SHACL shape, since we already do other validations that way (not yet part of the build).

VladimirAlexiev commented 3 days ago

@steveraysteveray and @fkleedorfer

SHACL vs SPARQL:

the serialization we use in TopBraid is fairly common - alphabetical by grouped subject

I like it. If classes and props follow naming conventions, then that sorts them in the proper order. I'd just move individuals last: but most ontologies have terms or individuals, not both, so that's ok.

But I see Florian contributing to https://github.com/atextor/turtle-formatter: Can you share impressions and should we use it instead of TQ TB?

fkleedorfer commented 3 days ago

But I see Florian contributing to https://github.com/atextor/turtle-formatter: Can you share impressions and should we use it instead of TQ TB?

My point would be that formatting should be accessible to any developer who wants to contribute. I don't think that will be the case with TopBraid. I was hoping to be able to do it with jena, but it's not so simple. turtle-formatter is a decent solution for us (if it works, which is what I'm working on).

As there is more to formatting your codebase than just formatting one file, I've prepared a contribution to spotless - a spotless RDF plugin, if you like, that will use whatever we manage on the file-formatting side (turtle-formatter for TTL, jena for everything else, or just not support anything else), to format the whole codebase. The spotless RDF plugin is more or less done, except for tests, and we'll need a published turtle-formatter jar with our changes.

EDIT: My impression of turtle-formatter is that its default output is ok, it is highly configurable, and the codebase is small and I'm confident we can contribute any formatting options that we need, for example, individuals last.

VladimirAlexiev commented 3 days ago
dr-shorthair commented 2 days ago

@nicholascar is this the formatter you use? (I think you all had a standard Turtle formatter to help with diffs.)