validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/
Other
56 stars 26 forks source link

Publish as modular JAR(s) to Maven Central #14

Open anthonyvdotbe opened 4 years ago

anthonyvdotbe commented 4 years ago

Motivation:

I'm willing to help by providing PRs such as:

However, I have a few questions:

@hsivonen @sideshowbarker what are your thoughts on this?

sideshowbarker commented 4 years ago

As far as Maven goes, this project so far isn’t internally relying on Maven for anything. Maven and the Central Repository aren’t the sole way or even the primary way that the HTML parser releases has been distributed. And Maven isn’t the primary build tool used for building releases.

Given that, making any disruptive changes to the sources primarily just to get things to work more easily with Maven seems like a “tail wagging the dog” kind of thing; in particular, the following:

  • adapt the folder layout to Maven conventions
  • split the project into multiple Maven modules as needed

Personally speaking, I don’t use Maven at all. For the HTML checker and its dependencies, the only reason I publish Maven packages is that others have asked for them.

So I’m not super enthusiastic about the prospect of changes being made to the HTML parser sources if those changes would end up meaning that I’d then need to put significant time into making fixes myself to the HTML checker to align it with the HTML parser changes.

However, all that said, if @hsivonen sees benefits in what you’re proposing, then I would not be opposed — and I’d be willing to commit time to reviewing and testing patches.

anthonyvdotbe commented 4 years ago

Thanks for your feedback. Just for the record, my sole goal is to have modular JAR(s) published to Maven Central. I'm open to any suggestions on how to achieve that. What I've described, is just 1 way to do so, but one that I'm willing to help out with. And I could keep the packages unchanged, such that there shouldn't be any code changes at all, except for XOM/jchardet removal.

carlosame commented 4 years ago

@anthonyvdotbe:

adapt the folder layout to Maven conventions

As mentioned by @sideshowbarker, changing the source layout just to adapt to Maven defaults does not look reasonable. My projects do not follow Maven defaults, instead I change the POM to override the defaults.

upgrade the compiler level to 11 as well (the code can stay on the 1.5 syntactic level

I assume that you mean a multi-release jar, but recent JDKs cannot target 1.5 (1.7 is the minimum AFAIR). So the question is whether the users of this project require 1.5 binary compatibility, or which one would be the minimum acceptable. Otherwise the idea of adding a module-info looks good.

@sideshowbarker:

Maven isn’t the primary build tool used for building releases.

No matter which tool is used to build a release, setting an automatic module name would be useful to those of us that have builds for recent JDKs; my pull request does that for Maven.

Otherwise, people using JDKs version 9 or later are going to get similar warnings to the following one that I get when I build my css4j-agent module with Maven:

[WARNING] ******************************************************************************************************************************************
[WARNING] * Required filename-based automodules detected: [htmlparser-1.4.jar]. Please don't publish this project to a public artifact repository! *
[WARNING] ******************************************************************************************************************************************

Having a full JPMS modularization like @anthonyvdotbe proposes would be good, but my pull request is a simpler solution that is available just now.

More information about module names:

Automatic-Module-Name: Calling all Java Library Maintainers

sideshowbarker commented 4 years ago

it's been since 2012 since the last release

Yeah, going for 8 years without a new release is a long time, so I agree it’s time we should strongly consider trying to get a new release out.

Just for the record, my sole goal is to have modular JAR(s) published to Maven Central.

I also agree that given the shift in Java overall to a module system, it makes sense to strongly consider moving to distribution of a modular jar.

And as far as Central, in practice even for non-Maven users, it now seems to be the single primary common place to get jars from — so I also agree it makes sense to provide distributions there.

Having a full JPMS modularization like @anthonyvdotbe proposes would be good, but my pull request is a simpler solution that is available just now.

I personally have no objection to first just merging that small change from #13, if @hsivonen is supportive. But even if we were to only do that, we’d still need to do a new release, right?

If we’re gonna release a new jar at all, I guess it makes sense to strongly consider doing the work to ensure it’s got the whole set of current-best-practice characteristics for jar releases that are expected by users nowadays.

But I can’t commit to support for moving forward with anything until @hsivonen weighs in.

carlosame commented 4 years ago

If we’re gonna release a new jar at all, I guess it makes sense to strongly consider doing the work to ensure it’s got the whole set of current-best-practice characteristics for jar releases

Other projects are releasing a jar with an automatic module first, and then another with full modularization. But any issues about split packages etc. should be resolved before that. Also, my PR assumes that there will be a single module.

I am unaware of any split-package issue inside this project (yes the dependencies do have them), and the single-module approach seems correct unless one wants to start making deep changes to the project just now.

But of course if one and only one release is going to be produced, by all means let it have full JPMS.

carlosame commented 4 years ago

@anthonyvdotbe I forgot to mention something about this:

remove jchardet and XOM support

There are tricks to use dependencies with split packages (I have a similar problem in my css4j-dom4j module), so XOM could possibly be kept. Yes, I know that's less than perfect.

And jchardet could be kept as a filename-generated module name (much like this project's htmlparser.jar is currently). Again, yes I know...

hsivonen commented 4 years ago

I'm completely OK with publishing a new release. I should have done it many times over the past years. It's been a "next week" item for way too long. 8 years indeed is a long time. 😬

I also want to get rid of jchardet and the ICU detectors. Neither represents present-day browser-compatible behavior. Ideally, these would be replaced with chardetng, but adding JNI dependencies isn't really in the Java culture, and I haven't seriously investigating compiling it into Java byte code. (asmble might be an avenue.)

I'd like to better understand the motivation for removing XOM support. I gather that Maven ships both source packages and binary packages. The parser is designed such that it requires a XOM jar to be present at build time, but once you have the binary jar, if you don't want to use XOM, you don't need to have XOM present. How big a problem is this with Maven?

It also seems bad to make breaking changes (shuffle around the Java package naming structure) for the sake of Maven. saxtree is useful on its own and it's possible to run htmlparser in a mode that doesn't use it. However, the htmlparser API is such that not using saxtree is a run-time decision, so saxtree needs to be a hard dependency for htmlparser in order not to break API compatibility.

So I'd be OK with saxtree being spun off into a distinct Maven artifact, but I'm not happy about the idea of moving it in the package naming structure just to please Maven.

encodings is a project that has started but isn't in use and is unlikely to be finished anytime soon. It probably makes sense to throw it away and use Peter Occil's implementation.

[WARNING] * Required filename-based automodules detected: [htmlparser-1.4.jar]. Please don't publish this project to a public artifact repository! *

Do I understand correctly that some policy change prevents next step being just bumping the version number is pom.xml and publishing the same way as 8 years ago?

carlosame commented 4 years ago

I'd like to better understand the motivation for removing XOM support.

I believe that he wants to remove XOM because it is not modularized.

shuffle around the Java package naming structure) for the sake of Maven

Neither Maven nor JPMS require that (in the specific case of this project). The motivation to move saxtree is unclear to me.

encodings is a project that has started but isn't in use and is unlikely to be finished anytime soon.

Looks like a good candidate for a package that is not to be JPMS-exported (if not removed finally).

Do I understand correctly that some policy change prevents next step being just bumping the version number is pom.xml and publishing the same way as 8 years ago?

They were considering to automatically enforce the rejection, but do not think that they finally did it. In any case, this is only going to affect the Maven deployment if your build is a JPMS-modularized one.

The warning is mostly due to concerns about a possible 'DLL hell' if explicit module names and filename-based names are mixed together. That's why people are encouraged to declare a name ASAP.

anthonyvdotbe commented 4 years ago

I also want to get rid of jchardet and the ICU detectors. Neither represents present-day browser-compatible behavior. Ideally, these would be replaced with chardetng, but adding JNI dependencies isn't really in the Java culture, and I haven't seriously investigating compiling it into Java byte code. (asmble might be an avenue.)

asmble looks interesting indeed. Another alternative might be to run it on GraalVM. There's also project Panama, but that's WIP and focused on C for now. Either way, just to set expectations, supporting chardetng is out of my league :)

I'd like to better understand the motivation for removing XOM support. I gather that Maven ships both source packages and binary packages. The parser is designed such that it requires a XOM jar to be present at build time, but once you have the binary jar, if you don't want to use XOM, you don't need to have XOM present. How big a problem is this with Maven?

tl;dr I'll leave XOM support untouched, since it doesn't pose a problem to neither Maven, nor Java modules.

(The motivation was: we should start with a clean slate and drop support for anything that isn't itself modularized (or is likely to be anytime soon). And modularizing XOM will be hard, because it depends, both directly & indirectly, on xml-apis:xml-apis (the version that is currently depended on by htmlparser, 1.1, uses its predecessor, xerces:xmlParserAPIs). The problem with these artifacts, is that they package APIs which are part of Java SE's java.xml module. And it's not possible to use 2 modules that define the same package (at least not in the same ModuleLayer).)

It also seems bad to make breaking changes (shuffle around the Java package naming structure) for the sake of Maven. saxtree is useful on its own and it's possible to run htmlparser in a mode that doesn't use it. However, the htmlparser API is such that not using saxtree is a run-time decision, so saxtree needs to be a hard dependency for htmlparser in order not to break API compatibility.

No problem, the package structure will be retained.

(The motivation for asking about moving it into the htmlparser package was: "I don't readily see how this can be useful on its own, and moving it inside htmlparser would allow to limit ourselves to a single module")

So I'd be OK with saxtree being spun off into a distinct Maven artifact, but I'm not happy about the idea of moving it in the package naming structure just to please Maven.

encodings is a project that has started but isn't in use and is unlikely to be finished anytime soon. It probably makes sense to throw it away and use Peter Occil's implementation.

Given the above, I'm proposing to remove the encodings package, and split off saxtree in its own Maven module. So we would have 2 Maven modules, each mapping to their own Java module, namely nu.validator.htmlparser, and nu.validator.saxtree. Package names will remain unchanged. @hsivonen Are you OK with this?

[WARNING] * Required filename-based automodules detected: [htmlparser-1.4.jar]. Please don't publish this project to a public artifact repository! *

Do I understand correctly that some policy change prevents next step being just bumping the version number is pom.xml and publishing the same way as 8 years ago?

I'm not sure I understand your question, but I can explain what the warning means: JARs which aren't modularized (i.e. don't contain a module-info.class file), can be used as modules nonetheless (in order to ease migration, such JARs are called automatic modules). As an intermediary step, library maintainers can add an entry in the manifest (as in #13), which asserts "once this library is modularized, its module name will be as given". Now, when an umodularized JAR is put on the module path, Java checks that entry to determine its module name. If it doesn't have it, Java determines the module name from the JARs filename. The problem is: if I want to modularize my library, foo, which depends on library bar, where bar isn't modularized yet, I can write:

module foo {
    requires bar;
}

but as long as bar doesn't at least have the manifest entry to assert its future module name, this is risky and I shouldn't publish my library (since the module name is derived from a JARs filename and may still change in the future either way). And that's what the warning says: you shouldn't publish a library whose module descriptor requires an artifact whose module name is filename-based.

carlosame commented 4 years ago

Something that has not been mentioned: you are already declaring a module name, just OSGi instead of JPMS:

Bundle-SymbolicName: nu.validator.htmlparser

So you are already shipping those packages under that name (not sure if you have real OSGi users though).

carlosame commented 4 years ago

moving it inside htmlparser would allow to limit ourselves to a single module [...] split off saxtree in its own Maven module. So we would have 2 Maven modules, each mapping to their own Java module

There is no need to have two modules, you can have the two packages under the same module.

anthonyvdotbe commented 4 years ago

moving it inside htmlparser would allow to limit ourselves to a single module [...] split off saxtree in its own Maven module. So we would have 2 Maven modules, each mapping to their own Java module

There is no need to have two modules, you can have the two packages under the same module.

Yes, but from the following:

it follows that saxtree should be split off into its own Maven module

carlosame commented 4 years ago

Modules are intended to group related packages, and saxtree is closely related to the other package(s). If your modification proceeds, users would have to change their setup for the additional artifact, as opposed to just bumping the version number in the Maven POM or whatever they were using.

Not to talk about potential additional work for those in the validator-nu branch, as they would need to adapt to the changes in master if they want to stay somehow in sync.

Let's see what @hsivonen thinks though.

sideshowbarker commented 4 years ago

Not to talk about potential additional work for those in the validator-nu branch, as they would need to adapt to the changes in master if they want to stay somehow in sync.

I’ve come around to realizing that the best way to avoid any problems with keeping the validator-nu branch in sync is to just end-of-life that branch, as proposed in #17.

I also want to retract something related that I wrote here earlier:

So I’m not super enthusiastic about the prospect of changes being made to the HTML parser sources if those changes would end up meaning that I’d then need to put significant time into making fixes myself to the HTML checker to align it with the HTML parser changes.

I realize now it wasn’t really relevant to mention that here — because for one thing, the mechanism the HTML checker uses to build the integrated parser is that it directly walks the src tree here to collect the pathnames for the source files, and then essentially just calls javac on those. So in hindsight I can’t really imagine what kind of changes to the sources would require me to make any significant changee to the HTML checker’s build of the integrated parser.

Also, any choice of build tools we end up deciding on within the parser repo itself isn’t gonna affect the checker build of the parser. I would just be keeping the checker build of the parser the same anyway, regardless.

So please ignore anything I wrote here previously about needing to consider the HTML checker usage of the parser when making decisions about how we want to do things here going forward.

Whatever we decide, it should just be based on trying to determine what’s best for the wider community of users for the parser, and current best practices.

anthonyvdotbe commented 4 years ago

To summarize, here's the open questions I'd like confirmation on. Is it ok to...:

I also filed https://github.com/elharo/xom/issues/142 w.r.t. XOM support: to be able to publish modular JARs, we'd need a XOM release which has at least Automatic-Module-Name set, but such release is currently not available yet (technically this isn't true: we could manually change the name of the XOM jar to match the intended module name when publishing, but that's just messy, imho).

carlosame commented 4 years ago

Java SE 8 is pretty much EOL by now

There are about 60% of developers still on Java 8, according to latest surveys, and OpenJDK 8 is supported until May 2026. Figures are between 3 and 7% for Java 7 usage.

My project produces multi-release JARs valid for Java 8 and 11 (7 and 11 for 1.0), and a dependency that only runs on 11 would be useless.

anthonyvdotbe commented 4 years ago

My project produces multi-release JARs valid for Java 8 and 11 (7 and 11 for 1.0), and a dependency that only runs on 11 would be useless.

Good point, I scrapped the question: the sources themselves will still be compiled for Java SE 8.

carlosame commented 4 years ago

the sources themselves will still be compiled for Java SE 8.

Not sure that I follow you here: do you mean that the source level will be 8, or that you compile targeting Java 8?

The former would be irrelevant to the JAR, and the latter rules out module-info and, thus, modularization.

anthonyvdotbe commented 4 years ago

With "the sources themselves", I meant: "everything but module-info.java". So: module-info will be compiled with --release 11, and everything else will be compiled with --release 8

carlosame commented 4 years ago

With "the sources themselves", I meant: "everything but module-info.java". So: module-info will be compiled with --release 11, and everything else will be compiled with --release 8

module-info.java is a source as well, so we are talking about a multi-release jar now.

anthonyvdotbe commented 4 years ago

Strictly speaking it won't be a multi-release JAR, but even it it were: I don't see what point you're trying to make? The JAR will work as a modular JAR on Java SE 11+, and as a plain old JAR on Java SE 8.

carlosame commented 4 years ago

I don't see what point you're trying to make?

My point is to try to clarify what you are proposing, because I suggested to target different releases in my first response to your post (where you talked about keeping source level 1.5), and your last summary still involved a Java 11-only thing.

Saying "the sources themselves will still be compiled for Java SE 8" in the context of a modular project is something that needs clarification.

An then, I wonder why Java 8 and not 7.

anthonyvdotbe commented 4 years ago

where you talked about keeping source level 1.5

I never talked about keeping source level 1.5. I said:

[...] upgrade the compiler level to 11 as well (the code can stay on the 1.5 syntactic level. I'm not planning to make any code changes, apart from package declarations and import statements)

which I thought was sufficiently clear to mean:

I'm planning to use --release 11 to compile, but I won't make code changes that require to compile with a --source value that is higher than what's currently used by the project, which, looking at the POM, is 1.5 (apart from introducing module-info.java files, obviously)