Decide on the way forward

validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/

Other

56 stars 26 forks source link

Decide on the way forward #26

Open anthonyvdotbe opened 4 years ago

anthonyvdotbe commented 4 years ago

@hsivonen: as the de facto owner of this repository, I'd like to urge you to decide on the way forward, because I feel things are getting out of hand: @carlosame and I have different views on how to proceed, and we're both investing quite a bit of our spare time in implementing/defending our point of view. I believe that it's in everyone's interest for a final decision to be taken sooner rather than later. (I understand if modularization is not a priority for neither you nor your employer, but I believe it's not unreasonable to ask some of your time to settle this debate.)

1. Java modularization

There are plenty of options, starting from the most fine-grained modularization:

one module per XML API, with 2 base modules
- htmlparser.dom -> htmlparser.common
- htmlparser.sax -> htmlparser.common, saxtree
- htmlparser.xom -> htmlparser.common
merge all htmlparser modules: htmlparser + saxtree
merge all modules into 1: htmlparser
merge all modules into 1, except the XOM module: htmlparser + htmlparser.xom
some other combination:
- htmlparser + htmlparser.xom + saxtree
- htmlparser.jaxp + htmlparser.xom + htmlparser.common + saxtree
- ...

Q1: which combination makes most sense to you? (edit: please note that this question is about modules as a generic software architectural concept, not about Java modules specifically. So any Java technicalities can be disregarded when answering this question. I've merely named it "Java modularization" in the title to clarify that it's about the conceptual modularization of the code, not about how the code is organized in Maven modules or Git submodules or anything like that)

2. Project organization

Assuming there's at least 2 modules in the chose combination, there's multiple options on how to organize them:

a single multi-module Maven project in a single git repo
multiple single-module Maven projects in multiple git repos
some other organization

(since @carlosame claimed that the first option would "complicate the workflow [...] and expose the project to a new category of Maven bugs": I claim the exact opposite, so please disregard any Maven-related concerns)

Q2: which option works best for you & @sideshowbarker?

3. upcoming releases

Edit: given that we agree that backward compatibility will be broken as needed to implement the decisions for 1. and 2., I'm fine with a 1.5 release as it is (except for the use of version ranges, but I'll argument that in the PR).

I'd like to propose that, no matter the answer to Q1 and Q2, version 1.5 is released with minimal changes, i.e. only changes that are required to resolve #17. In particular: no automatic module name, no changes to the dependency versions, ... Once 1.5 is published, we can then implement 1. and 2. as decided for a 2.0 release.

~~Q3: do you agree with this proposal?~~

Thanks in advance for reading & answering the 3 2 questions above, and thereby bringing back peace and quiet to this repository.

Edit: the comment below demonstrates exactly why I'm saying it's in everyone's interest for a final decision to be taken sooner rather than later: there's a lot of frustration on both sides, and both sides feel the other is counterworking them. That's why I'm asking you to take an authoritative decision on this matter, so we can both accept your decision and move forward on implementing it.

carlosame commented 4 years ago

You are asking questions that people not familiarized with modules cannot answer easily. And no, there aren't 5 or more module combinations that make sense (per your Q1). The maintainers of this project hoped that we could reach some agreement, but it is obviously difficult if not impossible.

everyone's interest for a final decision to be taken sooner rather than later

No, not everyone is in a hurry about that. It has been explained that there won't be any answer before August (and note that "not before August" is not the same as "in August"), and I am fine if for example my question in issue #25 is answered in August or September.

And that's because in the meantime, we are (probably) going to have 1.5 with the automatic module name.

version 1.5 is released with minimal changes [...] In particular: no automatic module name, no changes to the dependency versions

It is not reasonable to block the solution to a serious problem (the lack of a declared module name) because of hypothetical backwards compatibility issues in 2.0 (which is going to be called 2.0 precisely because some backward incompatibilities are possible). My proposed XOM module split (#25) for 2.0 would affect the subset of developers that use htmlparser + XOM, which is likely to be small (while your proposal for a saxtree module would be more invasive and still keeps the XOM problem inside).

Finally, I try to keep a constructive dialogue but your rejection of dependency version upgrades looks like opposing just for the sake of it. Could we spare this modularization drama for now, release 1.5 and retake the matter sometime later?

carlosame commented 4 years ago

Replying to your edit of the original post:

I'm fine with a 1.5 release as it is (except for the use of version ranges, but I'll argument that in the PR).

This is a library and not the final application, so using carefully chosen version ranges is a good practice (albeit not mandatory). I gave more details in my response to your comment in the PR.

hsivonen commented 4 years ago

First of all, I acknowledge the annoyance of having a main branch owner who isn't up-to-speed with Java Modules (it's taken me way more time in terms of calendar distance than I hoped to get some idea of Java Modules) and also is as slow to respond as I have been. Sorry.

Currently, technically the compile-time-hard and run-time-optional dependency on both encoding detectors and on XOM are handled the same way. I see an issue open about XOM, but XOM seems to have more degrees of freedom for the solution, since if you use XOM, you already choose a XOM-specific API entry point anyway. However, the encoding detectors are something you can optionally enable regardless of entry point, so in that sense solving those should solve XOM, too, unless it's a problem for XOM types to appear in the outward API.

Considering the way Java worked up to and including 8, it would have been a backward compatibility bug to change the fully-qualified name of a class that remains otherwise compatible. That is, if you wrote an app with htmlparser.jar in the Java 5 days (and didn't explicitly call the HTML4-enablement methods that will be removed for 2.0), I think it's prima facie a bug if you have to make source changes to your app when you drop in the Java Modules-enabled future nu-validator-htmlparser.jar.

So far, I've understood that there are now restrictions on what packages nu-validator-htmlparser.jar is allowed to provide, and that it's not allowed to provide nu.validator.saxtree. Correct? OK, minting another .jarfor that, while disappointing, isn't that bad, since callers don't need to change any fully-qualified names.

https://github.com/validator/htmlparser/issues/25 says that while requires static would work, it would defeat the purpose of modularization for the public API of the nu.validator.htmlparser module to expose types from nu.xom if nu.xom is optional.

While I can guess that "formally wrong" is bad, is there some articulation of how the badness would manifest in practice?

OTOH, https://github.com/carlosame/htmlparser/tree/xom-removed/src/nu/validator/htmlparser and https://github.com/carlosame/htmlparser-xom/tree/master/src/nu/validator/htmlparser/xom suggest that the module system does allow a module called nu.validator.htmlparser to ship packages called nu.validator.htmlparser.foo while a module called nu.validator.htmlparser.xom ships a sibling package named nu.validator.htmlparser.xom. Since that still doesn't change any fully-qualified names, I guess that isn't any worse than having to ship nu-validator-saxtree.jar as a separate jar.

Considering that the validator project has over the span of its existence gone from me spreading stuff over multiple repos to @sideshowbarker merging some of the repos, I'm a bit uneasy about additional repos and am leaning towards keeping the XOM stuff in this repo if permitted by the rules of Modules and Maven.

Regarding a Maven dependency, I'd hope that after all changes, it would still be possible to point Eclipse to the source directories and the dependency jars and have stuff build without Maven.

As for JDK target, I believe the source code is still Java 5-compatible. Not using new language constructs matters for the Java-to-C++ translator, but actually running the code on a Java 5 JVM is unlikely to be a use case worth supporting. Running on OpenJDK 8 and recent-ish Android does seem relevant, though. Do I understand correctly that modularized jars can be released with the byte code compatibility level set to 8 and then pre-Modules user just dumps all the jars in the classpath, the older JVM ignores the Module manifests, and that's it?

Going back to the encoding detectors: Ideally, the parser would depend on a Java port of chardetng, but one does not exist. The ICU detector isn't very good. In the absence of a Java port of chardetng, jchardet can make some sense. Do I understand correctly that the blocker for depending on it even via requires static is that the project itself doesn't publish with a module manifest?

I've never used the RPM and OSGi stuff myself. Someone contributed them at some point. I have no idea if someone still cares. Probably prudent to leave that stuff in for 1.5 if it's easy to do so, drop that stuff for 2.0 and see if anyone complains.

carlosame commented 4 years ago

it would have been a backward compatibility bug to change the fully-qualified name of a class that remains otherwise compatible

Well, none of the changes being discussed involves changing the fully-qualified name of any class (except for the conversion that I did of the unit test in the proposed new XOM repo where I put it in its parent package, but I do not think that you are talking about this).

you have to make source changes to your app when you drop in the Java Modules-enabled future nu-validator-htmlparser.jar

No you do not need to make source changes if you use Java 7 or 8 (at least if we discuss my modularization patches, are we?). If you use later JDKs, you need to require the declared package name, that's it. And the jar file can be htmlparser.jar as always.

it's not allowed to provide nu.validator.saxtree. Correct?

There is no such a restriction.

is there some articulation of how the badness would manifest in practice?

The point of modularization is to know which modules you require to do certain work. If you start saying "well this may require this and that, or it may not", then modularizing is moot. My point is: if you want to keep the Jar file as it is (containing all the current packages), there is no adavantage in doing a full modularization (which brings its own problem with it: filename-based dependencies), and the project should be happy with the automatic module name.

am leaning towards keeping the XOM stuff in this repo

That's fine, but then my above comment applies.

(PS: it would be great if somebody volunteered to write at least one unit test for XOM :slightly_smiling_face:)

it would still be possible to point Eclipse to the source directories and the dependency jars and have stuff build without Maven.

Yes Maven is just a deployment-stage thing (although you can use a special "Maven" kind of Eclipse project, but I personally do not like them).

Just to clarify: you do not need to put Maven as a dependency in Eclipse.

modularized jars can be released with the byte code compatibility level set to 8 and then pre-Modules user just dumps all the jars in the classpath, the older JVM ignores the Module manifests, and that's it?

If we are discussing my modular patches (the only ones that have been presented) they are compatible with Java 7 and 8, and at the same time modular JDKs (JDK 11+) can get their module-info. I have been using this configuration successfully in css4j since I set it up about a year ago.

the blocker for depending on it even via requires static is that the project itself doesn't publish with a module manifest?

jchardet does not supply an automatic module name in its MANIFEST.MF. This is a PITA (as this project currently is...), but in this particular case it could be less harmful because I do not expect a modular jchardet to ever exist (the project seems abandoned). So in principle, no clash between 'old' and 'new' module names in that case (still bad though).

anthonyvdotbe commented 4 years ago

First of all, I acknowledge [...]

Thanks, I appreciate it.

Considering the way Java worked up to and including 8, it would have been a backward compatibility bug to change the fully-qualified name of a class that remains otherwise compatible. That is, if you wrote an app with htmlparser.jar in the Java 5 days (and didn't explicitly call the HTML4-enablement methods that will be removed for 2.0), I think it's prima facie a bug if you have to make source changes to your app when you drop in the Java Modules-enabled future nu-validator-htmlparser.jar.

Actually, I would like to move some internal classes in nu.validator.htmlparser.common (e.g. the Interner interface) to a non-exported package as part of the modularization in 2.0. So in theory this would require source changes, but in practice I doubt anyone would (and believe no one should) be relying on any of those internal classes. Other than this, no source changes would be required.

So far, I've understood that there are now restrictions on what packages nu-validator-htmlparser.jar is allowed to provide, and that it's not allowed to provide nu.validator.saxtree. Correct?

No, the only technical restriction is that 2 Java modules cannot contain the same package (no matter whether they're exported).

However, a guideline to avoid running into that restriction, is to use the package name of the top-most package as the module name (and to avoid module names that are a prefix of another module name, except for aggregate modules). So if we want to follow that guideline, there ought to be at least 2 modules, nu.validator.htmlparser and nu.validator.saxtree (or the module should be named nu.validator, or nu.validator.saxtree should be moved to nu.validator.htmlparser.saxtree).

W.r.t. modularization, I believe the library should be geared towards the users that "simply" want to parse HTML5 via one of the 3 APIs. So another reason why I'd like 2 separate Java modules, is that allows to effectively hide the saxtree package from users of the htmlparser module. With a single module, the saxtree package would need to be exported, but would be noise to nearly all of the users.

OK, minting another .jarfor that, while disappointing, isn't that bad, since callers don't need to change any fully-qualified names.

What makes you say it's disappointing?

Regarding a Maven dependency, I'd hope that after all changes, it would still be possible to point Eclipse to the source directories and the dependency jars and have stuff build without Maven.

Yes, that'll still be possible, no matter how the modularization is done. (I'm curious why you want Eclipse to build stuff though, rather than relying on Maven or Gradle. If you open a Maven project with Eclipse (i.e. no .classpath, .project etc., just a pom.xml), then Eclipse will simply generate those files by looking at the POM.)

As for JDK target, I believe the source code is still Java 5-compatible. Not using new language constructs matters for the Java-to-C++ translator, but actually running the code on a Java 5 JVM is unlikely to be a use case worth supporting. Running on OpenJDK 8 and recent-ish Android does seem relevant, though. Do I understand correctly that modularized jars can be released with the byte code compatibility level set to 8 and then pre-Modules user just dumps all the jars in the classpath, the older JVM ignores the Module manifests, and that's it?

Exactly.

Going back to the encoding detectors: Ideally, the parser would depend on a Java port of chardetng, but one does not exist. The ICU detector isn't very good. In the absence of a Java port of chardetng, jchardet can make some sense. Do I understand correctly that the blocker for depending on it even via requires static is that the project itself doesn't publish with a module manifest?

Yes. Either jchardet support should be removed, or a new version with a module descriptor should be published. An alternative is to just "fork" it (e.g. create a new repo under https://github.com/validator, or just add it as a separate module in this repo).

Finally, I'd like to reiterate that, in my opinion, Q1 in my OP is the essential question here. I.e. we shouldn't be concerned about project organisation in terms of Maven modules and/or Git repos at this point: no matter what the answer to Q1 is, it's possible to organise the project however you want (though some combinations wouldn't make sense, of course. For example if we'd go with a single Java module, it wouldn't make sense to use multiple Maven modules or Git repos). In this regard, what are your thoughts on the first point?

hsivonen commented 4 years ago

Actually, I would like to move some internal classes in nu.validator.htmlparser.common (e.g. the Interner interface) to a non-exported package as part of the modularization in 2.0.

What problem would this solve?

So in theory this would require source changes, but in practice I doubt anyone would (and believe no one should) be relying on any of those internal classes.

TransitionHandler there was added for NetBeans. Not sure what its usage is today.

nu.validator.htmlparser.impl is deliberately public so that you can write TreeBuilder subclasses that aren't provided by this project. (Also, spinning of XOM into a different module would require TreeBuilder to be subclassable from that module.) Things under common are visible in the signatures of impl.

What makes you say it's disappointing?

Mainly having to replace one jar with many when upgrading and the resulting proliferation of jar. Maybe that's not a real problem.

Yes. Either jchardet support should be removed, or a new version with a module descriptor should be published. An alternative is to just "fork" it (e.g. create a new repo under https://github.com/validator, or just add it as a separate module in this repo).

It's disappointing if the existing jar can't be used as-is. Considering that I'd like to get rid of jchardet eventually, I'm really not keen on de facto becoming responsible for jchardet's distribution beyond its usage by htmlparser. Realistically, though, the replacement I want is vaporware without a concrete path into becoming real software, so in that sense "eventually" can be expected to be far off anyway.

(That is, porting encoding_rs and chardetng to Java are not going to happen as part of my job. Writing a JNI wrapper for them seems like an educational hobby project of realistic scope, but a JNI dependency wouldn't be viewed favorably by the Java community for real-world deployment. Doing a translation that would result either .java files or Java bytecode seems like too large for what I have hobby project time for.)

Is there really no backward compatibility mechanism that would allow the existing jchardet to be used as-is? @carlosame's most recent comment suggests that there is.

Finally, I'd like to reiterate that, in my opinion, Q1 in my OP is the essential question here.

I'm leaning towards three Java Modules:

nu.validator.htmlparser, nu.validator.saxtree, and nu.validator.htmlparser.xom.

However, this is based on a pre-Modules view of the world where DOM and SAX are assumed to be always present, so perhaps it would make sense to treat sax and dom the same way as xom, i.e.:

nu.validator.htmlparser.sax depends on nu.validator.htmlparser, nu.validator.saxtree, java.xml, and java.base.
nu.validator.htmlparser.dom depends on nu.validator.htmlparser, java.xml, and java.base.
nu.validator.htmlparser.xom depends on nu.validator.htmlparser, nu.xom, and java.base.

Considering that nu.xom depends on SAX, it's unclear to me what problem this would solve other than allowing custom tree builders that use none of SAX, DOM, or XOM not to depend on java.xml.

Since you've used Modules and I haven't, do you see any benefit from decoupling the core of the parser from java.xml Module-wise?

Regardless of the module division, I'd prefer all these to stay in this git repo. (As noted, we used to have more repos for the validator as a whole and moved towards having fewer.) What that means for Maven, I don't know.

anthonyvdotbe commented 4 years ago

My point of departure is: only export each of the dom/sax/xom packages plus the "transitive closure" of their API. So in practice this means: additionally export the common package, except for the following interfaces: ByteReadable, EncodingDeclarationHandler, Interner and TokenHandler. (I've only found 2 projects which make advanced use of this project: env-js, which hasn't seen a commit in 9 years, and NetBeans, which doesn't use any of these interfaces. So I believe that moving them would not make a difference in practice. And even if it would, the loss of backward compatibility would be more than compensated by the gain of a minimal exported API.)

By only exporting a minimal API:

classes which are not of interest to users don't show up in the Javadoc, IDE autocompletion suggestions, etc.
maintainers have maximum flexibility w.r.t. backward compatibility

While I understand TreeBuilder is a supported API, I believe the impl package shouldn't be exported, at least not as-is and not without actual demand:

this is much like saxtree: it's useful on its own, but its direct usage is likely limited to a small percentage of this project's consumers
it's a bad package name for an exported package (in fact, given its name, I doubt many users will even have realized that this is supported API)
exporting an additional package later on is a trivial change, whereas "unexporting" one is a breaking change
even if it isn't exported, consumers can trivially circumvent any restrictions (In the case of NetBeans, it builds from source & applies custom patches (source), so they could simply add a patch which exports all packages they want)

W.r.t. project organization, I propose the following: a single Git repo, containing a single multi-module Maven project, containing a number of Maven modules, each defining a single Java module.

For the Java modules, I'd start with your first division + nu.validator.jchardet, where nu.validator.htmlparser would only export 3 packages: common, dom, sax. Then when there's an actual demand for exporting additional packages, we could see how to best meet that demand (it's trivial to create a GitHub issue, so I'm sure people would do so soon enough if they're relying on some of the unexported API. And in the meantime they could just apply the necessary command-line flags or put the JARs on the classpath to work around any module restrictions). One possible solution would be something like the following in that case:

* `nu.validator.jchardet`               { exports `jchardet`;                                                                                                                                            }
* `nu.validator.saxtree`                { exports `saxtree`;                                                                                                                                             }

* `nu.validator.htmlparser.common`      { exports `common`;      requires transitive `java.xml`;                                                                                                         }

* `nu.validator.htmlparser.base`        { exports `treebuilder`; requires transitive `nu.validator.htmlparser.common`;           requires `nu.validator.jchardet`;                                       }
* `nu.validator.htmlparser.dom`         { exports `dom`;         requires transitive `nu.validator.htmlparser.common`;           requires `nu.validator.htmlparser.base`;                                }
* `nu.validator.htmlparser.sax`         { exports `sax`;         requires transitive `nu.validator.htmlparser.common`;           requires `nu.validator.htmlparser.base`, `nu.validator.saxtree`;        }
* `nu.validator.htmlparser.xom`         { exports `xom`;         requires transitive `nu.validator.htmlparser.common`, `nu.xom`; requires `nu.validator.htmlparser.base`;                                }

Some notes:

the only difference with your second division, is that I've further split nu.validator.htmlparser into a common and base module. This allows to maintain the minimal API in the dom/sax/xom modules
even though this looks complex, 99% of the commits would only affect the nu.validator.htmlparser.base module (I made up the number after looking at the history & noting that other packages have barely seen 2 commits in the past 6 years)
aggregate modules, which contain nothing more than a module descriptor which transitively requires other modules, can be used to maintain backward compatibility after splitting a module (e.g. nu.validator.htmlparser would just be an aggregate of nu.validator.htmlparser.dom and nu.validator.htmlparser.sax)

What makes you say it's disappointing?

Mainly having to replace one jar with many when upgrading and the resulting proliferation of jar. Maybe that's not a real problem.

I don't see the problem: when I update a dependency declaration in the POM, I hardly ever notice that the new version has introduced additional transitive dependencies.

Yes. Either jchardet support should be removed, or a new version with a module descriptor should be published. An alternative is to just "fork" it (e.g. create a new repo under https://github.com/validator, or just add it as a separate module in this repo).

It's disappointing if the existing jar can't be used as-is. Considering that I'd like to get rid of jchardet eventually, I'm really not keen on de facto becoming responsible for jchardet's distribution beyond its usage by htmlparser. Realistically, though, the replacement I want is vaporware without a concrete path into becoming real software, so in that sense "eventually" can be expected to be far off anyway.

(That is, porting encoding_rs and chardetng to Java are not going to happen as part of my job. Writing a JNI wrapper for them seems like an educational hobby project of realistic scope, but a JNI dependency wouldn't be viewed favorably by the Java community for real-world deployment. Doing a translation that would result either .java files or Java bytecode seems like too large for what I have hobby project time for.)

Is there really no backward compatibility mechanism that would allow the existing jchardet to be used as-is? @carlosame's most recent comment suggests that there is.

Yes, one can use any JAR as-is, and the JVM will deduce a module name from the JAR's filename. However, this is rather fragile and precludes some usages of the modules that require it (e.g. I'm pretty sure that it's impossible to use such modules with jlink to generate a custom runtime image). In the proposal above, I added it as a separate module, but if preferred, we could even add it as a non-exported package in the nu.validator.htmlparser.base module.

Considering that nu.xom depends on SAX, it's unclear to me what problem this would solve other than allowing custom tree builders that use none of SAX, DOM, or XOM not to depend on java.xml.

Since you've used Modules and I haven't, do you see any benefit from decoupling the core of the parser from java.xml Module-wise?

Maybe I'm misunderstanding your question, but I doubt it's possible to decouple any piece of the parser from java.xml, since e.g. SAXException is used in the public API of pretty much every package.

hsivonen commented 4 years ago

classes which are not of interest to users don't show up in the Javadoc, IDE autocompletion suggestions, etc.

The clutter in Javadoc doesn't look too bad. Not sure about how much the internals show up in IDE autocomplete in practice for people who don't work on the internals.

maintainers have maximum flexibility w.r.t. backward compatibility

AFAICT, in the 13-year lifespan of this project, the one API-breaking change (as opposed to parser behavior correctness change) is the removal of support for the HTML4 mode, which would not have become non-breaking had the change you propose been made ahead of time.

nu.validator.htmlparser.xom.HtmlBuilder has these nu.validator imports:

import nu.validator.htmlparser.common.CharacterHandler;
import nu.validator.htmlparser.common.DocumentModeHandler;
import nu.validator.htmlparser.common.Heuristics;
import nu.validator.htmlparser.common.TokenHandler;
import nu.validator.htmlparser.common.TransitionHandler;
import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.impl.ErrorReportingTokenizer;
import nu.validator.htmlparser.impl.Tokenizer;
import nu.validator.htmlparser.io.Driver;

nu.validator.htmlparser.xom.XOMTreeBuilder has these nu.validator imports:

import nu.validator.htmlparser.common.DocumentMode;
import nu.validator.htmlparser.impl.CoalescingTreeBuilder;
import nu.validator.htmlparser.impl.HtmlAttributes;

It is the design intent of the parser that a third party is allowed to write the kind of wrapper that nu.validator.htmlparser.xom is. Hiding nu.validator.htmlparser.impl seems contrary to that goal. (Whether impl is a good name is too late to debate: If we're not changing fully-qualified names, we're stuck with that name.)

Maybe I'm misunderstanding your question, but I doubt it's possible to decouple any piece of the parser from java.xml, since e.g. SAXException is used in the public API of pretty much every package.

Good point. I had forgotten about that.

Additonal observation: Enabling normalization checking depends on ICU4J.

@carlosame, @sideshowbarker, do you see value in the common/base split that @anthonyvdotbe suggests?

carlosame commented 4 years ago

do you see value in the common/base split

The base module exports treebuilder which is a new package of unknown contents, so nothing to comment about. I'd rather see the actual code to check whether this breaks backwards compatibility.
The common package bytecode is barely 5K compressed, not even 9K uncompressed. Modules are intended to avoid classpath clutter, but I believe that a 5/9 Kb gain is not worth the extra effort to handle the additional module (with its jar file).

And BTW exporting jchardet goes in the wrong direction (while at the same time the proposal is not exporting io nor the other packages).

sideshowbarker commented 4 years ago

do you see value in the common/base split that @anthonyvdotbe suggests?

I don’t personally see value in it. But I’m not a domain expert around the fine points of packaging, and I don’t feel strongly enough about it that I’d object to it. That said, I have a general preference for keeping things as simple as possible.

anthonyvdotbe commented 4 years ago

Fair enough, so I think we can agree on the following:

3 Java modules: nu.validator.htmlparser, nu.validator.htmlparser.xom, nu.validator.saxtree
no classes are moved (i.e. forget about my proposal to move Interner et al.) & all packages are exported to maintain 100% backward compatibility
organised in a single Git repo with a Maven multi-module project

And have one open question: what with jchardet? I'm ok with any proposal (mine being: move it into this repo as an additional nu.validator.jchardet module), as long as we don't have to rely on the automatic module name.

hsivonen commented 4 years ago

Fair enough, so I think we can agree on the following:

3 Java modules: nu.validator.htmlparser, nu.validator.htmlparser.xom, nu.validator.saxtree

no classes are moved (i.e. forget about my proposal to move Interner et al.) & all packages are exported to maintain 100% backward compatibility

organised in a single Git repo with a Maven multi-module project

OK.

And have one open question: what with jchardet? I'm ok with any proposal (mine being: move it into this repo as an additional nu.validator.jchardet module), as long as we don't have to rely on the automatic module name.

There's also another question whose answer might inform the answer to that question: What to do with the ICU4J normalization checking dependency. https://github.com/unicode-org/icu/blob/master/icu4j/manifest.stub says Automatic-Module-Name: com.ibm.icu, so I assume nu.validator.htmlparser would have requires static com.ibm.icu. Correct?

In any case, I think jchardet doesn't belong in this repo. I'd still like to avoid having to repackage it if possible.

@carlosame, can you corroborate the incompatibility of the existing jchardet with jlink?

carlosame commented 4 years ago

can you corroborate the incompatibility of the existing jchardet with jlink?

Both jlink and jpackage do not work well with automatic modules, however there is a tedious workaround using jdeps that I haven't tested myself. Two tools claim to do that process automatically for you though:

The JLink-JPackager Maven Plugin.
The Moditect Maven Plugin.

If you want to make this project more friendly to jlink, the proper solution would be to create a jchardet package inside the main htmlparser module and do not export it, instead of creating a specific jchardet module as @anthonyvdotbe proposes.

But in any case, if you ship jchardet classes with your jar at some point you may find some guy using Java 8 that directly calls a jchardet class from this project's jar file.

I assume nu.validator.htmlparser would have requires static com.ibm.icu

I'd rather have requires com.ibm.icu without the static, which is what my initial patchset had and I should have not changed.

And needless to say that I see no point in having a separate saxtree module as it has no realistic stand-alone use cases, my opinion has been already expressed several times (just making sure that I'm not a recipient for the "we can agree").

anthonyvdotbe commented 4 years ago

[...] so I assume nu.validator.htmlparser would have requires static com.ibm.icu. Correct?

Correct.

In any case, I think jchardet doesn't belong in this repo. I'd still like to avoid having to repackage it if possible.

Ok, let's continue using jchardet as is, since there's workarounds for the problems it poses.

@carlosame are you planning to provide a PR for this? (just asking to avoid duplicate efforts)

carlosame commented 4 years ago

@carlosame are you planning to provide a PR for this?

I already provided a modularization patch (without Maven modules) in July 5 (PR #23) and it hasn't received a lot of attention. Switching that PR to a modular Maven project (with htmlparser and xom modules) would not be difficult, but I'd rather not do that because the conversion of the repository to Maven modules (not JPMS modules) should be done by @hsivonen following instructions from the author of the modularization patches being applied later.

Maven modules involve a different directory layout (with a directory for each Maven module), so the files should be moved in a way that Git history is not completely lost (git mv). Thus, IMHO Sivonen should be the author of the mv if we want to mess with git blame as less as we could.

Responding specifically to your question, I have no plans to provide a PR with a saxtree module (and that rules me out given your agreement with @hsivonen about modularization). I suggest that you proceed with him to switch the repository to a Maven modular layout, and once you have achieved that you can add the module-info's etc. (or at least that would be my suggestion on what to do next).