javacc21 / javacc21

JavaCC
Other
69 stars 16 forks source link

Generated constructors mean that it's hard to ensure lexers/parsers are properly initialized #112

Open vsajip opened 2 years ago

vsajip commented 2 years ago

The generated code for the preprocessor contains the following definitions:

    private final BitSet lineMarkers = new BitSet();

    public PreprocessorParser(Path path, Map<String, String> definedSymbols, boolean csharpMode) throws IOException {
        this(path);
        this.csharpMode = csharpMode;
        if (definedSymbols != null) addSymbols(definedSymbols);
        token_source.switchTo(LexicalState.PP_SCAN_LINES);
    }

    public PreprocessorParser(String inputSource, CharSequence content) {
        this(new PreprocessorLexer(inputSource, content));
    }

    public PreprocessorParser(CharSequence content) {
        this("input", content);
    }

    public PreprocessorParser(String inputSource, Path path) throws IOException {
        this(inputSource, FileLineMap.stringFromBytes(Files.readAllBytes(path)));
    }

    public PreprocessorParser(Path path) throws IOException {
        this(path.toString(), path);
    }

    public PreprocessorParser(java.io.InputStream stream) {
        this(new InputStreamReader(stream));
    }

    public PreprocessorParser(Reader reader) {
        this(new PreprocessorLexer("input", reader));
    }

    /** Constructor with user supplied Lexer. */
    public PreprocessorParser(PreprocessorLexer lexer) {
        token_source = lexer;
        lastConsumedToken.setInputSource(lexer.getInputSource());
    }

All except the first are generated from the template. The problem with the proliferation of constructors is that there is no one place where you can put initialization code and guarantee that it will be called no matter which constructor is used.

For example, if we want to initialize lineMarkers in the first constructor above, we can remove its initializer from the field declaration and then initialize it in the injected constructor; but that's not enough. You have to remove the final qualifier from it as well, else compilation would fail; and that highlights the fact that none of the other constructors would initialize it.

Another approach seems to be needed ... for example, a private commonInit() method that is called from every constructor and which does the initialization common to all constructors, and which is automatically called from all constructors, generated or injected. I'm not sure if the Java constructor would allow final fields to be initialized in such a method, though. Is there some other approach that can be taken? Perhaps the generated constructors can be omitted if there is at least one injected constructor, leaving it to the user to define multiple constructors in the grammar if they want support for Readers, InputStreams etc. It's then their responsibility to do correct construction, and we don't give them any foot-guns in the form of auto-generated constructors.

The foregoing applies to lexers, too, of course. What do you think?

revusky commented 2 years ago

I just realized I had never answered this one. I think that, generally, there is a need (or it least it would be desirable) to streamline the generated constructors. It is generating 7 different constructors in XXXParser and that is supposed to be some sort of convenience but I have doubts that it really is. More like just a source of confusion. I think that constructors that take a java.io.Reader or java.io.InputStream should probably just be got rid of, since, with the latest round of refactoring, we're not buffering input when we read in a file anyway, so in those cases, the whole thing is now kind of based on a fiction. Now we just slurp the whole thing in and and put it in a character buffer basically. There are, of course, a few messy details when we read it in, like the encoding or whether we preserve tabs or CR-LF and things like this. But we just deal with all that in one fell swoop when we read in the file and I think that's really just a lot simpler and more maintainable. Basically, the parser/lexer we generate works on a character buffer in memory. We can have a constructor or two that takes java.nio.file.Path and that should take care of reading a file off the disk mostly.

It occurs to me that there may be a bit of confusion now as to whether, when you pass in a CharSequence as a parameter to the constructor, whether the content has already been "munged". I mean, if you are passing in a character buffer in which all the conversions have already taken place, all Java unicode escapes converted just to characters for example, it is incorrect to run over it again trying to convert the escapes. I mean, there should be greater clarity as to whether you're passing in a character buffer that needs conversions done or not. If you pass in a Path, we are reading from the disk, so we do what needs to be done, like figure out the right character encoding, which is hardly much of an issue nowadays, because everybody just uses UTF-8, it seems. In fact, the thing about being able to specify character encodings via a regexp at the top of the file, which is part of the Python spec, I still haven't got round to implementing that (though it's on my mental todo list). Actually, my motivation for doing that is mostly so that a few more of the files that don't parse (there are 6 currently) in the standard Python lib will now parse. Those are just some test-case files. Again, it's almost totally for nothing. Every modern system uses UTF-8 now, AFAICS...

As for the rest of it, it doesn't really seem to me that there is any particular need for some of these fields to be final anyway. I just wrote them initially like that, thinking that there was more clarity that way, but if it causes a problem, just get rid of the "final" modifier, I would say. Them being private is plenty good enough, it seems to me.

Gravelbones commented 2 years ago

There should be a way to do character set conversion on the way in, but leave all the other munging to the lexer. The project I am working on, works on files that dates many decades back. Before Unicode came along. Yes, I do have to figure out, which of the many DOS codepages or any of the other character sets, most people have long forgotten, was used. But even today there are other characters sets than unicode, frigging windows comes to mind, with their codepage 1252. But also https://en.wikipedia.org/wiki/MARC-8 from Library of Congress, In my previous job, we had around 8 different character encodings in our input, over a decade after unicode. So to keep this project valuable you should be careful to assume, everyone uses unicode. But the main problem of this issue, is still the initialization. The C/C++ preprocessor I am working on, will need to initialize some predefined symbols. Best done in one place.