snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License
757 stars 173 forks source link

JS/TS ESM version (modern JS module system vs global scope polution) #183

Open lionel-rowe opened 1 year ago

lionel-rowe commented 1 year ago

The current JS build script outputs files with the following shape (example for English):

StemmerEnglish = function() {
    // ...
}

In non-strict-mode (e.g. root-level non-modularized JavaScript), this automatically adds StemmerEnglish to the global scope, which is less-than-ideal but generally doesn't cause problems. However, in strict-mode (e.g. a JavaScript module), this throws Uncaught ReferenceError: StemmerEnglish is not defined. This makes it impossible to run on Deno, which always uses strict mode.

In addition, the JSDoc types added aren't picked up by TypeScript.

Is there interest in adding an ESM build, either as a replacement for or in addition to the current JS one? I've managed to hack one together in my fork (README, compilation logic, dynamic imports by locale code, tests), and it works nicely in Deno, newer versions of Node, and in-browser HTTP imports, with TypeScript-available types out of the box. I could probably work it into a PR with some effort (I don't know much C), but I'd need a steer on a couple of things:

ojwb commented 1 year ago

There was also #123 which I closed as abandoned (it broke CI so it needed more work before it could be merged).

The current structure is probably partly due to my inexperience with modern javascript and partly due to the code's history. Snowball's javascript backend was originally a JSX backend (not the JSX you're likely thinking of, but https://jsx.github.io/ which was "a faster, safer, easier JavaScript" that transpiled to javascript). The JSX project is inactive, so I morphed it into producing javascript code directly, with the primary personal motivation being to have a demo on the website which could work without any server-side support. That was also more than 5 years ago.

If we could provide a single output that without real drawbacks that would be simpler. Does ESM work with all browsers? What older node versions don't support it?

  • Would something like getStemmerByLocale.mjs be considered valuable to include in the build

I'm guessing from the name that selects a stemmer based on the current locale setting? That doesn't seem like a useful thing to do as you need to use the same stemming algorithm when searching that you did at index time and that's typically a per-database thing, whereas the locale is per user (and can typically be changed by the user too).

lionel-rowe commented 1 year ago

If we could provide a single output that without real drawbacks that would be simpler. Does ESM work with all browsers?

ESM is supported by all modern browsers (no IE or Opera Mini, of which IE is officially dead and Opera Mini has patchy support for JavaScript in general).

The upgrade for in-browser usage relying on the old global behavior would typically look something like this:

- <script src="./path/to/base-stemmer.js"></script>
- <script src="./path/to/english-stemmer.js"></script>
+ <script type="module">
+ import StemmerEnglish from "./path/to/stemmer-english.mjs"
+ globalThis.StemmerEnglish = StemmerEnglish
+ </script>

Or for NodeJS (ESM) usage relying on the global behavior:

- import "./path/to/base-stemmer.js"
- import "./path/to/stemmer-english.mjs"
+ import StemmerEnglish from "./path/to/stemmer-english.mjs"
+ globalThis.StemmerEnglish = StemmerEnglish

What older node versions don't support it?

In NodeJS, experimental support for ESM was added in 8.5.0 (Sep 2017), with unflagged support added in 13.2.0 (Nov 2019). Current latest is 20.5.1, LTS is 18.17.1. End-of-life for 12.x (most recent LTS without unflagged support) was Apr 2022.

However, newer versions of NodeJS continue to support the old style of "CJS" modules (require, module.exports, etc), which are still pretty widely used in legacy codebases. CJS modules can import ESM modules, but only asynchronously.

Interestingly, NodeJS ESM imports based on the "implicit global polution" pattern don't throw an error (as would be expected in strict-mode) — my guess is that this is a deviation from the spec to support with porting codebases from CJS.

I'm guessing from the name that selects a stemmer based on the current locale setting? That doesn't seem like a useful thing to do as you need to use the same stemming algorithm when searching that you did at index time and that's typically a per-database thing, whereas the locale is per user (and can typically be changed by the user too).

Not exactly — it takes a locale as a parameter, rather than using the current locale setting (though you could also import one based on current locale setting by passing it as a param). My use case is that I want to use any relevant stemmer based on multilingual, user-supplied files, of which the locale is known but varies per file.

mitya57 commented 1 year ago

However, in strict-mode (e.g. a JavaScript module), this throws Uncaught ReferenceError: StemmerEnglish is not defined. This makes it impossible to run on Deno, https://github.com/denoland/deno/issues/5845.

As a starter, can we prefix the declaration with const to fix this error?

lionel-rowe commented 1 year ago

However, in strict-mode (e.g. a JavaScript module), this throws Uncaught ReferenceError: StemmerEnglish is not defined. This makes it impossible to run on Deno, denoland/deno#5845.

As a starter, can we prefix the declaration with const to fix this error?

If you make that change and nothing else, it makes the variable local to the module, so nothing is exported (importing the module becomes a no-op).

I think the most minimal working ESM/strict-mode-compatible version that doesn't pollute the global scope could be generated something like this:

mkdir -p out
cat <(echo "let BaseStemmer\nexport default ") ./javascript/base-stemmer.js  > ./out/base-stemmer.mjs
./snowball algorithms/english.sbl -js -o out/stemmer-english \
  -n "import BaseStemmer from './base-stemmer.mjs'; let StemmerEnglish; export default StemmerEnglish"
mv ./out/stemmer-english.js ./out/stemmer-english.mjs

Works in both NodeJS and Deno:


cat <(echo "import StemmerEnglish from './stemmer-english.mjs'\nconsole.log(new StemmerEnglish().stemWord('location'))") > ./out/main.mjs

echo "result from node: $(node ./out/main.mjs)"
# result from node: locat
echo "result from deno: $(deno run ./out/main.mjs)"
# result from deno: locat
ojwb commented 1 year ago

It seems almost all users should be happy with ESM output (and many probably happier), but it's conceivable a few might still want the current output. Maybe nobody would, but unfortunately it's hard to survey more than a very limited number of users. For anyone still relying on pre-ESM support, it's probably really unhelpful to have a dependency force the timing of when they drop support for older javascript implementations. People in this situation can choose to use with an older Snowball version, but then they don't get other changes from newer versions which they may want.

So I think I'm leaning towards trying to keep the existing output as an option, and if that proves complicated we can reassess.

Thinking about command line option naming we don't have many target-language specific options. There are a few like -eprefix and -vprefix which are specific to C but these are a special case as Snowball originally only targetted C, and -gopackage and -goruntime are specific to Go. There are also some options that are specific to a subset of languages.

With the motivation of trying to keep it clearer what options are relevant to a particular target language, I wonder if we should follow Go's lead of prefixing with the language name, e.g. -jsesm or specify this via an optional argument to the existing option, e.g. -js=esm. The latter would also naturally allow e.g. -js=global to explicitly ask for the current behaviour (maybe there's a better name for it than "global").

ojwb commented 1 year ago

I've pushed changes to add support for -js=esm, which hopefully produces code equivalent to that in your repo. Please give it a go and let me know how it goes. I've not tried running it with deno (doesn't seem to be packaged for Debian, and their own installation instruction lost my interest when they recommended curl ... | sh). If someone can add a CI job using deno I think that would be useful.

I haven't tried to merge in the changes between base-stemmer.js and base-stemmer.mjs yet - it'd really be preferable to just have one version of this, or at least some way to avoid duplicating most of the contents so we don't have to make the same change in two places in the future.

I didn't try to generate code with the various style changes you'd made - they're likely possible but I'd like evidence that there's firm consensus in the JS world about such points or at least a strong argument for why we should produce code that looks a particular way, rather than these just being matters of personal preference. A data point is you've reformatted to use tabs for indenting, but we changed to using spaces for indentation here due to #123. I really don't want to get sucked into acting as a mediator between warring Javascript style factions!

Re: locales, I see from the code you really just mean a way to create a stemmer object given an ISO-639 language code string - that seems fine (and e.g. the C libstemmer can create a stemmer for a given language name or ISO-639 language code string).

lionel-rowe commented 1 year ago

I've pushed changes to add support for -js=esm, which hopefully produces code equivalent to that in your repo. Please give it a go and let me know how it goes.

Nice! I'll give it a try with Deno when I have time.

I haven't tried to merge in the changes between base-stemmer.js and base-stemmer.mjs yet - it'd really be preferable to just have one version of this, or at least some way to avoid duplicating most of the contents so we don't have to make the same change in two places in the future.

I'm not sure if this will be possible without global scope pollution, though there might be a way to check and only pollute global scope if the context isn't ESM.

I really don't want to get sucked into acting as a mediator between warring Javascript style factions!

Nor me — I just formatted that way on my fork as that's what I use as config for my auto-formatter. I wouldn't presume to advocate for that change in the official repo.

ojwb commented 1 year ago

I'm not sure if this will be possible without global scope pollution, though there might be a way to check and only pollute global scope if the context isn't ESM.

One version which dynamically works for both is not the only approach - we could generate one from the other, or generate both variants from some sort of template. That would probably happen at the same stage where we run snowball to generate the JS sources.

I really don't want to get sucked into acting as a mediator between warring Javascript style factions!

Nor me — I just formatted that way on my fork as that's what I use as config for my auto-formatter. I wouldn't presume to advocate for that change in the official repo.

OK, cool.

To be clear I'm OK with people proposing such changes where there's a good reason. If we're expecting people to use the generated code as-is in the browser, formatting which makes it smaller is arguably helpful as it's currently 952K of JS for all of them. Simply s/ /\t/g reduces that to 668K. Also doing s/;$//g (which at least approximates omitting semicolons) to 648K. "Smaller is better" would at least be a fairly objective rule to guide decisions.

However at least with the current size I'd expect people to be running it through a minifier tool - for our own website I'm using closure-compiler which reduces that to 272K. (Actually all the size numbers above include demo.js which implements the website demo but that is only 8K before.)

Perhaps we should include a minification step. I don't really do enough JS to know if that's actually helpful though.

lionel-rowe commented 1 year ago

Perhaps we should include a minification step. I don't really do enough JS to know if that's actually helpful though.

IMO it's not really necessary, as the library doesn't provide any pre-built version. It makes sense to leave minification up to consumers (or downstream providers of pre-built versions), as they already need to produce their own build output. Perhaps stick a note in the README about a suggested way to minify - e.g. something like this would work in Deno (with output being consumable by various environments):

import { transform } from "https://deno.land/x/swc@0.2.1/mod.ts";

const dirName = "javascript";

for await (const f of Deno.readDir(dirName)) {
  const segs = f.name.split(".");
  if (!f.isFile || segs.includes("min") || !segs.at(-1).endsWith("js")) continue;

  await Deno.writeTextFile(
    `${dirName}/${[...segs.slice(0, -1), "min", segs.at(-1)].join(".")}`,
    transform(await Deno.readTextFile(`${dirName}/${f.name}`), { minify: true }).code,
  );
}

One version which dynamically works for both is not the only approach - we could generate one from the other, or generate both variants from some sort of template. That would probably happen at the same stage where we run snowball to generate the JS sources.

From looking into it more, it seems there's no way of properly feature-checking for ESM contexts. import either simply throws a non-catchable syntax error (in non-ESM contexts) or it doesn't (in ESM contexts).


I noticed in your recent changes you added require statements and if (typeof module === 'object' && module.exports) module.exports = ..., which are used for CommonJS (CJS), a legacy type of modules that's still widely used in NodeJS, as for a long time NodeJS didn't support ESM (CJS also predates ESM's existence).

Having thought on it a bit more, here's an outline for build logic I think should work for all 3 of ESM, CJS, and legacy/compatibility global mode:

* "Bare version": just the const <CLASS_NAME> = function () { ... } declaration, without any way of exporting it

** "Relevant boilerplate":

ojwb commented 1 year ago

I noticed in your recent changes you added require statements and if (typeof module === 'object' && module.exports) module.exports = ..., which are used for CommonJS (CJS), a legacy type of modules that's still widely used in NodeJS, as for a long time NodeJS didn't support ESM (CJS also predates ESM's existence).

I pretty much just resolved the issues with the parts of #123 which didn't get applied before because they broke CI, then merged them. If you're only suggesting we support CJS because I merged that, that's not a compelling reason to (I really have little idea what's current in JS, and that PR was originally from 2019). If it's actually still useful that seems reasonable though.

I should be able to easily sort out the codegen tweaks, but it'd be helpful if someone with more idea than me could set up suitable additional CI builds to actually test the three variants. I don't know how feasible it is, but it'd also be useful if we could have a CI job testing the JS code running inside a browser for the variants where that makes sense.

lionel-rowe commented 1 year ago

If it's actually still useful that seems reasonable though.

Results from GitHub code search:

Not completely scientific as import doesn't 100% mean ESM and require doesn't 100% mean CJS, but most of the results seem relevant. So ESM is in the majority, but not overwhelmingly so (except in TypeScript).

I should be able to easily sort out the codegen tweaks, but it'd be helpful if someone with more idea than me could set up suitable additional CI builds to actually test the three variants. I don't know how feasible it is, but it'd also be useful if we could have a CI job testing the JS code running inside a browser for the variants where that makes sense.

In-browser CI can be done with tools like Selenium, Cypress, or Puppeteer, but IMO it's massive overkill for this use case (usually those tools are reserved for things like end-to-end testing of UI interactions). ESM and global versions should work identically in browser, and CJS isn't relevant to browsers.

ojwb commented 1 year ago

Not completely scientific [...]

Also repos for projects that are no longer actively maintained generally persist rather than get removed which will tend to skew towards older ways of doing things, but it does seem like that couldn't reasonably explain all of it.

ESM and global versions should work identically in browser

OK, that seems reasonable then - we can always revisit if we hit a situation which wasn't caught by CI but would have been with in-browser testing.

If I have things straight, CI testing currently essentially tests a kind of hybrid which dynamically supports CJS or global, so I can probably manage to split that into a job for each. If someone can set up CI jobs which test ESM with javascript and typescript that would be helpful, probably using deno since that appears to have a stricter implementation of ESM than node (from comments above).