snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License
757 stars 173 forks source link

add Go generator #57

Closed mschoch closed 7 years ago

mschoch commented 7 years ago

initial implementation ported from rust generator primarily focused on getting functionality working not on the best or most performant Go code

seems to work, based on: make check_go and other ad-hoc testing

mschoch commented 7 years ago

What is the preferred style of addressing review comments for this project? Would you like separate commits that can be squashed later? Or force push into single commit now?

ojwb commented 7 years ago

fixup commits is probably most helpful.

mschoch commented 7 years ago

Addressed some of the initial problems. Next I'll be making the larger change to support signed integers throughout.

mschoch commented 7 years ago

I've now updated the implementation to support signed integers (both as a datatype, and internally for tracking cursor/limit/etc as is done in the C impl)

I identified one example program where the cursor can be set negative, and I updated the runtime to behave like the C version (it's not clear if this is well specified or not).

The example I considered is this:

integers ( p1 )

externals ( stem )

define stem as (
backwards(
  tomark -2
  = 'bobo'
  )
)

There may be other operations that need to be updated to behave correctly if the cursor becomes negative (or perhaps this is just considered undefined or poor snowball programming).

ojwb commented 7 years ago

Or perhaps better, use constants MaxInt32 and MinInt32 from the math package instead: https://golang.org/pkg/math/#pkg-constants

ojwb commented 7 years ago

Snowball has the concept of undefined operations (e.g. The effect of obeying among when the preceding substring is not obeyed is undefined), but the manual defines tomark as tomark AE moves c forward to the position given by AE, and when AE is negative we can't move c forward to it so that should probably be a no-op.

That's also how the C backend (which is essentially the reference implementation as it was the original backend) seems to handle it.

ojwb commented 7 years ago

Oh, I see you mean in backwards mode when trying to position c before the start of the string. I think that should signal f - In the case of tomark AE , a similar fail condition occurs as with hop AE. If c is already beyond AE, or if position l is before position AE, the signal is f. and String commands have been described with c to the left of l and moving right. But the process can be reversed. backwards C c and l are swapped over, and c moves left towards l

mschoch commented 7 years ago

Sure, I can use the constants in the math package, but then it complicates the generation because they won't always be imported, and imported-but-not-used packages are an error in Go.

I already avoid conditionally needing the utf8 package by wrapping the method I need in the runtime package. I could do the same for the math symbols. But you seem have something really specific you want.

mschoch commented 7 years ago

Can you explain when you mean when you say, "I'm not aware of a valid way to make the cursor or limit negative"?

In this program:

integers ( p1 )

externals ( stem )

define stem as (
backwards(
  tomark -2
  = 'bobo'
  )
)

My undrstanding is that tomark sets the cursor to -2.

When I look at the C code I see the line:

z->c = -2; /* tomark, line 7 */
ojwb commented 7 years ago
z->c = -2; /* tomark, line 7 */

That looks like a bug in the C backend to me - thanks for pointing it out. There's a gating check generated, but what it checks doesn't match what the docs say:

    if (z->c < -2) return 0;
    z->c = -2; /* tomark, line 7 */

It's more clearly wrong when not in backwards mode:

    if (z->c > 1000000) return 0;
    z->c = 1000000; /* tomark, line 10 */

That checks if c is already beyond AE, but there's no check for if position l is before position AE.

mschoch commented 7 years ago

@ojwb thanks for the info on tomark and hop. So, after re-reading the spec, at a high level, both hop and tomark should have checks ensuring that the new cursor position is valid. And I think if we get this part right, my concerns about signed integers affecting cursor/limit locations should go away.

mschoch commented 7 years ago

I spent some more time looking at hop, the specification and the generated C code.

First, the spec says "moves c AE character positions towards l, but if AE is negative, or if there are less than AE characters between c and l the signal is f"

Sample program:

externals ( stem )

define stem as (
  hop 2
  hop -1
  = 'b'
)

On the input string 'beautiful', this produces 'bb'.

So, first it seems the negative AE is allowed, despite what the spec says.

It does move c "closer" to l by one character, when you realize that closer means farther in the negative direction.

Looking at the generated C code, it appears that if AE is positive, you can't exceed l. And if AE is negative, you can't go less than 0.

Do you agree with the C implementation here? It seems to be safe and generally useful, and the spec just needs slightly different wording.

mschoch commented 7 years ago

I believe I've addressed the issues you've raised so far, and I'm committed to continue working on the tomark and hop issues we've identified, once we agree on what the C version should do.

I've also added a travis CI job as well.

Are there any remaining barriers to getting this merged?

ojwb commented 7 years ago

Thanks for hooking up the CI.

I don't think there are any blockers - I just need to actually sit down and give it one last check over, then merge assuming it looks OK.

Extending the spec to allow hop with a negative argument indeed seems reasonable.