go-spatial / geom

Geometry interfaces to help drive interoperability within the Go geospatial community
MIT License
168 stars 37 forks source link

wkt_decode: readWhitespace does not skip all white space. #87

Open gdey opened 4 years ago

gdey commented 4 years ago

readWhitespace assumes all text is ASCII. The function should decode runes correctly and then check to see if they are spaces.

Can not test runes this way:

https://github.com/go-spatial/geom/blob/master/encoding/wkt/wkt_decode.go#L55-L57

You need to read all the bytes in the run before testing it.

Here is a test case with the failures:


func TestReadWhitespace(t *testing.T) {
    type tcase struct {
        in    string
        match bool
    }
    fn := func(tc tcase) func(*testing.T) {
        decoder := NewDecoder(strings.NewReader(tc.in + "."))

        return func(t *testing.T) {
            match, err := decoder.readWhitespace()
            if err != nil {
                t.Errorf("error, expected nil got: %v", err)
                return
            }
            if match != tc.match {
                t.Errorf("match, expected %t, got %t", tc.match, match)
            }
        }
    }

    tests := map[string]tcase{
        "regular0":     tcase{in: " ", match: true},
        "tab":          tcase{in: "\t", match: true},
        "newline":      tcase{in: "\n", match: true},
        "return":       tcase{in: "\r", match: true},
        "regular":      tcase{in: "\u0020", match: true},
        "nobreak":      tcase{in: "\u00A0", match: true},
        "ogham mark":   tcase{in: "\u1680", match: true},
        "en quad":      tcase{in: "\u2000", match: true},
        "em quad":      tcase{in: "\u2001", match: true},
        "en":           tcase{in: "\u2002", match: true},
        "em":           tcase{in: "\u2003", match: true},
        "three-per-em": tcase{in: "\u2004", match: true},
        "four-per-em":  tcase{in: "\u2005", match: true},
        "six-per-em":   tcase{in: "\u2006", match: true},
        "figure":       tcase{in: "\u2007", match: true},
        "punctuation":  tcase{in: "\u2007", match: true},
        "thin":         tcase{in: "\u2009", match: true},
        "hair":         tcase{in: "\u200A", match: true},
        "nnbsp":        tcase{in: "\u202F", match: true},
        "mmsp":         tcase{in: "\u205F", match: true},
        "ideographic":  tcase{in: "\u3000", match: true},
    }
    for key, test := range tests {
        t.Run(key, fn(test))
    }

}
ear7h commented 4 years ago

Do you have a WKT string test case for this?

As far as I've read, the spec does not specify any unicode characters, which is why the Decoder can/does read bytes and not runes.

I could add a check to ensure the characters are within a valid ascii range in readByte

gdey commented 4 years ago

We are reading in UTF8 text, the spec does not say that it must be ASCII; so we should not assume ASCII only spaces. The other characters are unlikely to have an issue with UTF8 as they are equivalent, so I did not call it out -- though the errors can be funky.