Closed fawick closed 7 years ago
Alternatively, instead of ShapeFields
, the ZipReader
might have a Shape() (int, Shape)
and a ReadAttributes(field int) string
. That would not waste the effort interpreting all the fields that the user will not be interested in later.
Definitely looks like I reasonable feature, thanks for the input and suggestions, @fawick! I think a good way of doing this would be to create a SequentialReader
interface with the methods Next() (int, error)
ReadShape() (Shape, error)
and ReadAttributes() []string
. This would allow the current Reader
to conform to the new interface as well. The ReadAttributes()
method would return all attributes to avoid confusion when, for example, a user would try to access attribute n+1
before attribute n
and receive an error because going back is illegal.
However, this would probably mean refactoring some of the methods in the current reader. In the implementation as it is now the Next()
method reads the Shape
to memory, a more idiomatic approach would perhaps be to only jump to the next position and use ReadShape()
to fetch it. This would allow for traversing whole shape files without actually reading every object. Returning errors should also be beneficial to allow for more precise error checking. shp.Reader.ReadAttribute
will still exist in the current reader for random access, a good idea would also be to implement a random access method to fetch shape objects.
To avoid breaking any current implementation I suggest that the current Shape()
method should be rewritten to actually read the shape but still return the same values (i.e. no error checking) but be marked as deprecated in favor of the new ReadShape()
method.
I'm currently busy with other projects at the moment but should have some time next week to look at the implementation aspects of this.
Thanks @jonas-p! I think the type SequentialReader interface
is a very good idea.
I am going to fork go-shp and try to give the ZipReader
a shot. I write the code so that ZipReader
implements the SequentialReader
interface. In case I make good progress I will come up with a Pull Request for that.
(As another idea: There could be another implementation of SequentialReader
that combines three io.ReadCloser
instances: func NewFromReadClosers(shp, shx, dbf io.ReadCloser) SequentialReader
. This would allow getting the files from someplace else than a disk altogether. This is probably worth an issue on its own, but it might prove to be beneficial even during the implementation of ZipReader
, as then NewZipReader
would simply open the ZIP, call zip.OpenFile
on the compressed io.ReadCloser
and then call NewFromReadClosers() on them.)
One more thing: Probably shpSequentialReader.Next()
needs to be renamed as well, as its contract differs from the existing shp.Reader.Next()
. I suggest the name shp.SequentialReader.Advance()
and declaring shp.Reader.Next()as deprecated in the same manner as for
shp.Reader.Shape()` above.
One could add func Err() error
to interface SequentialReader
to return the last non-EOF error encountered (as e.g. done by bufio.Scanner.Err()
). When paired with an func Index() int
that returns the current shape number as shp.Reader.Shape() (int, shp.Shape)
does, we could even continue to use func Next() bool
which would let the for
loops continue to be pretty and idiomatic:
var sr shp.SequentialReader
for sr.Next() {
s, err := sr.ReadShape()
a := sr.ReadAttributes()
handleShapeSomehow(s, a[42])
}
if err := sr.Err(); err != nil {
// error handling
}
Next would return true only if there neither was an error when reading the shape nor when reading the attributes. The count could even be handled by the client when needed by manually incrementing an int
in the for
-loop, removing the need for func Index() int
.
Sorry for creating so many responses to the issue, but:
I would like to make an argument for having func Attribute(n int) string
and func Fields() []Field
in the interface. For convenience, one could then write the accessor for "all fields at the same time" as a function on the interface type instead of a method as part of the interface (which we called shp.Sequential.ReadAttributes()
above):
// Attributes() returns all attributes of the shape that sr was last advanced to.
func Attributes(sr SequentialReader) []string {
if sr.Err() != nil {
return nil
}
s := make([]string, len(sr.Fields()))
for i := range s {
s[i] = sr.ReadAttribute(i)
}
return s
}
I imagine that most implementations of SequentialReader will internally store a []byte
-slice with all the field values of a the DBF row the corresponds to the last read shape but interpret and parse the sequence of bytes for an attribute n only when Attribute(n)
is called.
Good points! I actually believe I was looking at bufio.Scanner
when creating the lib, so keeping Next() bool
for advancement and implementing Err() error
is a good idea.
I like the way of handling attributes as well. As long as it makes accessing attributes for the row non-sequentially possible. For example, calling Attribute(2)
before Attribute(1)
should behave the same as calling Attribute(1)
before Attribute(2)
. This should not be a problem if we keep a []byte
-slice internally as you mentioned and populate it "on the fly".
// Example
var dbfRow []byte
var dbfRowFieldsRead int
func Attribute(n int) (string, error) {
if dbfFile == nil {
// Return error stating no dbf file is present
}
if n > numberOfFields {
// Return index out of range error
}
if n > dbfRowFieldsRead {
// Parse fields in the dbf file up to (inclusive) n and add them to dbfRow
}
// Return the field from dbfRow
return fieldFromRow(n)
}
func Next() bool {
dbfRowFieldsRead = 0
// Advance cursors in the the shp and dbf
// ...
}
This implementation example should allow convenience methods such as Attributes() ([]string, err)
to be easily created. Furthermore, Fields()
should also return an error if, for example, no dbf-file is present.
Would Shape()
and Attribute(int)
really need an error
in the returned values? Given that SequentialReader
is considered for being used in a for
-loop anyway, one could argue that Err()
should always be called after a loop and simply set the last error of the SequentialReader
during the call to Attribute
or Shape
or even Fields
.
The good thing about that, is that the interface is then already implemented by the current shp.Reader
-- expect for Attribute(int)
(which itself is trivially implemented by calling the existing random-access shp.Reader.ReadAttributes(row, field)
with the current shape index count) and Err()
.
You can follow my progress here: https://github.com/fawick/go-shp/tree/zipreader.
Here is the interface that I have so far.
// SequentialReader is the interface that allows reading shapes and attributes one after another. It also embeds io.Closer.
type SequentialReader interface {
// Close() frees the resources allocated by the SequentialReader.
io.Closer
// Next() tries to advance the reading by one shape and one attribute row and returns in
// case no error other than io.EOF was encountered.
Next() bool
// Shape returns the index and the last read shape.
Shape() (int, Shape)
// Attribute returns the value of the n-th attribute in the current row.
Attribute(field int) string
// Fields returns the fields of the database.
Fields() []Field
// Err returns the last non-EOF error encountered.
Err() error
}
I am currently refactoring your test cases TestRead...
so that I can reuse all the expected points in tests that do sequential reads.
You're right, let's keep it as close to the current implementation as possible to avoid breaking stuff. I was thinking of a use case where you would want to return early but with Err()
it's still possible to do so. Anyways, the interface looks good so far!
Just to le you know: I'll be abroad for a couple of days, so progress will be slower. I am going to catch up when I'm back.
It's not much that is missing now, just the internals for the seqReader
, the unexported implementation for the SequentialReader
which is used internally by ZipReader
.
The ZipReader
itself is done, including a test case which zips some of the existing test_files and then tests its contents in the same manner as reader_test does. I might refactor the existing tests some more to reduce the amount of double code, if that's alright with you.
No worries, great job so far. Sounds like a great idea to refactor some of the test code. I'll probably start working on some random access methods soon, but I'll create a separate issue for that.
Hej!
I found during testing that the test file for the MultiPatch shape type seems to be invalid, i.e. it is lacking the Mmin, Mmax and Marray data, leading to an early io.EOF.
Your test cases did not cover that situation, as Multipatch.read stoically ignores all errors during the binary reading. In https://github.com/fawick/go-shp/commit/bafebaf499c528b58faa4aa8d1545b5ff057bee8 I tool the liberty to augment the Reader and the tests.
I can easily fix the multipatch.shp by adding the missing data in forms of zeros, but wanted to ask whether you would prefer another approach.
I usually encounter shapefiles that come as a ZIP which contains the .shp/.shx/.dbf. It would be great to have the possibility of iterating over the shapes and the database without uncompressing the ZIP beforehand.
This might prove to be non-trivial, as the
io.ReadCloser
that is returned byzip.File.Open
does not support seeking. It might be okay for the shapes, as they are read sequentially, anyway, but the random-access to the DBF will not be possible.As the typical use case might be iterating over all shapes and their individual attributes, one might restrict the access to the attributes to the ones belonging to the last read shape.
I imagine something like a
type ZipReader
that hasfunc (z *ZipReader) Next()
andfunc (z *ZipReader) ShapeFields() (row int, shape Shape, fieldValues []string)
which can be used in a loop in the same way the methods ofshp.Reader
are.Internally
Next()
would read the .shp one shape and the .dbf one row at a time, with the latter internally buffered as a plain[]byte
-slice which is later interpreted in the same way byshp.ZipReader.ShapeFields
asshp.Reader.ReadAttribute
does with the single attributes now.