Feature request: Support reading shp/shx/dbf bundles from a ZIP archive

jonas-p / go-shp

Go library for reading and writing ESRI Shapefiles. Pure Golang implementation based on the ESRI Shapefile technical description.

MIT License

255 stars 67 forks source link

Feature request: Support reading shp/shx/dbf bundles from a ZIP archive #4

Closed fawick closed 7 years ago

fawick commented 8 years ago

I usually encounter shapefiles that come as a ZIP which contains the .shp/.shx/.dbf. It would be great to have the possibility of iterating over the shapes and the database without uncompressing the ZIP beforehand.

This might prove to be non-trivial, as the io.ReadCloser that is returned by zip.File.Open does not support seeking. It might be okay for the shapes, as they are read sequentially, anyway, but the random-access to the DBF will not be possible.

As the typical use case might be iterating over all shapes and their individual attributes, one might restrict the access to the attributes to the ones belonging to the last read shape.

I imagine something like a type ZipReader that has func (z *ZipReader) Next() and func (z *ZipReader) ShapeFields() (row int, shape Shape, fieldValues []string) which can be used in a loop in the same way the methods of shp.Reader are.

Internally Next() would read the .shp one shape and the .dbf one row at a time, with the latter internally buffered as a plain []byte-slice which is later interpreted in the same way by shp.ZipReader.ShapeFields as shp.Reader.ReadAttribute does with the single attributes now.

fawick commented 8 years ago

Alternatively, instead of ShapeFields, the ZipReader might have a Shape() (int, Shape) and a ReadAttributes(field int) string. That would not waste the effort interpreting all the fields that the user will not be interested in later.

jonas-p commented 8 years ago

Definitely looks like I reasonable feature, thanks for the input and suggestions, @fawick! I think a good way of doing this would be to create a SequentialReader interface with the methods Next() (int, error) ReadShape() (Shape, error) and ReadAttributes() []string. This would allow the current Reader to conform to the new interface as well. The ReadAttributes() method would return all attributes to avoid confusion when, for example, a user would try to access attribute n+1 before attribute n and receive an error because going back is illegal.

However, this would probably mean refactoring some of the methods in the current reader. In the implementation as it is now the Next() method reads the Shape to memory, a more idiomatic approach would perhaps be to only jump to the next position and use ReadShape() to fetch it. This would allow for traversing whole shape files without actually reading every object. Returning errors should also be beneficial to allow for more precise error checking. shp.Reader.ReadAttribute will still exist in the current reader for random access, a good idea would also be to implement a random access method to fetch shape objects.

To avoid breaking any current implementation I suggest that the current Shape() method should be rewritten to actually read the shape but still return the same values (i.e. no error checking) but be marked as deprecated in favor of the new ReadShape() method.

I'm currently busy with other projects at the moment but should have some time next week to look at the implementation aspects of this.

fawick commented 8 years ago

Thanks @jonas-p! I think the type SequentialReader interface is a very good idea.

I am going to fork go-shp and try to give the ZipReader a shot. I write the code so that ZipReader implements the SequentialReader interface. In case I make good progress I will come up with a Pull Request for that.

(As another idea: There could be another implementation of SequentialReader that combines three io.ReadCloser instances: func NewFromReadClosers(shp, shx, dbf io.ReadCloser) SequentialReader. This would allow getting the files from someplace else than a disk altogether. This is probably worth an issue on its own, but it might prove to be beneficial even during the implementation of ZipReader, as then NewZipReader would simply open the ZIP, call zip.OpenFile on the compressed .[shp|shx|dxf] to obtain three io.ReadCloser and then call NewFromReadClosers() on them.)

fawick commented 8 years ago

One more thing: Probably shpSequentialReader.Next() needs to be renamed as well, as its contract differs from the existing shp.Reader.Next(). I suggest the name shp.SequentialReader.Advance() and declaring shp.Reader.Next()as deprecated in the same manner as forshp.Reader.Shape()` above.

One could add func Err() error to interface SequentialReader to return the last non-EOF error encountered (as e.g. done by bufio.Scanner.Err()). When paired with an func Index() int that returns the current shape number as shp.Reader.Shape() (int, shp.Shape) does, we could even continue to use func Next() bool which would let the for loops continue to be pretty and idiomatic:

var sr shp.SequentialReader
for sr.Next() {
    s, err := sr.ReadShape()
    a := sr.ReadAttributes()
    handleShapeSomehow(s, a[42])
} 
if err := sr.Err(); err != nil {
    // error handling
}

Next would return true only if there neither was an error when reading the shape nor when reading the attributes. The count could even be handled by the client when needed by manually incrementing an int in the for-loop, removing the need for func Index() int.

fawick commented 8 years ago

Sorry for creating so many responses to the issue, but:

I would like to make an argument for having func Attribute(n int) string and func Fields() []Field in the interface. For convenience, one could then write the accessor for "all fields at the same time" as a function on the interface type instead of a method as part of the interface (which we called shp.Sequential.ReadAttributes() above):

// Attributes() returns all attributes of the shape that sr was last advanced to.
func Attributes(sr SequentialReader)  []string {
    if sr.Err() != nil {
        return nil
    }
    s := make([]string, len(sr.Fields()))
    for i := range s {
        s[i] = sr.ReadAttribute(i)
    }
    return s
}

I imagine that most implementations of SequentialReader will internally store a []byte-slice with all the field values of a the DBF row the corresponds to the last read shape but interpret and parse the sequence of bytes for an attribute n only when Attribute(n) is called.

jonas-p commented 8 years ago

Good points! I actually believe I was looking at bufio.Scanner when creating the lib, so keeping Next() bool for advancement and implementing Err() error is a good idea.

I like the way of handling attributes as well. As long as it makes accessing attributes for the row non-sequentially possible. For example, calling Attribute(2) before Attribute(1) should behave the same as calling Attribute(1) before Attribute(2). This should not be a problem if we keep a []byte-slice internally as you mentioned and populate it "on the fly".

// Example
var dbfRow []byte
var dbfRowFieldsRead int

func Attribute(n int) (string, error) {
    if dbfFile == nil {
       // Return error stating no dbf file is present
    }

    if n > numberOfFields {
        // Return index out of range error
    }

    if n > dbfRowFieldsRead {
        // Parse fields in the dbf file up to (inclusive) n and add them to dbfRow
    }
    // Return the field from dbfRow
    return fieldFromRow(n)
}

func Next() bool {
    dbfRowFieldsRead = 0

    // Advance cursors in the the shp and dbf
    // ...
}

This implementation example should allow convenience methods such as Attributes() ([]string, err) to be easily created. Furthermore, Fields() should also return an error if, for example, no dbf-file is present.

fawick commented 8 years ago

Would Shape() and Attribute(int) really need an error in the returned values? Given that SequentialReader is considered for being used in a for-loop anyway, one could argue that Err() should always be called after a loop and simply set the last error of the SequentialReader during the call to Attribute or Shape or even Fields.

The good thing about that, is that the interface is then already implemented by the current shp.Reader -- expect for Attribute(int) (which itself is trivially implemented by calling the existing random-access shp.Reader.ReadAttributes(row, field) with the current shape index count) and Err().

You can follow my progress here: https://github.com/fawick/go-shp/tree/zipreader.

Here is the interface that I have so far.

// SequentialReader is the interface that allows reading shapes and attributes one after another. It also embeds io.Closer.
type SequentialReader interface {
        // Close() frees the resources allocated by the SequentialReader.
        io.Closer

        // Next() tries to advance the reading by one shape and one attribute row and returns in
        // case no error other than io.EOF was encountered.
        Next() bool

        // Shape returns the index and the last read shape.
        Shape() (int, Shape)

        // Attribute returns the value of the n-th attribute in the current row.
        Attribute(field int) string

        // Fields returns the fields of the database.
        Fields() []Field

        // Err returns the last non-EOF error encountered.
        Err() error
}

I am currently refactoring your test cases TestRead... so that I can reuse all the expected points in tests that do sequential reads.

jonas-p commented 8 years ago

You're right, let's keep it as close to the current implementation as possible to avoid breaking stuff. I was thinking of a use case where you would want to return early but with Err() it's still possible to do so. Anyways, the interface looks good so far!

fawick commented 8 years ago

Just to le you know: I'll be abroad for a couple of days, so progress will be slower. I am going to catch up when I'm back.

It's not much that is missing now, just the internals for the seqReader, the unexported implementation for the SequentialReader which is used internally by ZipReader.

The ZipReader itself is done, including a test case which zips some of the existing test_files and then tests its contents in the same manner as reader_test does. I might refactor the existing tests some more to reduce the amount of double code, if that's alright with you.

jonas-p commented 8 years ago

No worries, great job so far. Sounds like a great idea to refactor some of the test code. I'll probably start working on some random access methods soon, but I'll create a separate issue for that.

fawick commented 7 years ago

Hej!

I found during testing that the test file for the MultiPatch shape type seems to be invalid, i.e. it is lacking the Mmin, Mmax and Marray data, leading to an early io.EOF.

Your test cases did not cover that situation, as Multipatch.read stoically ignores all errors during the binary reading. In https://github.com/fawick/go-shp/commit/bafebaf499c528b58faa4aa8d1545b5ff057bee8 I tool the liberty to augment the Reader and the tests.

I can easily fix the multipatch.shp by adding the missing data in forms of zeros, but wanted to ask whether you would prefer another approach.