mapbox / mapnik-omnivore

Node module that returns metadata about spatial files.
44 stars 19 forks source link

VRT source filename paths #10

Open GretaCB opened 10 years ago

GretaCB commented 10 years ago

In reference to this issue I ran into Friday, I'm wondering how this potential filepath issue can best be managed/error handled. Right now, the path is relative to the VRT file: <SourceFilename relativeToVRT="1">sample.tif</SourceFilename>

@hrwgc , I spoke with @yhahn about this and he mentioned that in the past this was a pain, because you had to always input the absolute path. Any thoughts on a better way to control the source filename in a vrt?

@springmeyer Is there something that can be done on the mapnik/gdal side?

On the mapnik-omnivore side, a potential strategy to handle this possible file error would look something like...

brandonreavis commented 10 years ago

One thing you might want to look into is the getFileList() function in GDAL.

> var ds2 = gdal.open('test/data/sample.vrt');
> ds.getFileList();
[ 'test/data/sample.vrt',
  '/home/vagrant/repositories/node-gdal/test/data/sample.tif' ]
> var ds2 = gdal.open('test/data/sample_invalid.vrt');
> ds2.getFileList()
[ 'test/data/sample_invalid.vrt' ]

I haven't used it before but it seems like it would help. From the simple test above it seems to turn all relative paths into absolute ones and if the file referenced in the VRT doesn't exist it doesn't return it.

Unfortunately, this doesn't explicitly tell you which filenames are bad, which leaves you still needing to scan through the file to see what files are referenced. Maybe a generic error like "VRT file doesn't reference any existing files on the filesystem" would be fine instead though?

If a VRT file references 10 raster files and 2 filenames are bad, I would imagine getStatistics() would still work and it would just use the 8 valid rasters. For your application you might want to be able to throw an error saying which 2 filenames are bad though instead of pretending they arent there.

hrwgc commented 10 years ago

@GretaCB I'm going to focus on how we handle input files in TileMill -- let me know if your context is different. To my knowledge, relative path vrts fail in tilemill because we symlink the VRT to the tilemill project's layers directory. Relative paths in the symlinked VRT are no longer valid, because mapnik looks for the tiffs in the layers directory, instead of in the symlinked VRT's actual location.

One way to change this would be to treat VRTs in a similar manner to multi-file vector sources, like a shapefile, where each of the component files is symlinked to the layers directory. For a VRT with relative paths, that might look like:

  1. symlink all source tiffs named in source VRT to mapnik layers directory
  2. symlink source vrt.

For a vrt with absolute paths, we'd only need to symlink the source vrt, since the absolute paths would remain valid.

Without making any changes on tilemill/mapnik end, the easiest way to deal with the issue is to always use VRTs with absolute paths. I do this by making vrt with gdalbuildvrt and specifying source images on command line using absolute paths (gdalbuildvrt mynewvrt.vrt /path/to/tiffs/*tif). You can't make the VRT with gdal_translate, as it does not preserve absolute paths (!!!).

yhahn commented 10 years ago

@hrwgc assuming there's no symlinking involved, have you run into any other issues with relative paths?

hrwgc commented 10 years ago

If a VRT references GeoTIFFs with external overviews (which are $tiff.tif.ovr files produced in same directory as $tiff) , mapnik doesn't seem to know they exist or use them. To my knowledge, this behavior occurs regardless of absolute or relative paths in VRT.

springmeyer commented 10 years ago

My guess is that its only tilemill symlinking/millstone that breaks vrt's and not Mapnik. But I'll change my mind if you can provide a testcase showing otherwise @hrwgc :)

hrwgc commented 10 years ago

@springmeyer I'll defer to your expertise here :)

GretaCB commented 10 years ago

Thanks @brandonreavis , ds.getFileList() is a great starting point for handling this error.

getStatistics() still errors out if one or more invalid tiffs exist in the VRT:

Error: /Users/mapbox/dev/mapnik-omnivore/test/data/vrt/sample.vrt, band 1: Failed to compute statistics, no valid pixels found in sampling.

So having just one invalid source in the VRT will cause the entire VRT to be invalid in my case.

[ '/Users/mapbox/dev/mapnik-omnivore/test/data/vrt/sample.vrt',
  '/Users/mapbox/dev/mapnik-omnivore/test/data/vrt/vrtTest/1-projected.tif' ]

Agreed, it would be ideal to list the specific sources that are invalid, but afraid there is no other way (that I can think of) to obtain these filenames except by parsing the xml for all SourceFilename elements and checking fs.exist for each. Also, failing VRT files that have any invalid sources filenames at all.

GretaCB commented 10 years ago

Thanks for your input @hrwgc ! I think regarding mapnik-omnivore, I want to be able to validate the VRT file properly, which would entail handling any errors in relation to mapnik/gdal and provide as much helpful information about the error as possible. And (best case) return metadata about the VRT file.

GretaCB commented 10 years ago

Ran gdalinfo on the vrt I referenced above and got the following:

mapboxs-MacBook-Pro-2:mapnik-omnivore mapbox$ gdalinfo ./test/data/vrt/sample.vrt
Driver: VRT/Virtual Raster
Files: ./test/data/vrt/sample.vrt
       /Users/mapbox/dev/mapnik-omnivore/./test/data/vrt/vrtTest/1-projected.tif
Size is 7976, 7237
Coordinate System is:
PROJCS["WGS 84 / Pseudo-Mercator",
    GEOGCS["WGS 84",
        DATUM["WGS_1984",
            SPHEROID["WGS 84",6378137,298.257223563,
                AUTHORITY["EPSG","7030"]],
            AUTHORITY["EPSG","6326"]],
        PRIMEM["Greenwich",0],
        UNIT["degree",0.0174532925199433],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Mercator_1SP"],
    PARAMETER["central_meridian",0],
    PARAMETER["scale_factor",1],
    PARAMETER["false_easting",0],
    PARAMETER["false_northing",0],
    UNIT["metre",1,
        AUTHORITY["EPSG","9001"]],
    EXTENSION["PROJ4","+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext  +no_defs"],
    AUTHORITY["EPSG","3857"]]
Origin = (6008775.072321455925703,2941109.730532602407038)
Pixel Size = (33.043018308499384,-33.043018308499384)
Corner Coordinates:
Upper Left  ( 6008775.072, 2941109.731) ( 53d58'39.88"E, 25d31'51.64"N)
Lower Left  ( 6008775.072, 2701977.407) ( 53d58'39.88"E, 23d34'38.08"N)
Upper Right ( 6272326.186, 2941109.731) ( 56d20'42.95"E, 25d31'51.64"N)
Lower Right ( 6272326.186, 2701977.407) ( 56d20'42.95"E, 23d34'38.08"N)
Center      ( 6140550.629, 2821543.569) ( 55d 9'41.42"E, 24d33'28.56"N)
Band 1 Block=128x128 Type=Byte, ColorInterp=Gray

So, GDAL itself is only picking up the one existing source 1-projected.tif and ignoring the others.

I then checked out the GDAL source for the GetFileList() function, and looks like the file list relies on this apoOverviews object. After searching the file, I'm not exactly sure where this apoOverviews object is instantiated. I'll keep digging. /cc @springmeyer @brandonreavis

brandonreavis commented 10 years ago

@GretaCB . You're brave haha. I started digging into the source for the same thing just a while ago and ...yeah... it's dense. I haven't come up with anything productive yet either. As far as I can tell, the check for whether or not the file exists is done in vrtsources.cpp file here.

But I am not sure how we could change node-gdal's ds.getFileList() method to return all of filenames mentioned in the VRT without heavily patching the gdal source. The alternative you've mentioned of parsing the file for SourceFilename tags might be the only alternative. I really wish there was a cleaner solution though.

GretaCB commented 10 years ago

@brandonreavis haha, yeah definitely a whirlwind going through the GDAL source code. Since getFileList() is part of the node-gdal API, it might be best to handle this error and parse the XML for the missing sources from within node-gdal. That way, any application accessing the function doesn't have to repeat the same code. I'd be happy to create a branch and throw it together.

brandonreavis commented 10 years ago

Yeah I am totally for that. I talked it over briefly with @brianreavis and we agree that the best way to go about it would probably be to make another function like getMissingFiles().

gdal.Dataset.prototype.getMissingFiles = function() {
    var existing_filenames = this.getFileList();
    var all_filenames = [];
    if (this.driver.description === 'VRT') { 
         //scan through file for <SourceFilename>
         //add filenames that dont exist in existing_filenames
    } 
    return all_filenames;
}
GretaCB commented 10 years ago

Awesome, would this be adding a getMissingFiles() function in C++ or javascript? I was thinking of adding the new function in this class: https://github.com/naturalatlas/node-gdal/blob/master/src/gdal_dataset.cpp

brandonreavis commented 10 years ago

So I changed my mind after looking into it more. It would be far from a simple fix and it would be prone to break with any changes to the GDAL source.

I am just going to run down a list of complications

So there are just too many moving parts, and it seems like it would be beyond the scope of the node-gdal binding to support. It really should be something that the GDAL dudes are taking care of... it's not really the binding's job.

Since god only knows when GDAL will support this functionality, if you decide you really need to know which filenames in a VRT are bad, I think writing a VRT parser purely with JS that lives outside of node-gdal is the way to go for now. Again, I wish there was a better solution.

brandonreavis commented 10 years ago

When you and @brianreavis turned verbose debugging on with gdal.verbose() you both got this message to show up as the first of many errors:

ERROR 4: '/Users/brianreavis/Repositories/node-gdal/test/data/sample.tif' does not exist in the file system,
and is not recognised as a supported dataset name.

The default behavior was to throw the last error, but with the commit above I (crudely) made the get/computeStatistics() methods throw any file errors instead of the other weird errors.

GretaCB commented 10 years ago

Oofdah, yeah that's a lot of hurdles. I updated my local node-gdal master to try out your last commit, but not seeing a difference in the errors. I'm seeing that particular Error 4 error when I comment out gdal quiet();, but it only includes the next invalid source and not all of them. Should there be particular errors I should be seeing based on your change?

I'm not too familiar with C++ and this might be crazy talk, but would there be a way to integrate a separate javascript VRT parser module into the rasterband.cpp file? For example, can a C++ class import a javascript module or somehow use it as a separate library? If so, it would ideally send the file path to the module and then just receive the list of missing filenames to spit back with the error message.

Also, would it be realistic to add this functionality in GDAL's source and do a pull request? cc @springmeyer @yhahn

GretaCB commented 10 years ago

Thinking it's probably best to chill on this issue for now, and maybe return to it later. I'm going to invalidate any VRTs that don't have ALL sources, and include the vague-ish error message that they're missing one or more sources. Thanks for falling down the dark tunnel with me @brandonreavis !

brandonreavis commented 10 years ago

You should be able to have gdal.quiet() uncommented and it should function like this:

Code:

var ds    = gdal.open('test/data/sample_invalid.vrt');
var band  = ds.bands.get(1)
var stats = band.getStatistics(false, true);

Result:

/home/vagrant/repositories/node-gdal/examples/vrt.js:5
var stats = band.getStatistics(false, true);
                 ^
Error: `/home/vagrant/repositories/node-gdal/test/data/sample_invalid.tif' does not exist in the file system,
and is not recognised as a supported dataset name.

    at Object.<anonymous> (/home/vagrant/repositories/node-gdal/examples/vrt.js:5:18)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:902:3

GDAL will only report the first file does not exist error that it notices, and thats the error that should be thrown in JS now. Is it throwing this error instead?

ERROR 1: test/data/sample_invalid.vrt, band 1: Failed to compute statistics, no valid pixels found in sampling.
GretaCB commented 10 years ago

After reinstalling https://github.com/naturalatlas/node-gdal/tarball/master, I'm getting the following error for an invalid VRT when calling band.getStatistics():

[Error: `/Users/mapbox/dev/mapnik-omnivore/test/data/vrt/vrtTest/2-projected.tif' does not exist in the file system, and is not recognised as a supported dataset name.]
brandonreavis commented 10 years ago

About patching the GDAL source: As far as I understand this is the function you would have to patch : VRTSimpleSource:GetFileList().

Simply commenting out these 6 lines would make it work how you want, but it makes node-gdal function differently than everyone else's GDAL code which wouldn't be good. That function is called for each SimpleSource in a VRT, so its not like you could just overload it and call it through the binding with JS. You would have to also overload all of the other VRT getFileList() methods, and yeah so begins the nightmare to support.

About a separate javascript VRT parser: I am not totally sure what you mean here. To make this special VRT parsing ability be a part of node-gdal it would require listing an XML parser as a dependency of node-gdal, so when you do an npm install it also installs all of the bazillion other dependencies that that the XML parser needs.

If you really want to be able to call something like ds.getMissingVRTFiles(), you should make a separate module that patches node-gdal with your functionality like:

var parser = require('somexmlparser');

function getMissingVRTFiles(){
    var driver_name = this.driver.description;
    if(driver_name != 'VRT') return [];

    var filename = this.description;
    //verify its actually a filename
    var missing_files = [];
    //parse the xml 

    return missing_files;
}

module.exports.applyPatch = function(gdal){
    gdal.Dataset.prototype.getMissingVRTFiles = getMissingVRTFiles;
}

then in your mapnik-omnivore module do something like:

var gdal = require('gdal');
require('gdal-patch-getmissingvrtfiles').applyPatch(gdal);
sgillies commented 10 years ago

@brandonreavis @GretaCB Late to the party, but how about on user input of a VRT we just gdal translate to a proper TIFF file and go from there?