locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0
246 stars, 45 forks

Reading local raster data in windows 10 #356

Open Mmoncadaisla opened 5 years ago

Mmoncadaisla commented 5 years ago

Hello,

I've succeeded in reading the data from the examples on the RasterFrames doc site (e.g. https://rasterframes.io/raster-read.html). However, I'm unable to read the same data when it is stored on my local computer.

I am using Windows 10 and Python 3.7 with the Anaconda distribution and Jupyter Notebook.

I've tried different ways of typing the URI without success. I hope the community can help me out, since I'm totally stuck on the basics here.

Furthermore, I've succeeded in reading the data with GDAL (without pyspark).

Code here: https://pastebin.com/ijiMfhwU

vpipkt commented 5 years ago

I was helping out too and things we tried:

1) Windows Server 2016 EC2 instance without GDAL, with a conda environment. Able to read a local file at 'file:///C:/foo/bar.tif'.
2) Try with gdal:/ (as in the pastebin example).
3) Try the URI as file:///C:/foo/bar etc.

It seems that it may be down to the GDAL reader with Windows local paths. The JVM reader seemed to work, as in item 1 above. It may also be the ENVI format itself.

This one is fairly difficult to debug also just due to my dev environment being mac / linux.

I also somewhat wonder if the issue may be in the GT code?

vpipkt commented 5 years ago

Okay here is my hypothesis and proposed fix(es).

If the user is on Windows and pointing at a local file in such a fashion that would use the GDAL reader, this code will be used to parse the URI string. The comment here claims that VSIPath doesn't like a single-slash file:/path, so the code removes it. But the file:/ (single slash) seems to be exactly what is needed for correct extraction of the Windows file path by the WINDOWS_LOCAL_PATH_PATTERN regex here.

Furthermore if we remove the file:/ from a windows URI the VSIPath's SCHEME_PATTERN regex is going to incorrectly interpret the drive letter as the scheme.

My proposed fixes:

  1. Remove the tweaked logic entirely from GDALRasterSource. That would allow the user to pass in file:/C:\Foo\Bar\file.dat which will result in VSIPath().vsiPath equal to C:\Foo\Bar\file.dat

  2. Change to geotrellis contrib and RF:

    1. Change the WINDOWS_LOCAL_PATH_PATTERN to (?<=(?:(\/){2})).+ to remove the scheme correctly.

    2. change the tweaked logic from the GDALRasterSource . Instead of removing file:/ entirely, replace it with the double slash so: file:/C:\Foo\Bar\file.dat -> file://C:\Foo\Bar\file.dat
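To make the two regex concerns above concrete, here is a small Python sketch. It applies the pattern proposed in fix 2.1 using Python's `re` module, which may differ in detail from the JVM regex engine. `urllib.parse.urlsplit` is used only as a stand-in for a generic scheme parser; it is not the actual VSIPath SCHEME_PATTERN, which is not reproduced here. The helper name `strip_scheme` is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Pattern proposed above for WINDOWS_LOCAL_PATH_PATTERN, copied as written.
PROPOSED = re.compile(r"(?<=(?:(\/){2})).+")


def strip_scheme(uri):
    """Return the text following the first '//' in uri, or None if there is none."""
    match = PROPOSED.search(uri)
    return match.group(0) if match else None


# With a double slash the Windows path is extracted intact.
double_slash = strip_scheme(r"file://C:\Foo\Bar\file.dat")

# With a single slash the proposed pattern finds nothing to match.
single_slash = strip_scheme(r"file:/C:\Foo\Bar\file.dat")

# Stand-in scheme parser (an assumption, not VSIPath): with no scheme
# prefix at all, the drive letter is read as the URI scheme.
bare_scheme = urlsplit(r"C:\Foo\Bar\file.dat").scheme

print(double_slash)
print(single_slash)
print(bare_scheme)
```

This is why fix 2 pairs the regex change with rewriting file:/ to file://: under the proposed pattern, a single-slash URI would otherwise no longer match.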

@metasim who on the GT side do we need to engage?

vpipkt commented 5 years ago

@MiguelNOX I published a snapshot / dev of the branch here built as a whl to the test pypi instance. Can you try installing in your environment?

pip install --extra-index-url https://test.pypi.org/simple/ pyrasterframes==0.8.2.dev0

And let us know if it works with specifying a single slash path like so: spark.read.raster(r'file:/D:\path\to\raster')

Mmoncadaisla commented 5 years ago

@vpipkt After installing it in my environment and removing the older version (I had to remove it for the new one to run properly), I get the following error:

https://pastebin.com/6aaRXHW7

Thank you again for your dedication. I hope we can solve this!

vpipkt commented 5 years ago

Ok I think we now have isolated the following:

1) changes to URI string interpretation discussed above;
2) improved discussion of Windows GDAL installation in the documentation.

Rationale for 2 is seen in the latest attempts by @MiguelNOX to read an ENVI file. The Python session has access to GDAL, yet attempts to read the ENVI file are handled by the JVM GeoTiff reader, meaning (probably) that the underlying call to GDALWarp.get_version_info has failed and GDAL is not available.

Trying to get an output of the same script from @MiguelNOX where we also see the output from pyrasterframes.utils.gdal_version which would be more definitive.

Mmoncadaisla commented 5 years ago

You are totally right: when I try to get the output from pyrasterframes.utils.gdal_version on my local computer, I get not available as output.

EDIT:

I've just tried using pyrasterframes through google colab on this tif file from the RF's doc page: B02.tif

I've succeeded in reading the file with just GDAL through the Colab notebook. I have also tested Spark on it and it works properly. However, I still have trouble reading the file through pyrasterframes, as shown in the pastebin below.

https://pastebin.com/Cuv66r5e

In this case, when running

from pyrasterframes.utils import gdal_version
print(gdal_version())

I get GDAL 2.2.3, released 2017/11/20 as output.

vpipkt commented 5 years ago

Well that is good to know. Hmm @metasim is there someone that @MiguelNOX could reach out to for help with installing the gdalwarp bindings correctly in Windows?

On Tue, Sep 24, 2019 at 3:49 AM MiguelNOX notifications@github.com wrote:

You are totally right, when i try to get the output from pyrasterframes.utils.gdal_version i get not available as output.


metasim commented 5 years ago

@MiguelNOX You could try the GeoTrellis Gitter channel. On Windows, the cause is often that the GDAL shared libraries aren't in the PATH variable.
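One quick way to check the PATH suggestion is to scan each PATH directory for GDAL DLLs from Python. This is only a sketch: the helper name `find_gdal_libraries` is hypothetical, and the `gdal*.dll` file name pattern is an assumption, since the actual library names depend on how GDAL was installed.

```python
import glob
import os


def find_gdal_libraries(path_value=None, pattern="gdal*.dll"):
    """List files matching pattern in each directory of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        # Skip empty entries and directories that do not exist.
        if directory and os.path.isdir(directory):
            hits.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return hits


if __name__ == "__main__":
    found = find_gdal_libraries()
    print(found if found else "No files matching gdal*.dll found on PATH")
```

An empty result does not prove GDAL is missing, only that nothing matching the assumed pattern is visible through PATH in the current process.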

robknapen commented 4 years ago

For reference a short recap on how to read envi files located on a Windows network share, via gdal (2.4.2), in pyrasterframes on macOS (10.15).

Currently on macOS you really need GDAL installed via Homebrew. It does not work with GDAL installed with Anaconda (and probably not with other installation methods either).

Also, for envi files gdal wants the .dat, or .bil file, not the .hdr one.

In our company many users make use of Microsoft Windows and there are standard folder shares. On macOS if you mount those they end up under /Volumes/. With that done it appears to be enough for gdal to reference them as 'file:///Volumes/'

Here is a little (python) example that works in my case:

import pyrasterframes
from pyrasterframes.rasterfunctions import *
from pyrasterframes.utils import create_rf_spark_session

spark = create_rf_spark_session(**{
    'spark.driver.extraJavaOptions': '-Djava.library.path=/Users/.../opt/anaconda3/lib'
})

file = 'file:///Volumes/dfs-root/.../ndvi/2019/ndvi20190717_csa_10m.dat'
# file = 'file:///Volumes/dfs-root/.../Sentinel2/2019/20190717/S2B_L2A_20190717_B01.bil'
df = spark.read.raster(file)

Passing the anaconda lib path to spark might not strictly be needed anymore. @metasim and @vpipkt probably know :-)
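The note above that GDAL wants the .dat or .bil file rather than the .hdr one can be turned into a small helper. This is a sketch only; the function name `envi_data_file` is hypothetical, and it assumes the data file sits next to the header with the same base name.

```python
from pathlib import Path


def envi_data_file(header_path, extensions=(".dat", ".bil")):
    """Return the sibling data file for an ENVI .hdr file, or None if not found."""
    header = Path(header_path)
    for ext in extensions:
        # Same directory and base name as the header, different suffix.
        candidate = header.with_suffix(ext)
        if candidate.is_file():
            return candidate
    return None
```

If the returned path is absolute, calling `.as_uri()` on it gives a file:/// URI of the kind used in the example above.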

vpipkt commented 4 years ago

@robknapen thanks very much for that. And @Mmoncadaisla take a look and see if this is of assistance for you.

Mmoncadaisla commented 4 years ago

Hello @robknapen, thank you very much for your comments. I am running out of time on a project right now, but I will check it out as soon as possible and leave feedback here.

However, i have also installed Ubuntu 18.04 in another pc so that i am able to run pyrasterframes without any trouble in case your solution wouldn't work out for me.

Thank you very much @vpipkt as well for your dedication

tieuthienvn1987 commented 4 years ago

I have an error, too. I read a Landsat 8 image file, then created a catalog and displayed it as follows:

landsat=[r'D:\Graduation_thesis\data\LC08_L1TP_014032_20190720_20190731_01_T1\LC08_L1TP_014032_20190720_20190731_01T1{b}.TIF' for b in bands]
catalog = ','.join(bands) + '\n' + ','.join(landsat)
df = (spark.read.raster(catalog, bands)
display(df)

When I run this code, I get a very long error. The key part is:

Caused by: java.lang.IllegalArgumentException: Illegal character in opaque part at index 2: D:\Graduation_thesis\data\LC08_L1TP_014032_20190720_20190731_01_T1\LC08_L1TP_014032_20190720_20190731_01T1{b}.TIF

Please help me. Many thanks.

metasim commented 4 years ago

@tieuthienvn1987 Should r'D:\Graduation_thesis\data\... be f'D:\Graduation_thesis\data\...?

tieuthienvn1987 commented 4 years ago

@tieuthienvn1987 Should r'D:\Graduation_thesis\data\... be f'D:\Graduation_thesis\data\...?

You mean f'D:\Graduation_thesis\data...'? I used r'D:\Graduation_thesis\data...', but when I run the code I get an error.
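For reference, the difference between the r and f prefixes can be checked in isolation. A plain raw string leaves {b} in the path literally, which matches the {b} visible in the error message above. The sketch below uses hypothetical band names and a hypothetical folder. It also shows one way to turn a Windows path into a file:/// URI with the standard library; whether the reader accepts that URI in a given Windows setup is not confirmed in this thread.

```python
from pathlib import PureWindowsPath

bands = ["B1", "B2"]  # hypothetical band names

# Raw string only: backslashes are kept, but {b} is NOT substituted.
raw = [r"D:\data\scene\img_{b}.TIF" for b in bands]

# Raw f-string: backslashes are kept AND {b} is substituted.
formatted = [rf"D:\data\scene\img_{b}.TIF" for b in bands]

# Convert each Windows path to a file URI with forward slashes.
uris = [PureWindowsPath(p).as_uri() for p in formatted]

print(raw[0])
print(formatted[0])
print(uris[0])
```

A bare D:\... string is not a valid URI because the drive letter is parsed as a scheme, which is consistent with the "Illegal character in opaque part at index 2" message.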