locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

How to export the result of "Unsupervised Machine Learning" as GeoTIFFs? #539

Closed JenniferYingyiWu2020 closed 3 years ago

JenniferYingyiWu2020 commented 3 years ago

Hi, I have read the "Visualize Prediction" section under "Unsupervised Machine Learning", and the resulting output is shown below: (image) I have also read the "GeoTIFFs" section of "Writing Raster Data", where the code "spark_df.write.geotiff" exports GeoTIFFs. (image) I would like to export the "Visualize Prediction" result of "Unsupervised Machine Learning" as GeoTIFFs. How can I do that? Could you please give me some suggestions?

vpipkt commented 3 years ago

Is there a specific bug or error that you are encountering?

JenniferYingyiWu2020 commented 3 years ago

Hi, I have used the code “retiled.select('prediction', 'crs', 'extent').write.geotiff(, crs=)” with Supervised Machine Learning. (images) However, the output is completely wrong, as shown below: (image: Figure_1) My modified code is as follows: (image) Could you please suggest why this code generated a meaningless .tiff (shown above), and how I can resolve the issue? Thanks! Note: the image dataset I used for supervised machine learning is "s3://s22s-test-geotiffs/luray_snp/{}.tif".

JenniferYingyiWu2020 commented 3 years ago

Hi, after modifying the code that writes the GeoTIFF as shown below, the output is still completely wrong. (images)

vpipkt commented 3 years ago

Can you share some details about your output if you omit the raster_dimensions parameter?

Perhaps share the output of gdalinfo for the resulting output file? And if it is small enough attach to the issue?

JenniferYingyiWu2020 commented 3 years ago

Hi, I followed your suggestion and omitted the "raster_dimensions" parameter, but then "java.lang.OutOfMemoryError: Java heap space" occurred. I also replaced "raster_dimensions=(558, 507)" with "raster_dimensions=(5580, 5070)" and with "raster_dimensions=(1558, 1507)", and the same "java.lang.OutOfMemoryError: Java heap space" error appeared in both cases. I have created a new project named "rasterframes-GeoTIFFs" on my GitHub page. It contains the "unsupervised machine learning" and "supervised machine learning (https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/blob/main/machine-learning/supervised_machine_learning.py)" code, as well as the error logs (https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/tree/main/error-logs) produced while changing "raster_dimensions". The displayed GeoTIFF output is at: "https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/blob/main/show-output-result/supervised-machine-learning/show.png". (image)

JenniferYingyiWu2020 commented 3 years ago

Hi, lastly, the gdalinfo output for the .tiff generated by "supervised machine learning" is below. In this case the parameter was "raster_dimensions=(558, 507)". (image)

Warning 1: TIFFReadDirectory:Sum of Photometric type-related color channels and ExtraSamples doesn't match SamplesPerPixel. Defining non-color channels as ExtraSamples.

Driver: GTiff/GeoTIFF
Files: /tmp/geotiff-supervised-machine-learning.tif
Size is 558, 507
Coordinate System is:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["unknown"],
        AREA["World"],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Origin = (-78.714123109391835,38.800547298901463)
Pixel Size = (0.002295141985034,-0.001343897129705)
Metadata:
  AREA_OR_POINT=Area
  version=0.9.0
Image Structure Metadata:
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  ( -78.7141231, 38.8005473) ( 78d42'50.84"W, 38d48' 1.97"N)
Lower Left  ( -78.7141231, 38.1191915) ( 78d42'50.84"W, 38d 7' 9.09"N)
Upper Right ( -77.4334339, 38.8005473) ( 77d26' 0.36"W, 38d48' 1.97"N)
Lower Right ( -77.4334339, 38.1191915) ( 77d26' 0.36"W, 38d 7' 9.09"N)
Center      ( -78.0737785, 38.4598694) ( 78d 4'25.60"W, 38d27'35.53"N)
Band 1 Block=256x256 Type=Float64, ColorInterp=Red
  NoData Value=nan
  Metadata:
    RF_COL=prediction
Band 2 Block=256x256 Type=Float64, ColorInterp=Green
  NoData Value=nan
  Metadata:
    RF_COL=red
Band 3 Block=256x256 Type=Float64, ColorInterp=Blue
  NoData Value=nan
  Metadata:
    RF_COL=grn
Band 4 Block=256x256 Type=Float64, ColorInterp=Undefined
  NoData Value=nan
  Metadata:
    RF_COL=blu

(https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/blob/main/show-output-result/supervised-machine-learning/gdalinfo.txt)

vpipkt commented 3 years ago

Please refer to extensive discussion on the Gitter channel for more detail.

Here is the reason you see the "blocks" of output in the GeoTiff written.

At line 101 you join the input catalog named df with the label_df, and so filter only to rows containing these labels.

Here is a visual of an input band (greyscale), the output prediction GeoTiff (viridis color scheme) and the label GeoJSON shapes (pink).

image

The tiles that are written out at line 225 are still filtered by the join with the label_df.
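To make the effect of that join concrete, here is a toy sketch in plain Python (not RasterFrames or Spark code). The tile identifiers, band values, and the `inner_join` helper are all made up for illustration, and the real join in the script is presumably spatial rather than an equality match on a key. The point is only that an inner join against the labels drops every tile that has no label, so anything written from the joined rows covers just those tiles.

```python
# Hypothetical catalog: one record per tile, keyed by a made-up extent id.
catalog = [
    {"extent": "A", "B04": 10},
    {"extent": "B", "B04": 20},
    {"extent": "C", "B04": 30},
    {"extent": "D", "B04": 40},
]

# Hypothetical labels: only tiles B and D intersect a labelled shape.
labels = [
    {"extent": "B", "label": 1},
    {"extent": "D", "label": 2},
]


def inner_join(left, right, key):
    """Keep only left rows that have at least one matching right row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Rows available for training (and, in the original script, for writing).
training = inner_join(catalog, labels, "extent")
training_extents = sorted({row["extent"] for row in training})

# Rows available if the model is applied to the unjoined catalog instead.
all_extents = sorted(row["extent"] for row in catalog)

print(training_extents)  # ['B', 'D']
print(all_extents)       # ['A', 'B', 'C', 'D']
```

Training on the joined rows is fine; the gaps only appear when the same filtered rows are also the ones written out.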

A suggested alternative would be something like the below. A further refinement would be to apply masking to the data, as on lines 113-120 before the join.

from pyrasterframes.rasterfunctions import rf_assemble_tile

# Apply the model to the full catalog (df), not the label-joined subset,
# then reassemble one tile per extent for each output band.
model.transform(df).groupBy('extent', 'crs') \
    .agg(
        rf_assemble_tile('column_index', 'row_index', 'prediction', tile_size, tile_size).alias('prediction'),
        rf_assemble_tile('column_index', 'row_index', 'B04', tile_size, tile_size).alias('red'),
        rf_assemble_tile('column_index', 'row_index', 'B03', tile_size, tile_size).alias('grn'),
        rf_assemble_tile('column_index', 'row_index', 'B02', tile_size, tile_size).alias('blu')
    ) \
    .write.geotiff(outfile, crs=crs, raster_dimensions=(1830//4, 1830//4))

This looks like so:

image

Of course, there is a conceptual problem with downsampling a discrete classification result like this: the result at any pixel location will be somewhat arbitrary. As noted in the docs, the GeoTIFF writer uses bilinear resampling.
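As a rough illustration of why bilinear resampling is awkward for class labels, here is a small plain-Python sketch. It is not the GeoTIFF writer's actual implementation; the `bilinear` and `nearest` helpers and the 2x2 grid of class ids are assumptions made up for the example. Interpolating halfway between class 1 and class 3 produces 2.0, which looks like a valid class id but is not present in any neighbouring pixel, whereas nearest-neighbour sampling always returns one of the original labels.

```python
def bilinear(grid, x, y):
    """Bilinearly interpolate a 2x2 grid at fractional position (x, y) in [0, 1].

    grid is [[top_left, top_right], [bottom_left, bottom_right]].
    """
    (tl, tr), (bl, br) = grid
    top = tl * (1 - x) + tr * x
    bottom = bl * (1 - x) + br * x
    return top * (1 - y) + bottom * y


def nearest(grid, x, y):
    """Return the value of the closest grid cell (ties go to the higher index)."""
    col = min(int(x + 0.5), 1)
    row = min(int(y + 0.5), 1)
    return grid[row][col]


# Hypothetical class ids: left column is class 1, right column is class 3.
class_ids = [[1, 3],
             [1, 3]]

print(bilinear(class_ids, 0.5, 0.5))  # 2.0, a class that no input pixel has
print(nearest(class_ids, 0.5, 0.5))   # 3, one of the original labels
```

If the downsampled prediction band is going to be interpreted as classes, it may be worth checking which values actually appear in the output before trusting it.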