Hey @81662550 there is not enough information to tell much about your case; my extra questions would be:
1) it would be better if you provide the entire typical gdalinfo output;
2) what is this large tiff's layout scheme?

Thanks for your reply.
The gdalinfo is:
Driver: GTiff/GeoTIFF
Files: cs.tif
Size is 190401, 80401
Coordinate System is:
PROJCS["CGCS2000_3_Degree_GK_CM_114E",
GEOGCS["GCS_China_Geodetic_Coordinate_System_2000",
DATUM["D_China_2000",
SPHEROID["CGCS2000",6378137.0,298.257222101]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.017453292519943295]],
PROJECTION["Transverse_Mercator"],
PARAMETER["False_Easting",500000.0],
PARAMETER["False_Northing",0.0],
PARAMETER["Central_Meridian",114.0],
PARAMETER["Scale_Factor",1.0],
PARAMETER["Latitude_Of_Origin",0.0],
UNIT["Meter",1.0],
VERTCS["Yellow_Sea_1985",
VDATUM["Yellow_Sea_1985"],
PARAMETER["Vertical_Shift",0.0],
PARAMETER["Direction",1.0],
UNIT["Meter",1.0]]]
Origin = (391979.950000000011642,3141020.049999999813735)
Pixel Size = (0.100000000000000,-0.100000000000000)
Metadata:
AREA_OR_POINT=Area
Image Structure Metadata:
INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left ( 391979.950, 3141020.050) (112d53'52.65"E, 28d22'47.09"N)
Lower Left ( 391979.950, 3132979.950) (112d53'55.35"E, 28d18'25.96"N)
Upper Right ( 411020.050, 3141020.050) (113d 5'31.88"E, 28d22'52.24"N)
Lower Right ( 411020.050, 3132979.950) (113d 5'34.10"E, 28d18'31.11"N)
Center ( 401500.000, 3137000.000) (112d59'43.49"E, 28d20'39.23"N)
Band 1 Block=190401x1 Type=Byte, ColorInterp=Red
Band 2 Block=190401x1 Type=Byte, ColorInterp=Green
Band 3 Block=190401x1 Type=Byte, ColorInterp=Blue
The size of the single tiff is 46 GB; I merged it from 153 small tiffs. And these are the spark-submit parameters:
spark-submit --master spark://master:7077 \
--class cutTile.PyramidTms \
--num-executors 29 \
--executor-memory 10G \
--executor-cores 5 /home/hadoop/apps/makeTile_jar/makeTile.jar \
--input hdfs://master:9000/cs.tif \
--output hdfs://master:9000/catalog/
@81662550 and what GeoTrellis version do you use?
2.3.1 with Scala 2.11
How does GeoTrellis decide the parallelism when reading a relatively small (6 GB) TIFF with spatialMultiband()? When I process a 6 GB TIFF on HDFS (block size 64 MB), GeoTrellis starts only 16 tasks on two or three workers, while the cluster has 160 cores, so most of the nodes are idle. Forcing a repartition costs a lot of time in the first few stages. What should I do to make full use of the cluster for different sizes of data?
Hey @81662550 from what I see you just have a badly formatted tiff that consists of striped segments. Each GeoTiff consists of segments, which all together can be understood as a GeoTiff segment layout.
We're walking through all the tiffs, collecting metadata, generating small windows and trying to pack segments into the generated windows.
So the answer to the question 'why are there not a lot of windows and why is it so slow' is: because you have a striped tiff.
For 'square windows' it is extremely bad, since to generate a single 256x256 tile you have to fetch 256 of the 190401x1 segments to fill each window. That makes IO extremely inefficient.
I think we finally need to add this to the docs. The same discussion can be found here: https://github.com/locationtech/geotrellis/issues/2819#issuecomment-434360609 (you can read the entire thread, since I explain there what's going on).
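To illustrate, here is a minimal sketch (assuming the GeoTrellis 2.x raster API and a hypothetical local path) that checks how a tiff's segments are laid out before ingesting it; for a 46 GB file the actual re-tiling is more practical with gdal_translate and the TILED=YES creation option than in memory:

import geotrellis.raster.io.geotiff.{MultibandGeoTiff, Striped, Tiled}
import geotrellis.raster.io.geotiff.reader.GeoTiffReader

// Hypothetical local path; for very large files you would rather check with
// gdalinfo, but the GeoTrellis-side check looks like this.
val tiff: MultibandGeoTiff = GeoTiffReader.readMultiband("/data/cs.tif")

tiff.options.storageMethod match {
  case _: Striped =>
    // Every 256x256 window has to touch ~256 of the 190401x1 strips.
    println("striped segment layout: re-tile before ingesting")
  case Tiled(cols, rows) =>
    println(s"tiled segment layout: ${cols}x${rows} blocks")
}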
I would also like to see the stages (not jobs) tab of the completed job.
Also I would like to see the actual job you're running (cutTile.PyramidTms), since you definitely don't load the entire cluster, and this can also be a job implementation story.
@pomadchin Thanks for your suggestion.
All the workers are busy after I set partitionBytes to a smaller size, for example 16 MB, while reading the tiff file. Perhaps for this cluster, the default of 128 MB produces too few chunks to make full use of all resources.
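For reference, a minimal sketch of what that might look like with the GeoTrellis 2.x HadoopGeoTiffRDD API (an assumption about the reading code, since the actual job isn't shown; it expects an implicit SparkContext in scope, and the path is the one from the spark-submit command above):

import geotrellis.raster.MultibandTile
import geotrellis.spark.io.hadoop.HadoopGeoTiffRDD
import geotrellis.vector.ProjectedExtent
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

// Window the tiff into <= 256x256 chunks and pack roughly 16 MB of windows
// per Spark partition instead of the default ~128 MB, so more tasks (and
// therefore more executors) get work.
val options = HadoopGeoTiffRDD.Options.DEFAULT.copy(
  maxTileSize    = Some(256),
  partitionBytes = Some(16L * 1024 * 1024)
)

val source: RDD[(ProjectedExtent, MultibandTile)] =
  HadoopGeoTiffRDD.spatialMultiband(new Path("hdfs://master:9000/cs.tif"), options)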
When I translate the tiff to 256x256 blocks with gdal_translate, it saves about 25% of the time (155 seconds) for generating tiles up to level 18. Following is the stages tab for processing a 6.1 GB file.
However, the other implementation of the algorithm, with MPI and GDAL, costs only 30 seconds. Is there something wrong? Do you have any suggestions to improve it?
Maybe GeoTrellis is more suitable for larger files.
Driver: GTiff/GeoTIFF
Files: small_cs_block.tif
Size is 50401, 40401
Coordinate System is:
PROJCS["CGCS2000_3_Degree_GK_CM_114E",
GEOGCS["GCS_China_Geodetic_Coordinate_System_2000",
DATUM["D_China_2000",
SPHEROID["CGCS2000",6378137.0,298.257222101]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.017453292519943295]],
PROJECTION["Transverse_Mercator"],
PARAMETER["False_Easting",500000.0],
PARAMETER["False_Northing",0.0],
PARAMETER["Central_Meridian",114.0],
PARAMETER["Scale_Factor",1.0],
PARAMETER["Latitude_Of_Origin",0.0],
UNIT["Meter",1.0],
VERTCS["Yellow_Sea_1985",
VDATUM["Yellow_Sea_1985"],
PARAMETER["Vertical_Shift",0.0],
PARAMETER["Direction",1.0],
UNIT["Meter",1.0]]]
Origin = (397979.950000000011642,3131020.049999999813735)
Pixel Size = (0.100000000000000,-0.100000000000000)
Metadata:
AREA_OR_POINT=Area
Image Structure Metadata:
INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left ( 397979.950, 3131020.050) (112d57'36.15"E, 28d17'24.04"N)
Lower Left ( 397979.950, 3126979.950) (112d57'37.43"E, 28d15'12.82"N)
Upper Right ( 403020.050, 3131020.050) (113d 0'41.09"E, 28d17'25.41"N)
Lower Right ( 403020.050, 3126979.950) (113d 0'42.30"E, 28d15'14.19"N)
Center ( 400500.000, 3129000.000) (112d59' 9.24"E, 28d16'19.12"N)
Band 1 Block=256x256 Type=Byte, ColorInterp=Red
Band 2 Block=256x256 Type=Byte, ColorInterp=Green
Band 3 Block=256x256 Type=Byte, ColorInterp=Blue
@81662550 I think you can speed up the CutTiles step and the reproject step by increasing the number of partitions (see the sketch at the end of this comment).
You also don't need such a large cluster according to these stats; you can allocate more tiny executors with a single core each to see what happens.
However, the actual use case is to ingest a folder of hundreds of tiffs into the storage; the tiffs should not be large, but the folder can be.
By ingesting tiny things that can be handled in the memory of a single machine (as in your case) you introduce IO overhead: nodes load data in a distributed fashion and then perform operations at scale, which can take longer since shuffles may happen, nodes need to communicate, etc. Distributed systems are not magic; the network is slower than the memory of a single machine.
Also, what is the timeline, and what does '26 layers' in the initial post mean? 26 tiffs, 26 GeoTrellis layers with zoom levels, or 26 GeoTrellis layers without taking zoom levels into account?
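A minimal sketch of that partitioning change, assuming the GeoTrellis 2.x API; `source` and `rasterMetaData` stand in for the inputs of the tiling step, and the default partition count is a hypothetical starting point:

import geotrellis.raster._
import geotrellis.raster.resample.Bilinear
import geotrellis.spark._
import geotrellis.spark.tiling._
import geotrellis.vector.ProjectedExtent
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// CutTiles with an explicit partitioner instead of inheriting the (small)
// partitioning of the RDD read from HDFS; a few times the total core count
// is a common starting point for numPartitions.
def cutTiles(
  source: RDD[(ProjectedExtent, MultibandTile)],
  rasterMetaData: TileLayerMetadata[SpatialKey],
  numPartitions: Int = 320
): RDD[(SpatialKey, MultibandTile)] =
  source.tileToLayout(
    rasterMetaData,
    Tiler.Options(resampleMethod = Bilinear, partitioner = Some(new HashPartitioner(numPartitions)))
  )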
@pomadchin Thanks for your suggestion, I'll try it. Thank you for your wonderful GeoTrellis.
I am sorry for the ambiguity. The zoom level is 20.
@81662550 You mention a faster solution with MPI and Mapnik. Can you share more about that or the code? Looking to do something similar.
Closing this one since it is not an issue; however, you are welcome to reopen it.
Hi Grigory, I generated the geotiffs using the "-co TILES=YES" option to avoid long stripes in my images, but I am still getting very long running times. These are my settings:
// Number of Spark Partitions. Recommended to go 5-10x the number of cores
val kSparkPartitions = 80
// Tile Size in pixels. 256 is pretty standard for visual tiles
val kTileSize = 256
// Since the data is integral, we use Nearest Neighbor resampling method.
// For float values, we prefer Bilinear.
val kResampleMethod = NearestNeighbor
// Maximum zoom for which to paint the layers. Each zoom level is an order
// of magnitude larger than the last. 13 has a scale of 1:70K, appropriate
// for representing villages and suburbs.
val kMinZoom = 8
val kMaxZoom = 13
val conf =
new SparkConf()
.setMaster("local[8]")
.setAppName("Spark Tiler")
implicit val sc = new SparkContext(conf)
// `geoTiffFiles` is the list of files containing those 2000 images I mentioned above
val geoTiffRDDs = sc.union(geoTiffFiles.map(f => sc.hadoopMultibandGeoTiffRDD(f.getAbsolutePath)))
val (_, rasterMetaData) = CollectTileLayerMetadata.fromRDD(geoTiffRDDs, FloatingLayoutScheme(256))
// The Spark program will take a very long time computing just the metadata
// println(rasterMetaData)
// System.exit(0)
val tiled: RDD[(SpatialKey, MultibandTile)] =
geoTiffRDDs.tileToLayout(rasterMetaData, Bilinear)
// Layer Name which will be used for reading the ingested data layer as well
// as the directory for tiles to be written in.
val kLayerName = "nlcd-pennsylvania"
// Path to local tiles directory, used for writing the tile images
val localTilesPath = new java.io.File(
"/data/dump/test_geotrellis/tiles/"
).getAbsolutePath
// Set zoomed layout scheme with projection and tile size
val layoutScheme = ZoomedLayoutScheme(WebMercator, tileSize = kTileSize)
// We need to reproject the tiles to WebMercator
val (zoom, reprojected): (Int, RDD[(SpatialKey, MultibandTile)] with Metadata[TileLayerMetadata[SpatialKey]]) =
MultibandTileLayerRDD(tiled, rasterMetaData)
.reproject(WebMercator, layoutScheme, Bilinear)
val pathTemplate = s"$localTilesPath/$kLayerName/{z}/{x}/{y}.png"
Pyramid.levelStream(
reprojected,
layoutScheme,
zoom,
// kMinZoom,
// kMaxZoom,
kResampleMethod
).foreach { case (z, levelLayer) =>
// For each zoom level layer, find the spatial key to z/x/y path
val paintedLayerId = LayerId(kLayerName, z)
val keyToPath = SaveToHadoop.spatialKeyToPath(paintedLayerId, pathTemplate)
// Paint the layer values with the color map from above and save
levelLayer
.mapValues(_.band(0).renderPng().bytes)
.saveToHadoop(keyToPath)
}
}
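A side note on the reading step above: instead of unioning ~2000 per-file RDDs, the whole directory can be read in one call and windowed. A minimal sketch assuming the GeoTrellis 2.x HadoopGeoTiffRDD API, the implicit SparkContext already defined above, and a hypothetical input directory:

import geotrellis.raster.MultibandTile
import geotrellis.spark.io.hadoop.HadoopGeoTiffRDD
import geotrellis.vector.ProjectedExtent
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

// Hypothetical directory holding the ~2000 tiffs; one call reads every file
// under the path and windows each (already tiled) tiff into <= 256x256 chunks.
val geoTiffRDDs: RDD[(ProjectedExtent, MultibandTile)] =
  HadoopGeoTiffRDD.spatialMultiband(
    new Path("file:///data/dump/test_geotrellis/input/"),
    HadoopGeoTiffRDD.Options.DEFAULT.copy(maxTileSize = Some(256))
  )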
Many thanks in advance!
Hi, everyone. I am using GeoTrellis to generate TMS tiles from GeoTiffs. As input I've taken, respectively, a collection of high-resolution GeoTiffs (about 200 tiffs) and a large merged GeoTiff file (about 50 GB, with an HDFS block size of 128 MB). The tiles are saved to HDFS with the saveToHadoop function. However, using the Pyramid.levelStream function, it takes approximately an hour to generate all 20 layers of 256x256 tiles. The cluster consists of 10 nodes, and each node has 64 GB of main memory.
Then we implemented the same algorithm with MPI and with Mapnik, respectively. They take less than fifteen minutes to generate all the tiles.
According to your tests, is this result expected, or is our usage incorrect?
The key code is as follows:
Thanks very much!