mingnet opened this issue 6 years ago
Hey, @mingnet! As I mentioned in your other issue, I'm really sorry for responding so late :slightly_frowning_face:
The reason you're getting that error is that `LayerWriter.update` fails when trying to update a saved catalog with a layer whose `KeyBounds` fall outside the saved layer's (see here). Unfortunately, this means your implementation won't work, as each layer will have different `KeyBounds`.
The most straightforward way around this would be to read all of your files at once and then save them together as one layer. I know you said that you're working with a small cluster, but if you can show me the script you're using and give me some info about your cluster, I may be able to point out places where you could improve performance. I think we should try this before moving on to the next alternative (which is more involved/complicated).
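If it helps, here is a rough sketch of that approach in plain GeoTrellis Scala (the API that geopyspark wraps). The input and catalog paths, the 512-pixel floating layout, and zoom 0 are all placeholders, and this assumes the stock GeoTrellis 1.x Hadoop readers:

```scala
import geotrellis.spark._
import geotrellis.spark.io._
import geotrellis.spark.io.hadoop._
import geotrellis.spark.io.index.ZCurveKeyIndexMethod
import geotrellis.spark.tiling._
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

// Read every GeoTiff in one pass and persist the mosaic as a single layer,
// so the layer's KeyBounds cover the union of all inputs up front.
def ingestAllAtOnce(layerName: String)(implicit sc: SparkContext): Unit = {
  // Hypothetical input directory containing all of the source tiffs.
  val sourceTiles = HadoopGeoTiffRDD.spatial(new Path("hdfs:///data/tiffs"))

  // Derive the layout and metadata from all of the inputs together.
  val (_, metadata) =
    TileLayerMetadata.fromRdd(sourceTiles, FloatingLayoutScheme(512))
  val tiled = ContextRDD(
    sourceTiles.tileToLayout(metadata.cellType, metadata.layout),
    metadata
  )

  // A single write covers everything; no update calls are needed.
  val writer = HadoopLayerWriter(new Path("hdfs:///catalog")) // hypothetical catalog path
  writer.write(LayerId(layerName, 0), tiled, ZCurveKeyIndexMethod)
}
```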
I have tried another solution, but I still have some problems; maybe you are interested. I tried to generate a global `KeyBounds` at the start by writing another function in `./geopyspark-backend/geotrellis/src/main/scala/geopyspark/geotrellis/io/LayerWriterWrapper.scala`:
```scala
def writeSpatialGlobal(
  layerName: String,
  spatialRDD: TiledRasterLayer[SpatialKey],
  indexStrategy: String
): Unit = {
  val id =
    spatialRDD.zoomLevel match {
      case Some(zoom) => LayerId(layerName, zoom)
      case None       => LayerId(layerName, 0)
    }
  // Index over the entire layout rather than the data's own bounds.
  // SpatialKeys are zero-based, so the maximum valid key is
  // (layoutCols - 1, layoutRows - 1).
  val indexKeyBounds = KeyBounds[SpatialKey](
    SpatialKey(0, 0),
    SpatialKey(
      spatialRDD.rdd.metadata.layout.layoutCols - 1,
      spatialRDD.rdd.metadata.layout.layoutRows - 1
    )
  )
  val indexMethod = getSpatialIndexMethod(indexStrategy)
  val keyIndex = indexMethod.createIndex(indexKeyBounds)
  layerWriter.write(id, spatialRDD.rdd, keyIndex)
}
```
I plan to call this function when processing the first batch, which gives the layer a global `KeyBounds`, and then update it with the data from the other batches. But this function is very, very slow to execute; as a result, it was very difficult to even finish the first batch. I don't know enough about GeoTrellis to understand why: it only generates a different index, so I would expect it to be about as fast as the `writeSpatial` function.
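For the later batches, the idea is to go through `LayerWriter.update` against the globally indexed layer. A minimal sketch of that companion method, assuming the same wrapper context (`layerWriter`, `TiledRasterLayer`) and the GeoTrellis 1.x `update` signature; the name `updateSpatial` is hypothetical:

```scala
// Hypothetical companion to writeSpatialGlobal: later batches reuse the
// key index persisted by the first write, so only an update is issued.
def updateSpatial(
  layerName: String,
  spatialRDD: TiledRasterLayer[SpatialKey]
): Unit = {
  val id =
    spatialRDD.zoomLevel match {
      case Some(zoom) => LayerId(layerName, zoom)
      case None       => LayerId(layerName, 0)
    }
  // update requires the incoming KeyBounds to fall inside the bounds the
  // layer was indexed with, which the global index guarantees here.
  layerWriter.update(id, spatialRDD.rdd)
}
```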
@mingnet I see. Based on the work you showed, it looks like everything should work okay. What backend are you trying to write to? There can be a lot of I/O involved for some of them, which could greatly increase the running time. Other than what I just mentioned, there could be other causes for slowdown, but I won't be able to say for sure without seeing your Python code.
I am trying to ingest a batch of large TIFF images, and my Spark cluster doesn't have much memory or many resources, so I am trying to ingest the images in multiple batches.
My plan is to generate a pyramid of the first TIFF image and write it to disk, then generate a pyramid of the second TIFF image and update it into the same directory.
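In plain GeoTrellis terms, the per-batch flow I have in mind is roughly the sketch below (`layoutScheme`, `writer`, `attributeStore`, and the layer name are placeholders); the update branch is exactly where the error below comes from:

```scala
import geotrellis.raster.resample.Bilinear
import geotrellis.spark._
import geotrellis.spark.io._
import geotrellis.spark.io.index.ZCurveKeyIndexMethod
import geotrellis.spark.pyramid.Pyramid
import geotrellis.spark.tiling.ZoomedLayoutScheme

// One batch: pyramid a tiled layer, then write (first batch) or update
// (later batches) each zoom level of the catalog.
def ingestBatch(
  layer: TileLayerRDD[SpatialKey],
  zoom: Int,
  layoutScheme: ZoomedLayoutScheme,
  writer: LayerWriter[LayerId],
  attributeStore: AttributeStore
): Unit =
  Pyramid.upLevels(layer, layoutScheme, zoom, Bilinear) { (rdd, z) =>
    val id = LayerId("mosaic", z) // hypothetical layer name
    if (attributeStore.layerExists(id))
      writer.update(id, rdd) // fails if rdd's KeyBounds exceed the layer's
    else
      writer.write(id, rdd, ZCurveKeyIndexMethod)
  }
```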
I am trying to add the update code in these files:
./geopyspark-backend/geotrellis/src/main/scala/geopyspark/geotrellis/io/LayerWriterWrapper.scala
./geopyspark/geotrellis/catalog.py
Then I ingest the data like this.
Then it fails with an error.
What should I do? Do you have any advice?