pangeo-data / pangeo-rema

Project with Jonny analyzing REMA data
Apache License 2.0
0 stars 1 forks source link

Create a STAC catalog for the REMA COG data #2

Open rabernat opened 5 years ago

rabernat commented 5 years ago

Once the data has been converted to COG (see #1), it needs to be uploaded to the cloud and cataloged. The uploading itself is easy. The cataloging is the hard part/

Probably the best way to do the catalog is with the spatiotemporal asset catalog specification: https://github.com/radiantearth/stac-spec/

This basically means creating a bunch of .json files which describe the data collection. There are various automated tools and scripts to generate these: https://github.com/radiantearth/stac-spec/blob/master/implementations.md#ecosystem

rabernat commented 5 years ago

Some background on cloud storage:

Our bucket is https://console.cloud.google.com/storage/browser/pangeo-pgc?project=pangeo-181919&folder&organizationId=819335046878

rabernat commented 5 years ago

Project is pangeo-181919

rabernat commented 5 years ago

Command to upload (recusive)

gsutil -m cp -r /path/to/data gs://pangeo-pgc/
rabernat commented 5 years ago

Check this out! It works! http://www.cogeo.org/map/#/url/https%3A%2F%2Fstorage.googleapis.com%2Fpangeo-pgc%2F38_48_8m_dem_COG_RAW.tif/center/66.774,-72.615/zoom/7

jkingslake commented 5 years ago

@jjspergel the code you sent me for producing the json files is running. It works with only one copy of the satstac.py file sat in the '8m' directory if i use the following

forfiles /s /m *_COG_LZW.tif /c "cmd /c python ../../satstac.py @path @file"

One thing, though is the file structure may not be exactly as I put them up in the bucket. Looking at this example: 09_39_8m_dem_COG_LZW.txt It looks like it is https://storage.googleapis.com/pangeo-pgc/09_39_8m_dem_COG_LZW.json when it should be https://storage.googleapis.com/pangeo-pgc/8m/09_39_8m_dem_COG_LZW.json

https://console.cloud.google.com/storage/browser/pangeo-pgc?project=pangeo-181919&folder&organizationId=819335046878

i will try to edit the satstac.py file and rerun i think.

jjspergel commented 5 years ago

Sorry @jkingslake , but yes I think that should be an easy fix - just edit the "link_pref" to include the subfolder "8m". I also would recommend renaming the py file, my compiler got confused later because both the module and the code I wrote are called satstac.

jkingslake commented 5 years ago

Done. Where should the json files need to sit? I suspect the answer is that it doesnt matter, but just to check before I upload them.

jjspergel commented 5 years ago

I'm working right now on how we want to combine them into a json catalog. I don't think it really matters, just as long as they can be easily accessed by the code that will combine/index them all.

On Wed, Apr 3, 2019 at 3:37 PM jkingslake notifications@github.com wrote:

Done. Where should the json files need to sit? I suspect the answer is that it doesnt matter, but just to check before I upload them.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rabernat/pangeo-rema/issues/2#issuecomment-479629563, or mute the thread https://github.com/notifications/unsubscribe-auth/Ao-5EZFtUJezbkyzdL8QioFMaRMn9FBDks5vdQL1gaJpZM4b8NLw .

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

jjspergel commented 5 years ago

I would hold off on uploading the .json files. I'm having some issues with combining them. They might need a little bit of tweaking to get the correct syntax and format to work with the code. Sorry for the inconvenience.

On Wed, Apr 3, 2019 at 3:41 PM Julian Jacob Spergel jjs2268@columbia.edu wrote:

I'm working right now on how we want to combine them into a json catalog. I don't think it really matters, just as long as they can be easily accessed by the code that will combine/index them all.

On Wed, Apr 3, 2019 at 3:37 PM jkingslake notifications@github.com wrote:

Done. Where should the json files need to sit? I suspect the answer is that it doesnt matter, but just to check before I upload them.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rabernat/pangeo-rema/issues/2#issuecomment-479629563, or mute the thread https://github.com/notifications/unsubscribe-auth/Ao-5EZFtUJezbkyzdL8QioFMaRMn9FBDks5vdQL1gaJpZM4b8NLw .

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

rabernat commented 5 years ago

Glad to see stuff happening without me!

Just want to remind you that you can use this git repo to share the scripts you are using to generate the data, produce stac, etc (rather than emailing python files around).

jjspergel commented 5 years ago

It seems like the satstac program can only catalog these json files if there's no "link" section of the .json file already, so we'll need to remove that from the json code and remake all the .json files. I'm still working on getting them to catalog/combine properly.

The issue right now is that the automatically generated links look like: 'https://storage.googleapis.com/pangeo-pgc/8m\\stac-catalog\\34_39_8m_dem_COG_LZW.json'

which I think will create problems when uploaded. I'll keep you updated.

jjspergel commented 5 years ago

Hi Jonny,

I forget how long it took to make .json files last time, but if it's a long time, would it be possible to run the new code overnight tonight? The only change is the removal of the link-forming section. As before use:

forfiles /s /m *_COG_LZW.tif /c "cmd /c python ../../satstac_jsoncode.py @path https://github.com/path @file https://github.com/file

Let me know if you can run this tonight or not. No real rush, I just want to schedule work this week. Thanks!

On Wed, Apr 3, 2019 at 3:37 PM jkingslake notifications@github.com wrote:

Done. Where should the json files need to sit? I suspect the answer is that it doesnt matter, but just to check before I upload them.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rabernat/pangeo-rema/issues/2#issuecomment-479629563, or mute the thread https://github.com/notifications/unsubscribe-auth/Ao-5EZFtUJezbkyzdL8QioFMaRMn9FBDks5vdQL1gaJpZM4b8NLw .

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

def satstac(rasterPath,rasterName):

from satstac import Catalog, Collection, Item
import xarray as xr
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import rasterio
import rasterio.features
import rasterio.warp
from shapely import geometry
import json
import cartopy

class REMAStacItem:

    def __init__(self,rasterPath, rasterName):

        self.rasterPath = rasterPath
        self.id      = rasterName
        self.stac_item   = {"id": rasterName[:-4],
                            "rownum": str(rasterName)[0:2],
                            "colnum": str(rasterName)[3:5],
                            "type": "Feature",
                            "bbox": self.calcBBox(),
                            "geometry": self.calcGeometry(),  
                            "units": 'meters',
                            "resolution": 8
                           }

        self.stac_item['properties']=self.createProperties_meta()
        self.stac_item['links'] = self.create_links()

    def calcBBox(self):

        with rasterio.open(self.rasterPath) as dataset:
            bbox = rasterio.warp.transform_bounds(dataset.crs, 'EPSG:4326',dataset.bounds.left,dataset.bounds.bottom,
                                          dataset.bounds.right,dataset.bounds.top)

        #self.bbox=json.loads(json.dumps(geometry.mapping(bbox)))

        return json.loads(json.dumps(bbox))   

    def calcGeometry(self):

        with rasterio.open(self.rasterPath) as dataset:
            profile = dataset.profile
                # Read the dataset's valid data mask as a ndarray.
            geom = geometry.box(*rasterio.warp.transform_bounds(dataset.crs, 'EPSG:4326',dataset.bounds.left,dataset.bounds.bottom,
                                          dataset.bounds.right,dataset.bounds.top))

        self.geometry=json.loads(json.dumps(geometry.mapping(geom)))

        return json.loads(json.dumps(geometry.mapping(geom)))

    #def createAssetList(self):
     #   pass

    def createProperties_meta(self):
        meta_prop_dict = {}
        with rasterio.open(self.id) as dataset:
                meta_prop_dict = dataset.meta
                crs = meta_prop_dict['crs']
                crs = 'EPSG:'+str(crs.to_epsg())
                meta_prop_dict['crs'] = crs
                profile = dataset.profile
        return meta_prop_dict

     #   if imdPath:
      #      o = urlparse(imdPath)
       #     if o.scheme == 's3':
        #        s3 = boto3.resource("s3")
         #       bucket = o.netloc
          #      key = o.path.lstrip('/')
           #     obj = s3.Object(bucket, key)
            #    body = obj.get()['Body'].read()
             #   root = ET.fromstring(body)

            #else:

             #   tree = ET.parse(imdPath)
              #  root = tree.getroot()

            #self.metaDataStruct = iterate_OverXML(root, recursionTagList=['IMD', 'IMAGE'])
        #elif vrtPath:
         #   with rasterio.open(vrtPath) as src:
          #      self.metaDataStruct = process_nitf_tags(src.tags())

       # else:
        #    print("no Data Specified")
        #self.eoDict = self.processMetaData_To_Properties(self.metaDataStruct, self.provider, self.license)
        #return json.loads(json.dumps(meta_prop_dict))

    def write_toJSON(self, filename):
        with open(filename, 'w') as fp:
            json.dump(self.stac_item, fp, indent=2)

myStacItem = REMAStacItem(rasterPath,rasterName)
filename = myStacItem.stac_item['id']+'.json'
myStacItem.write_toJSON(filename)
jkingslake commented 5 years ago

Julian, Yes, I can do this. Although you will have to remind me the details. Jonny

On Mon, Jun 10, 2019 at 3:56 PM jjspergel notifications@github.com wrote:

Hi Jonny,

I forget how long it took to make .json files last time, but if it's a long time, would it be possible to run the new code overnight tonight? The only change is the removal of the link-forming section. As before use:

forfiles /s /m *_COG_LZW.tif /c "cmd /c python ../../satstac_jsoncode.py @path https://github.com/path @file https://github.com/file

Let me know if you can run this tonight or not. No real rush, I just want to schedule work this week. Thanks!

On Wed, Apr 3, 2019 at 3:37 PM jkingslake notifications@github.com wrote:

Done. Where should the json files need to sit? I suspect the answer is that it doesnt matter, but just to check before I upload them.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/rabernat/pangeo-rema/issues/2#issuecomment-479629563 , or mute the thread < https://github.com/notifications/unsubscribe-auth/Ao-5EZFtUJezbkyzdL8QioFMaRMn9FBDks5vdQL1gaJpZM4b8NLw

.

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

def satstac(rasterPath,rasterName):

from satstac import Catalog, Collection, Item import xarray as xr import pandas as pd import numpy as np from matplotlib import pyplot as plt import rasterio import rasterio.features import rasterio.warp from shapely import geometry import json import cartopy

class REMAStacItem:

def init(self,rasterPath, rasterName):

self.rasterPath = rasterPath self.id = rasterName self.stac_item = {"id": rasterName[:-4], "rownum": str(rasterName)[0:2], "colnum": str(rasterName)[3:5], "type": "Feature", "bbox": self.calcBBox(), "geometry": self.calcGeometry(), "units": 'meters', "resolution": 8 }

self.stac_item['properties']=self.createProperties_meta() self.stac_item['links'] = self.create_links()

def calcBBox(self):

with rasterio.open(self.rasterPath) as dataset: bbox = rasterio.warp.transform_bounds(dataset.crs, 'EPSG:4326',dataset.bounds.left,dataset.bounds.bottom, dataset.bounds.right,dataset.bounds.top)

self.bbox=json.loads(json.dumps(geometry.mapping(bbox)))

return json.loads(json.dumps(bbox))

def calcGeometry(self):

with rasterio.open(self.rasterPath) as dataset: profile = dataset.profile

Read the dataset's valid data mask as a ndarray.

geom = geometry.box(*rasterio.warp.transform_bounds(dataset.crs, 'EPSG:4326',dataset.bounds.left,dataset.bounds.bottom, dataset.bounds.right,dataset.bounds.top))

self.geometry=json.loads(json.dumps(geometry.mapping(geom)))

return json.loads(json.dumps(geometry.mapping(geom)))

def createAssetList(self):

pass

def createProperties_meta(self): meta_prop_dict = {} with rasterio.open(self.id) as dataset: meta_prop_dict = dataset.meta crs = meta_prop_dict['crs'] crs = 'EPSG:'+str(crs.to_epsg()) meta_prop_dict['crs'] = crs profile = dataset.profile return meta_prop_dict

if imdPath:

o = urlparse(imdPath)

if o.scheme == 's3':

s3 = boto3.resource("s3")

bucket = o.netloc

key = o.path.lstrip('/')

obj = s3.Object(bucket, key)

body = obj.get()['Body'].read()

root = ET.fromstring(body)

else:

tree = ET.parse(imdPath)

root = tree.getroot()

self.metaDataStruct = iterate_OverXML(root, recursionTagList=['IMD',

'IMAGE'])

elif vrtPath:

with rasterio.open(vrtPath) as src:

self.metaDataStruct = process_nitf_tags(src.tags())

else:

print("no Data Specified")

self.eoDict = self.processMetaData_To_Properties(self.metaDataStruct,

self.provider, self.license)

return json.loads(json.dumps(meta_prop_dict))

def write_toJSON(self, filename): with open(filename, 'w') as fp: json.dump(self.stac_item, fp, indent=2)

myStacItem = REMAStacItem(rasterPath,rasterName) filename = myStacItem.stac_item['id']+'.json' myStacItem.write_toJSON(filename)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rabernat/pangeo-rema/issues/2?email_source=notifications&email_token=ALTXJ3L6SAINUDNEUULRPXTPZ2WVLA5CNFSM4G7Q2LYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXLBUHA#issuecomment-500570652, or mute the thread https://github.com/notifications/unsubscribe-auth/ALTXJ3IQ7R64NPJNBEGBGDLPZ2WVLANCNFSM4G7Q2LYA .

--

Jonathan Kingslake

Assistant Professor

Dept. Earth and Environmental Sciences,

Lamont-Doherty Earth Observatory of Columbia University

61 Rt. 9W, Palisades, NY 10964-8000

jkingslake.com

@jkingslake

jjspergel commented 5 years ago

It's been a little while for me too. I believe you just need to move the python code I sent to the folder with the REMA tiles in it, run a terminal with the directory in that same folder with the REMA files, and then run the line of code I sent. If it doesn't work, maybe we can try it tomorrow together.

Julian

On Mon, Jun 10, 2019 at 4:16 PM jkingslake notifications@github.com wrote:

Julian, Yes, I can do this. Although you will have to remind me the details. Jonny

On Mon, Jun 10, 2019 at 3:56 PM jjspergel notifications@github.com wrote:

Hi Jonny,

I forget how long it took to make .json files last time, but if it's a long time, would it be possible to run the new code overnight tonight? The only change is the removal of the link-forming section. As before use:

forfiles /s /m *_COG_LZW.tif /c "cmd /c python ../../satstac_jsoncode.py @path https://github.com/path @file https://github.com/file

Let me know if you can run this tonight or not. No real rush, I just want to schedule work this week. Thanks!

On Wed, Apr 3, 2019 at 3:37 PM jkingslake notifications@github.com wrote:

Done. Where should the json files need to sit? I suspect the answer is that it doesnt matter, but just to check before I upload them.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/rabernat/pangeo-rema/issues/2#issuecomment-479629563 , or mute the thread <

https://github.com/notifications/unsubscribe-auth/Ao-5EZFtUJezbkyzdL8QioFMaRMn9FBDks5vdQL1gaJpZM4b8NLw

.

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

def satstac(rasterPath,rasterName):

from satstac import Catalog, Collection, Item import xarray as xr import pandas as pd import numpy as np from matplotlib import pyplot as plt import rasterio import rasterio.features import rasterio.warp from shapely import geometry import json import cartopy

class REMAStacItem:

def init(self,rasterPath, rasterName):

self.rasterPath = rasterPath self.id = rasterName self.stac_item = {"id": rasterName[:-4], "rownum": str(rasterName)[0:2], "colnum": str(rasterName)[3:5], "type": "Feature", "bbox": self.calcBBox(), "geometry": self.calcGeometry(), "units": 'meters', "resolution": 8 }

self.stac_item['properties']=self.createProperties_meta() self.stac_item['links'] = self.create_links()

def calcBBox(self):

with rasterio.open(self.rasterPath) as dataset: bbox = rasterio.warp.transform_bounds(dataset.crs, 'EPSG:4326',dataset.bounds.left,dataset.bounds.bottom, dataset.bounds.right,dataset.bounds.top)

self.bbox=json.loads(json.dumps(geometry.mapping(bbox)))

return json.loads(json.dumps(bbox))

def calcGeometry(self):

with rasterio.open(self.rasterPath) as dataset: profile = dataset.profile

Read the dataset's valid data mask as a ndarray.

geom = geometry.box(*rasterio.warp.transform_bounds(dataset.crs, 'EPSG:4326',dataset.bounds.left,dataset.bounds.bottom, dataset.bounds.right,dataset.bounds.top))

self.geometry=json.loads(json.dumps(geometry.mapping(geom)))

return json.loads(json.dumps(geometry.mapping(geom)))

def createAssetList(self):

pass

def createProperties_meta(self): meta_prop_dict = {} with rasterio.open(self.id) as dataset: meta_prop_dict = dataset.meta crs = meta_prop_dict['crs'] crs = 'EPSG:'+str(crs.to_epsg()) meta_prop_dict['crs'] = crs profile = dataset.profile return meta_prop_dict

if imdPath:

o = urlparse(imdPath)

if o.scheme == 's3':

s3 = boto3.resource("s3")

bucket = o.netloc

key = o.path.lstrip('/')

obj = s3.Object(bucket, key)

body = obj.get()['Body'].read()

root = ET.fromstring(body)

else:

tree = ET.parse(imdPath)

root = tree.getroot()

self.metaDataStruct = iterate_OverXML(root, recursionTagList=['IMD',

'IMAGE'])

elif vrtPath:

with rasterio.open(vrtPath) as src:

self.metaDataStruct = process_nitf_tags(src.tags())

else:

print("no Data Specified")

self.eoDict = self.processMetaData_To_Properties(self.metaDataStruct,

self.provider, self.license)

return json.loads(json.dumps(meta_prop_dict))

def write_toJSON(self, filename): with open(filename, 'w') as fp: json.dump(self.stac_item, fp, indent=2)

myStacItem = REMAStacItem(rasterPath,rasterName) filename = myStacItem.stac_item['id']+'.json' myStacItem.write_toJSON(filename)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/rabernat/pangeo-rema/issues/2?email_source=notifications&email_token=ALTXJ3L6SAINUDNEUULRPXTPZ2WVLA5CNFSM4G7Q2LYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXLBUHA#issuecomment-500570652 , or mute the thread < https://github.com/notifications/unsubscribe-auth/ALTXJ3IQ7R64NPJNBEGBGDLPZ2WVLANCNFSM4G7Q2LYA

.

--

Jonathan Kingslake

Assistant Professor

Dept. Earth and Environmental Sciences,

Lamont-Doherty Earth Observatory of Columbia University

61 Rt. 9W, Palisades, NY 10964-8000

jkingslake.com

@jkingslake

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rabernat/pangeo-rema/issues/2?email_source=notifications&email_token=AKH3SENWAB5ZU7PMGL3E4VTPZ2Y7VA5CNFSM4G7Q2LYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXLDJOI#issuecomment-500577465, or mute the thread https://github.com/notifications/unsubscribe-auth/AKH3SENV5BYDTCRAQMJUUK3PZ2Y7VANCNFSM4G7Q2LYA .

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

jkingslake commented 5 years ago

The latest code for creating json files is here

jjspergel commented 5 years ago

I have code to collect the json files that works. Let's meet before implementing it so I can make sure that we have a file structure and metadata that we agree on. then we can upload it all to the Google Bucket! `` satstac_collection.txt