Closed: jacktarricone closed this pull request 2 years ago.
we're grabbing data through a git clone right now, do you think this will work when 70 people try to hit it at once? If not we can put it in the temp s3 bucket within a week of when we present.
I'm curious to find out how that goes! putting a backup on S3 sounds good. I know for sure 70 simultaneous reads from S3 work well if you format your data as COGs.
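For anyone following along, a rough sketch of what a direct COG read from S3 could look like (the bucket and key below are made up, and public buckets may need unsigned requests):
import rasterio
from rasterio.windows import Window

# GDAL/rasterio stream ranged reads over HTTP, so many simultaneous readers
# only pull the bytes they need when the file is a Cloud-Optimized GeoTIFF.
# Hypothetical bucket/key; public data may need AWS_NO_SIGN_REQUEST=YES.
with rasterio.open("s3://snowex-hackweek-temp/snow_depth_cog.tif") as src:
    subset = src.read(1, window=Window(0, 0, 512, 512))
    print(subset.shape)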
@jacktarricone I think you should get your pull request merged first. I will add the few small changes I have in one final PR after this gets in. I am worried it will get very messy if I start editing before you get your pieces uploaded.
changed my /tmp paths. having issues with the SQL code in nb #3, but that could be my machine. also took the git merge stuff out of _toc.yml
changed the kernel names on my local notebooks, so we'll see if it passes
@scottyhq not sure what's going wrong here
The logs can be a bit confusing with the cache output. I highly recommend running these notebooks on the JupyterHub (log in, then gh pr checkout 82). Note that each notebook should run top to bottom without intervention. Imagine somebody trying to run notebook 3 before 2, so you need to make sure the data is there in the top cell of each notebook. See how the lidar tutorial does it by creating a release of the repository to have a zip file:
https://snowex.hackweek.io/tutorials/lidar/2_elevation_differencing.html#download-required-tutorial-data https://snowex.hackweek.io/tutorials/lidar/3_common_pitfalls.html
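Roughly what that top cell could look like, sketched out (the release URL and destination folder below are placeholders, not the real ones):
import os
import urllib.request
import zipfile

# placeholder release asset URL and local folder
url = "https://github.com/snowex-hackweek/website/releases/download/vX.Y/tutorial-data.zip"
dest = "/tmp/tutorial-data"

# only download and extract once, so any notebook can run this cell safely
if not os.path.exists(dest):
    zip_path, _ = urllib.request.urlretrieve(url)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)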
I just tried running 2_elevation_differencing.ipynb but get NameError: name 'data_path' is not defined
@micah-prime and @micahjohnson150 - it seems like the database is down right now? All the code we have using the database is no longer working and when I run code from your tutorials it also fails.
@ZachKeskinen I don't think it is down. It's probably due to the snowexsql update which we launched yesterday. To fix this you will have to merge in main, as my update just went into main this morning.
PR #86 updated the environment to use snowexsql==0.3.0. That is what gets installed if you recreate the environment locally now. Otherwise, on the JupyterHub it's there automatically (from a terminal you can confirm with conda list | grep sql, or echo $JUPYTER_IMAGE should show quay.io/uwhackweek/snowex:2022.07.07). If you were logged in earlier, you'd have to `File -> Hub Control Panel -> Stop My Server` and then Start Server to get the latest image.
Gotcha that makes sense. I will re-install the environment and make sure that fixes it. Thanks
🚀 Deployed on https://deploy-preview-82--snowex2022.netlify.app
@micahjohnson150 Thanks! I updated the version and that seems to fix most things. A few things seem to be different in the cells that use the database from before the update:
1. I now have CRS mismatches between snow depth datasets from Grand Mesa where the only difference is the dates I filtered between. However, it seems to be sporadic: the CRS mismatch error shows up on a few runs and then there's no error now?
2. That same cell with the CRS mismatch has gone from running in under 30 seconds to taking over 2 minutes. Did you add more data to the database so that it is pulling more in now? When I run
qry = session.query(PointData)
qry = qry.filter(PointData.type == 'depth')
qry_feb1 = qry.filter(PointData.date >= date(2020, 1, 31))
qry_feb1 = qry_feb1.filter(PointData.date <= date(2020, 2, 2))
df_feb_1 = query_to_geopandas(qry_feb1, engine)
it takes around 2 minutes, but it was running quite quickly before?
3. There now seem to be Nones in sets of data that previously didn't have any? No big deal because I can drop them, but it might be worth checking other tutorials that use the database.
Yeah, there is a lot of GPR data that got added yesterday. And the big thing that happened last week is that the database now has multiple projections in it. So the best fix is to filter by site_name to focus on a site like Grand Mesa. I would add the following to your query to clean it up.
qry = qry.filter(PointData.site_name == 'Grand Mesa')
qry = qry.filter(PointData.instrument == 'magnaprobe')
When forming queries, definitely try to employ some safeguard tactics to avoid waiting 5 minutes to find out something is messed up. One strategy is to use .limit(1000) to test your queries. Another tactic I use, especially on the GPR data (~1M points!), is qry.filter(PointData.id % 1000 == 0), which keeps only every 1000th point, so I can still see whether it follows the expected pattern without having to wait forever.
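Put together, something like this (assuming your usual session/engine setup and the PointData / query_to_geopandas imports used in the other notebooks):
# safeguard 1: cap the result size while iterating on the filters
test_qry = session.query(PointData).filter(PointData.type == 'depth').limit(1000)
test_df = query_to_geopandas(test_qry, engine)

# safeguard 2: thin dense data (e.g. ~1M GPR points) by keeping every 1000th id
thin_qry = session.query(PointData).filter(PointData.type == 'depth')
thin_qry = thin_qry.filter(PointData.site_name == 'Grand Mesa')
thin_qry = thin_qry.filter(PointData.id % 1000 == 0)
thin_df = query_to_geopandas(thin_qry, engine)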
> 3. There now seem to be Nones in sets of data that previously didn't have any? No big deal because I can drop them, but it might be worth checking other tutorials that use the database.
Could you send me a query that you are seeing this in?
So the only piece that has this issue is checking for permittivities. No Nones before but now we get a few if we don't check for them.
qry = qry.filter(LayerData.type == 'permittivity')
df = query_to_geopandas(qry, engine)

es_values = []
# Loop through each snowpit (each unique site_id is a snowpit)
for site_id in np.unique(df.site_id):
    sub = df[df.site_id == site_id]
    # get the permittivity of the highest layer in the snowpack
    es_str = sub.sort_values(by='depth', ascending=False).iloc[0]['value']
    # added this check after the update since we now get Nones
    if es_str is not None:
        es_values.append(float(es_str))
Did you add more snowpits in addition to the GPR?
@micahjohnson150 That was exactly what I needed. Looks like I was pulling in a bunch of GPR data that I didn't want. Back down to 9 seconds. Thanks so much Micah!
@scottyhq Hey Scott, any thoughts on this error?
Added my Banner Summit example and did some grammatical editing. Zach still needs to do a few more edits before this should be merged into main.