nteract / scrapbook

A library for recording and reading data in notebooks.
https://nteract-scrapbook.readthedocs.io
BSD 3-Clause "New" or "Revised" License
281 stars 26 forks source link

Error is thrown when using GET parameters on cloud storage paths #49

Closed matthiasdv closed 5 years ago

matthiasdv commented 5 years ago

Behaviour

As context, we're storing and reading papermill log data on Azure blob storage. Take the following snippet as an example;

import papermill as pm
import scrapbook as sb
import azure.common as ac
import json

papermill_log_output_query_string = "?sig=some-unique-secret-token"

nb_path = "abs://mystorage.blob.core.windows.net/my-actual-notebook.ipynb"

try:
    nb = sb.read_notebook(nb_path + papermill_log_output_query_string)
except ac.AzureMissingResourceHttpError as err:
    logging.info(err)

This raises the following error;

ValueError: Requires an '.ipynb' file extension. Provided path:...

Expected behaviour

Suggested solution

I liked the original behaviour of papermill.record() which smply raised a warning on the supposedly missing extention. Seeing the warning triggered a developer response of "hey, i've provided a GET parameter in the URL so this is probably fine."

Not throwing an error possibly hinders developers that provide a faulty path but may no longer notice. Any thoughts on this?

codecov[bot] commented 5 years ago

Codecov Report

Merging #49 into master will decrease coverage by 0.4%. The diff coverage is 71.42%.

@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
- Coverage   96.14%   95.73%   -0.41%     
==========================================
  Files          10       10              
  Lines         441      446       +5     
==========================================
+ Hits          424      427       +3     
- Misses         17       19       +2
matthiasdv commented 5 years ago

Update: I've changed the ext check to contains instead of endswith() since raising a warning still breaks the flow of the program.

MSeal commented 5 years ago

I think raising a warning instead of an error would be reasonable here as well. I agree that doing contains is better, though we could be a little smarter and do splitext(urlparse(url).path)[-1].endswith('ipynb') for the logic to select the warning. If you wanted to change the code to emit a warning here I'll gladly merge it.

MSeal commented 5 years ago

If you want just this change for now that's fine too. Though a small test to cover the case would be nice either way :)

matthiasdv commented 5 years ago

I've implemented the suggested url ext check. Added a few simple tests too. One thing i'm unsure about here is that Papermill requires the Azure dependency or will throw a missing dependency exception.

In case the Notebook() call checks the path before actually trying to fetch the object, I believe we can safely ignore the dependency. Which I believe is the case.

If I'm wrong about this and we actually need to connect to Aure or AWS before checking the path more work would be needed.

matthiasdv commented 5 years ago

Hmm. Python 2.7 has a module named urlparse, and this is renamed to urllib.parse in python 3.X. This is what's causing the build to fail.

matthiasdv commented 5 years ago

fixed this with a conditional import, but unsure if that's the most idiomatic way