pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io

Error Loading Intake Catalog #125

Open Castronova opened 3 years ago

Castronova commented 3 years ago

I'm getting an error when loading the intake catalog as described in https://catalog.pangeo.io/ and https://github.com/pangeo-data/pangeo-datastore/blob/master/README.md.

>>> import intake
>>> from intake import open_catalog

>>> intake.__version__
'0.6.2'

>>> cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")

gives the following exception:

...
...
File "/Users/castro/miniconda2/envs/tiledb/lib/python3.8/site-packages/yaml/scanner.py", line 1238, in scan_flow_scalar_spaces
    raise ScannerError("while scanning a quoted scalar", start_mark,
yaml.scanner.ScannerError: while scanning a quoted scalar
  in "<unicode string>", line 13, column 13:
          path: "{{CATALOG_DI
                ^
found unexpected end of stream
  in "<unicode string>", line 13, column 26:
          path: "{{CATALOG_DI
                             ^

Any help is greatly appreciated.

valpesendorfer commented 3 years ago

Same issue here.

I've also tried the example in README without success.

The browsable catalog on pangeo.io seems to be affected too.

rabernat commented 3 years ago

Sorry for the slow reply. Thanks for reporting these errors.

I am unable to reproduce this error. On https://staging.us-central1-b.gcp.pangeo.io/ with intake version 0.6.2, I was able to run

from intake import open_catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")

without errors.

> I've also tried the example in README without success.

The README was using an old-style intake syntax. The preferred usage is open_catalog. I have updated the README in #126.

> The browsable catalog on pangeo.io seems to be affected too.

I'm not seeing any problems right now.


I wonder if this was a github glitch. @Castronova could you try again?

valpesendorfer commented 3 years ago

Can confirm the browsable catalog works.

However, running a fresh install of intake version 0.6.2, I still get an error running your snippet:

from intake import open_catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")


rabernat commented 3 years ago

🤔 I'm confused. Maybe @martindurant has some ideas?

martindurant commented 3 years ago

What version of fsspec is this?

rabernat commented 3 years ago

import fsspec
print(fsspec.__version__)
# -> 2021.04.0

valpesendorfer commented 3 years ago

Was running fsspec version 0.9.0.

After upgrading to 2021.04.0 everything works as expected!

Thanks
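
For anyone else landing here with the same traceback, a quick way to check whether the installed fsspec matches the release reported as failing above. This is only a minimal sketch: the helper names (`fsspec_is_affected`, `installed_fsspec_version`) are hypothetical, and the set of bad versions contains only 0.9.0 because that is the only failing release named in this thread.

```python
from importlib import metadata

# Assumption: only the release reported as failing in this thread is listed.
# This is not an exhaustive list of affected fsspec versions.
KNOWN_BAD = {"0.9.0"}


def fsspec_is_affected(version: str) -> bool:
    """Return True if `version` is the release reported as failing here."""
    return version.strip() in KNOWN_BAD


def installed_fsspec_version():
    """Return the installed fsspec version string, or None if it is absent."""
    try:
        return metadata.version("fsspec")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_fsspec_version()
    if version is None:
        print("fsspec is not installed")
    elif fsspec_is_affected(version):
        print(f"fsspec {version} matches the failing release; try upgrading")
    else:
        print(f"fsspec {version} is not the release reported as failing here")
```

If the check flags the installed version, upgrading fsspec is what resolved the error in this case.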

rabernat commented 3 years ago

Sorry for the friction!

Martin, can you help me understand the source of this error better? I don't understand why fsspec is involved here. And since it is involved, why is a compatible version not a required dependency of intake?

martindurant commented 3 years ago

Note that the following is equivalent and probably more likely to succeed on all versions:

cat = intake.open_catalog("github://pangeo-data:pangeo-datastore@/intake-catalogs/master.yaml")

@rabernat: in older versions of fsspec, files that were smaller than a blocksize were always downloaded in one go and thereafter read from an in-memory BytesIO (but the error would have shown up for larger files). This bypassed any chance of file caching. It was changed in 0.9.0 to use the standard fetching mechanism. Unfortunately, that mechanism makes use of the apparent file size. GitHub reports the file size of the gzipped version of the file, which is smaller than the real size, so you only get part of the file. In 2021.04.1, we explicitly ask for the size without compression ("Accept-Encoding: identity"), which gets the right value. Arguably, Intake should use fs.cat instead of open().read(), making fewer assumptions.
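
To make the size mismatch concrete, here is a rough standard-library simulation of the effect described above. It is not fsspec's actual code: the catalog text is made up, and `read_using_size` is a hypothetical stand-in for a reader that trusts the reported size. It only shows that reading "size of the gzipped file" bytes from the uncompressed content cuts the YAML off partway through.

```python
import gzip
import io

# Made-up stand-in for a small catalog file (not the real master.yaml).
lines = ["sources:"]
for i in range(20):
    lines.append(f"  entry_{i}:")
    lines.append('    path: "{{CATALOG_DIR}}/entry_' + str(i) + '.yaml"')
body = "\n".join(lines).encode()

real_size = len(body)
# Assumption: the server reports the size of the gzipped representation.
reported_size = len(gzip.compress(body))


def read_using_size(data: bytes, size: int) -> bytes:
    """Read at most `size` bytes, as a reader trusting `size` would."""
    return io.BytesIO(data).read(size)


truncated = read_using_size(body, reported_size)
complete = read_using_size(body, real_size)

print(f"real size: {real_size}, reported (gzipped) size: {reported_size}")
print("tail of the truncated read:")
print(truncated.decode()[-40:])
```

The truncated read stops partway through the document, which is consistent with the "found unexpected end of stream" error in the original report.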