oceansites / software

Collection of software utilities and code in a variety of languages to search, create, manipulate or generally 'use' OceanSITES data

Trying to run after 2-year hiatus #4

Closed MBARIMike closed 6 years ago

MBARIMike commented 6 years ago

I tried this on a new VM with Python 3.6:

git clone https://github.com/oceansites/software.git
cd software/compliance_report/
pip install -r requirements.txt
time ./compliance_report.py --test cf acdd --format summary http://data.nodc.noaa.gov/thredds/catalog/ndbc/oceansites/DATA/catalog.html > OS_GDAC.csv

and got this error:

$ time ./compliance_report.py --test cf acdd --format summary http://data.nodc.noaa.gov/thredds/catalog/ndbc/oceansites/DATA/catalog.html > OS_GDAC.csv
Traceback (most recent call last):
  File "./compliance_report.py", line 21, in <module>
    from bs4 import BeautifulSoup
  File "/vagrant/dev/stoqsgit/venv-stoqs/lib64/python3.6/site-packages/bs4/__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "/vagrant/dev/stoqsgit/venv-stoqs/lib64/python3.6/site-packages/bs4/builder/__init__.py", line 314, in <module>
    from . import _html5lib
  File "/vagrant/dev/stoqsgit/venv-stoqs/lib64/python3.6/site-packages/bs4/builder/_html5lib.py", line 70, in <module>
    class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
AttributeError: module 'html5lib.treebuilders' has no attribute '_base'

real    0m0.395s
user    0m0.168s
sys     0m0.098s

Looks like there's some work to do...
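
The traceback shows bs4 subclassing `html5lib.treebuilders._base.TreeBuilder`, a module the installed html5lib does not expose, which suggests a version mismatch between the two packages. My understanding is that html5lib renamed this module to `base` in a later release and that newer beautifulsoup4 releases account for it; treat both as assumptions to verify against your environment. A minimal diagnostic sketch follows, where `find_submodule` and `describe_html5lib` are hypothetical helper names, not part of either library:

```python
import importlib


def find_submodule(package, candidates):
    """Return the first name in candidates importable as package.<name>, else None."""
    for name in candidates:
        try:
            importlib.import_module(package + "." + name)
            return name
        except ImportError:
            # Covers both a missing package and a missing submodule.
            continue
    return None


def describe_html5lib():
    """Report which treebuilder base module the installed html5lib provides."""
    name = find_submodule("html5lib.treebuilders", ("_base", "base"))
    if name is None:
        return "html5lib is missing or exposes neither _base nor base"
    if name == "_base":
        return "html5lib exposes treebuilders._base (the name this bs4 expects)"
    return "html5lib exposes treebuilders.base; this bs4 expects _base, so try upgrading beautifulsoup4"


if __name__ == "__main__":
    print(describe_html5lib())
```

Running it inside the same virtualenv that produced the traceback shows which side of the mismatch needs to change.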

MBARIMike commented 6 years ago

Now trying to get this to work in Anaconda on a mac...

conda create --name compliance_report
source activate compliance_report
conda install -c anaconda netcdf4
while read -r requirement; do conda install -c conda-forge --yes "$requirement"; done < requirements.txt
conda install -c conda-forge --yes cython

Some of the above can probably be simplified, but now

python compliance_report.py --test cf acdd --format summary http://data.nodc.noaa.gov/thredds/catalog/ndbc/oceansites/DATA/MBARI/catalog.xml

gives a report:

...
https://dods.ndbc.noaa.gov/thredds/dodsC/oceansites/DATA/MBARI/OS_MBARI-M1_20140716_R_TS.nc,80.9,93.8
https://dods.ndbc.noaa.gov/thredds/dodsC/oceansites/DATA/MBARI/OS_MBARI-M1_20150729_R_M.nc,83.3,89.0
https://dods.ndbc.noaa.gov/thredds/dodsC/oceansites/DATA/MBARI/OS_MBARI-M1_20150729_R_TS.nc,80.9,93.8
https://dods.ndbc.noaa.gov/thredds/dodsC/oceansites/DATA/MBARI/OS_MBARI-M1_20150730_R_M.nc,83.3,94.1
https://dods.ndbc.noaa.gov/thredds/dodsC/oceansites/DATA/MBARI/OS_MBARI-M1_20150730_R_TS.nc,80.9,93.8
...

MBARIMike commented 6 years ago

I had Python 3.5 installed on my Mac via an Anaconda installation I did a year ago.

I updated it to Python 3.6 (the current one at the time) with:

conda update --prefix /Users/mccann/anaconda anaconda

It took a while.

Then I re-created my environment:

conda remove --name compliance_report --all
conda create --name compliance_report
source activate compliance_report
conda install -c conda-forge compliance-checker
conda install -c conda-forge beautifulsoup4

This took tens of minutes.

Now, at least this runs without import errors:

python compliance_report.py --test cf acdd --format summary -v "http://data.nodc.noaa.gov/thredds/catalog/ndbc/oceansites/DATA/catalog.xml"

MBARIMike commented 6 years ago

Looks like this commit in the compliance-checker API broke compliance_report.py.

MBARIMike commented 6 years ago

After this commit I now get this:

(compliance_report) medusa-3:compliance_report mccann$ python compliance_report.py --test cf:1.6 acdd --format summary "http://data.nodc.noaa.gov/thredds/catalog/ndbc/oceansites/DATA/MBARI/catalog.xml"
url,acdd,cf:1.6
http://data.nodc.noaa.gov/thredds/dodsC/ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100402_R_TS.nc,78.9,99.1
http://data.nodc.noaa.gov/thredds/dodsC/ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100402_R_M.nc,68.5,99.4
http://data.nodc.noaa.gov/thredds/dodsC/ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100401_R_TS.nc,77.6,99.1
/Users/mccann/anaconda/envs/compliance_report/lib/python3.6/site-packages/compliance_checker/acdd.py:258: UserWarning: WARNING: valid_min not used since it cannot be safely cast to variable data type
...

That warning message seems new...
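
For anyone post-processing the summary output, here is a minimal sketch of averaging the scores per checker. It assumes the CSV layout shown above (a `url` column followed by one column per test); `mean_scores` and `SAMPLE_SUMMARY` are hypothetical names, and the sample rows are copied from the output above. Python's `warnings` module writes to stderr by default, so a line like the `UserWarning` should not end up in a file created with `>`.

```python
import csv
import io
from statistics import mean

# Sample rows copied from the summary output in this comment.
SAMPLE_SUMMARY = """url,acdd,cf:1.6
http://data.nodc.noaa.gov/thredds/dodsC/ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100402_R_TS.nc,78.9,99.1
http://data.nodc.noaa.gov/thredds/dodsC/ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100402_R_M.nc,68.5,99.4
http://data.nodc.noaa.gov/thredds/dodsC/ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100401_R_TS.nc,77.6,99.1
"""


def mean_scores(csv_text):
    """Return {test_name: mean score} for every non-url column in the summary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = [name for name in reader.fieldnames if name != "url"]
    scores = {name: [] for name in columns}
    for row in reader:
        for name in columns:
            scores[name].append(float(row[name]))
    # Skip columns with no data rows so mean() never sees an empty list.
    return {name: mean(values) for name, values in scores.items() if values}


if __name__ == "__main__":
    for test_name, score in mean_scores(SAMPLE_SUMMARY).items():
        print(test_name, round(score, 1))
```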

MBARIMike commented 6 years ago

It takes a loooonnnnggg time to crawl the OPeNDAP directories to produce these compliance reports. I'm following up on a suggestion to use xarray and asyncio to speed things up.
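
To make the crawl step concrete, here is a rough sketch of pulling OPeNDAP URLs out of a THREDDS `catalog.xml` with the standard library. It is not how compliance_report.py does it (the traceback earlier in this thread shows the script importing BeautifulSoup). The inline catalog is a hand-written, hypothetical sample, `opendap_urls` is a hypothetical helper, and the `dodsC` base URL is inferred from the report lines earlier in this thread.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

# Hypothetical, hand-written catalog fragment for illustration only.
SAMPLE_CATALOG = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="MBARI">
    <dataset name="OS_MBARI-M2_20100402_R_TS.nc"
             urlPath="ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100402_R_TS.nc"/>
    <dataset name="OS_MBARI-M2_20100402_R_M.nc"
             urlPath="ndbc/oceansites/DATA/MBARI/OS_MBARI-M2_20100402_R_M.nc"/>
  </dataset>
</catalog>
"""


def opendap_urls(catalog_xml, dodsc_base):
    """Return one OPeNDAP URL per dataset element that carries a urlPath."""
    root = ET.fromstring(catalog_xml)
    urls = []
    for dataset in root.iter(THREDDS_NS + "dataset"):
        url_path = dataset.get("urlPath")
        if url_path:  # container datasets have no urlPath
            urls.append(dodsc_base + url_path)
    return urls


if __name__ == "__main__":
    base = "http://data.nodc.noaa.gov/thredds/dodsC/"
    for url in opendap_urls(SAMPLE_CATALOG, base):
        print(url)
```

A real crawl would also need to follow `catalogRef` elements into sub-catalogs, which is where most of the waiting happens.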

MBARIMike commented 6 years ago

Following the Kiel meeting, the script now works; see this Jupyter Notebook.

Like most software, compliance_report.py is a work in progress. It can be improved by making it run in parallel using xarray, dask, and asyncio. This will be tracked in a new issue.
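
As a starting point for that work, here is a minimal sketch of the asyncio side only, using the standard library: blocking per-dataset checks are handed to a thread pool and gathered in input order. `check_dataset` is a hypothetical stand-in that does no real checking, and `check_all` and `run` are likewise hypothetical names. Whether the actual compliance check is safe to run in threads is an open question this sketch does not answer.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def check_dataset(url):
    """Hypothetical stand-in for one blocking compliance check of a dataset URL."""
    return url, len(url)  # placeholder "score"; a real check would open the dataset


async def check_all(urls, check=check_dataset, max_workers=4):
    """Run check(url) for every URL in a thread pool; results keep input order."""
    loop = asyncio.get_event_loop()  # inside a coroutine this is the running loop
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, check, url) for url in urls]
        return await asyncio.gather(*futures)


def run(urls, **kwargs):
    """Drive check_all from synchronous code on a private event loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(check_all(urls, **kwargs))
    finally:
        loop.close()


if __name__ == "__main__":
    for url, score in run(["OS_MBARI-M1_20150729_R_M.nc", "OS_MBARI-M1_20150729_R_TS.nc"]):
        print(url, score)
```

Threads suit this case because most of the time goes to waiting on remote OPeNDAP reads; `max_workers` also caps how many requests hit the server at once.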

benjwadams commented 6 years ago

@MBARIMike, in the past, I've tried parallelizing the Compliance Checker workloads but ran into some issues. I'm probably going to be making an issue regarding this. If you'd like to see enhanced support for multiple file datasets, please feel free to submit a feature request.