man-group / dtale

Visualizer for pandas data structures
http://alphatechadmin.pythonanywhere.com
GNU Lesser General Public License v2.1
4.73k stars 402 forks

Unable to read dataframe on colab #212

Closed jainayush007 closed 4 years ago

jainayush007 commented 4 years ago

Hi. I am unable to read a pandas data frame into D-Tale. Below is the error -

dtale.show(df)

JSONDecodeError                           Traceback (most recent call last)
in ()
      7 dtale_app.USE_NGROK = True
      8
----> 9 dtale.show(df)

4 frames
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
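(Editorial note: "Expecting value: line 1 column 1 (char 0)" is what Python's json module raises when asked to decode an empty or non-JSON string — for example, an empty HTTP response body. A minimal repro of just the error message:)

```python
import json

# json raises this exact message when the input is empty or not JSON at all,
# e.g. an empty HTTP response body.
try:
    json.loads("")
except json.JSONDecodeError as e:
    msg = str(e)
print(msg)
```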

Screenshot of error-

image

df was loaded from a file that can be fetched from the link below - 'https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/coronavirus/corona_dataset_latest.csv'

aschonfeld commented 4 years ago

So I'm not sure how you're loading your data but I was able to load it fine using the following:

import dtale
import dtale.app as dtale_app

dtale_app.USE_NGROK = True

url = 'https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/coronavirus/corona_dataset_latest.csv'
dtale.show_csv(path=url)

I'm also working on adding the ability to specify the index_col parameter so you won't get your first column as "Unnamed: 0".

So then your call would be: dtale.show_csv(path=url, index_col=0)
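(Editorial note: the index_col behavior mirrors pandas.read_csv, to which show_csv forwards its keyword arguments. An offline sketch with hypothetical data, showing why the first column otherwise shows up as "Unnamed: 0":)

```python
from io import StringIO

import pandas as pd

# A CSV whose first column is an unnamed row index (hypothetical data)
raw = ",country,cases\n0,US,100\n1,IN,50\n"

# Without index_col, pandas names the headerless column "Unnamed: 0"
df_default = pd.read_csv(StringIO(raw))

# With index_col=0, that column becomes the index instead
df_indexed = pd.read_csv(StringIO(raw), index_col=0)
```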

FYI, under the hood show_csv is running this code:

import pandas as pd
import requests
from six import PY3, BytesIO, StringIO

def show_csv(**kwargs):
    path = kwargs.pop("path")
    if path.startswith("http://") or path.startswith(
        "https://"
    ):  # add support for URLs
        proxy = kwargs.pop("proxy", None)
        req_kwargs = {}
        if proxy is not None:
            req_kwargs["proxies"] = dict(http=proxy, https=proxy)
        resp = requests.get(path, **req_kwargs)
        assert resp.status_code == 200
        path = BytesIO(resp.content) if PY3 else StringIO(resp.content.decode("utf-8"))
    return pd.read_csv(path, **kwargs)
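(Editorial note: the interesting part of the helper above is the URL branch — the response bytes are wrapped in an in-memory buffer before being handed to pandas.read_csv. A minimal offline sketch of that final step, with hypothetical data and no network call:)

```python
from io import BytesIO

import pandas as pd

# Simulate resp.content from a successful requests.get (hypothetical CSV bytes)
content = b"country,cases\nUS,100\nIN,50\n"

# Same call show_csv ends with: read_csv straight from the in-memory buffer
df = pd.read_csv(BytesIO(content), index_col=0)
```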
jainayush007 commented 4 years ago

Thanks for the response. I had loaded it into a Koalas dataframe and just performed to_pandas.

image

After this I performed pdf = kdf.toPandas(); the remaining script is already part of my issue description. So:

  1. The .toPandas() conversion from Koalas to pandas works, but the result isn't usable with D-Tale.
  2. If I had read the data directly with pandas (I prefer to use Koalas, which is useful for large datasets), would I still face the same issue?
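(Editorial note: a small guard like the following — a sketch, not part of D-Tale — makes the Koalas-to-pandas conversion explicit before calling dtale.show. It assumes the frame exposes a to_pandas()/toPandas() method, as Koalas does:)

```python
import pandas as pd


def ensure_pandas(obj):
    """Return a pandas DataFrame, converting Koalas-style frames if needed.

    Sketch only: relies on the object exposing to_pandas() or toPandas();
    anything else is rejected rather than guessed at.
    """
    if isinstance(obj, pd.DataFrame):
        return obj
    for name in ("to_pandas", "toPandas"):
        if hasattr(obj, name):
            return getattr(obj, name)()
    raise TypeError("dtale.show expects a pandas DataFrame, got %r" % type(obj))
```

Usage would then be dtale.show(ensure_pandas(kdf)) regardless of which frame type you start from.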
jainayush007 commented 4 years ago

Also, I was unable to replicate your successful scenario of loading the data. Am I missing anything? -

image

aschonfeld commented 4 years ago

Interesting that show_csv didn't work for you. It worked fine for me locally. With & without the index_col parameter added. Can you try using that show_csv function I included in my previous comment? Just to see if that loads the data?

jainayush007 commented 4 years ago

Interestingly, that worked!

image

aschonfeld commented 4 years ago

Hmm, well you could always do dtale.show(show_csv(...)). Using dtale.show_csv worked for me with v1.9.0 🤷‍♂️

jainayush007 commented 4 years ago

I am on v1.9.0 too and it still didn't work for me. I am on colab -

image

jainayush007 commented 4 years ago

I believe this issue should be re-opened.

aschonfeld commented 4 years ago

This worked fine for me in google colab: image

Here are the versions of all the packages I have installed (just run !pip freeze to see which versions you have):

absl-py==0.9.0 alabaster==0.7.12 albumentations==0.1.12 altair==4.1.0 asgiref==3.2.10 astor==0.8.1 astropy==4.0.1.post1 astunparse==1.6.3 atari-py==0.2.6 atomicwrites==1.4.0 attrs==19.3.0 audioread==2.1.8 autograd==1.3 Babel==2.8.0 backcall==0.2.0 beautifulsoup4==4.6.3 bleach==3.1.5 blis==0.4.1 bokeh==1.4.0 boto==2.49.0 boto3==1.14.9 botocore==1.17.9 Bottleneck==1.3.2 branca==0.4.1 Brotli==1.0.7 bs4==0.0.1 CacheControl==0.12.6 cachetools==4.1.0 catalogue==1.0.0 certifi==2020.6.20 cffi==1.14.0 chainer==6.5.0 chardet==3.0.4 click==7.1.2 cloudpickle==1.3.0 cmake==3.12.0 cmdstanpy==0.4.0 colorlover==0.3.0 community==1.0.0b1 contextlib2==0.5.5 convertdate==2.2.1 coverage==3.7.1 coveralls==0.5 crcmod==1.7 cufflinks==0.17.3 cvxopt==1.2.5 cvxpy==1.0.31 cycler==0.10.0 cymem==2.0.3 Cython==0.29.20 daft==0.0.4 dash==1.13.4 dash-bootstrap-components==0.10.3 dash-colorscales==0.0.4 dash-core-components==1.10.1 dash-daq==0.5.0 dash-html-components==1.0.3 dash-renderer==1.5.1 dash-table==4.8.1 dask==2.12.0 dataclasses==0.7 datascience==0.10.6 decorator==4.4.2 defusedxml==0.6.0 descartes==1.1.0 dill==0.3.2 distributed==1.25.3 Django==3.0.7 dlib==19.18.0 docopt==0.6.2 docutils==0.15.2 dopamine-rl==1.0.5 dtale==1.9.1 earthengine-api==0.1.226 easydict==1.9 ecos==2.0.7.post1 editdistance==0.5.3 en-core-web-sm==2.2.5 entrypoints==0.3 ephem==3.7.7.1 et-xmlfile==1.0.1 fa2==0.3.5 fancyimpute==0.4.3 fastai==1.0.61 fastdtw==0.3.4 fastprogress==0.2.3 fastrlock==0.5 fbprophet==0.6 feather-format==0.4.1 featuretools==0.4.1 filelock==3.0.12 firebase-admin==4.1.0 fix-yahoo-finance==0.0.22 Flask==1.1.2 Flask-Compress==1.5.0 flask-ngrok==0.0.25 folium==0.8.3 fsspec==0.7.4 future==0.16.0 gast==0.3.3 GDAL==2.2.2 gdown==3.6.4 gensim==3.6.0 geographiclib==1.50 geopy==1.17.0 gin-config==0.3.0 glob2==0.7 google==2.0.3 google-api-core==1.16.0 google-api-python-client==1.7.12 google-auth==1.17.2 google-auth-httplib2==0.0.3 google-auth-oauthlib==0.4.1 google-cloud-bigquery==1.21.0 google-cloud-core==1.0.3 
google-cloud-datastore==1.8.0 google-cloud-firestore==1.7.0 google-cloud-language==1.2.0 google-cloud-storage==1.18.1 google-cloud-translate==1.5.0 google-colab==1.0.0 google-pasta==0.2.0 google-resumable-media==0.4.1 googleapis-common-protos==1.52.0 googledrivedownloader==0.4 graphviz==0.10.1 grpcio==1.30.0 gspread==3.0.1 gspread-dataframe==3.0.7 gym==0.17.2 h5py==2.10.0 HeapDict==1.0.1 holidays==0.9.12 html5lib==1.0.1 httpimport==0.5.18 httplib2==0.17.4 httplib2shim==0.0.3 humanize==0.5.1 hyperopt==0.1.2 ideep4py==2.0.0.post3 idna==2.9 image==1.5.32 imageio==2.4.1 imagesize==1.2.0 imbalanced-learn==0.4.3 imblearn==0.0 imgaug==0.2.9 importlib-metadata==1.6.1 imutils==0.5.3 inflect==2.1.0 intel-openmp==2020.0.133 intervaltree==2.1.0 ipykernel==4.10.1 ipython==5.5.0 ipython-genutils==0.2.0 ipython-sql==0.3.9 ipywidgets==7.5.1 itsdangerous==1.1.0 jax==0.1.69 jaxlib==0.1.47 jdcal==1.4.1 jedi==0.17.1 jieba==0.42.1 Jinja2==2.11.2 jmespath==0.10.0 joblib==0.15.1 jpeg4py==0.1.4 jsonschema==2.6.0 jupyter==1.0.0 jupyter-client==5.3.4 jupyter-console==5.2.0 jupyter-core==4.6.3 kaggle==1.5.6 kapre==0.1.3.1 Keras==2.3.1 Keras-Applications==1.0.8 Keras-Preprocessing==1.1.2 keras-vis==0.4.1 kiwisolver==1.2.0 knnimpute==0.1.0 librosa==0.6.3 lightgbm==2.2.3 llvmlite==0.31.0 lmdb==0.98 lucid==0.3.8 LunarCalendar==0.0.9 lxml==4.2.6 lz4==3.1.0 Markdown==3.2.2 MarkupSafe==1.1.1 matplotlib==3.2.2 matplotlib-venn==0.11.5 missingno==0.4.2 mistune==0.8.4 mizani==0.6.0 mkl==2019.0 mlxtend==0.14.0 more-itertools==8.4.0 moviepy==0.2.3.5 mpmath==1.1.0 msgpack==1.0.0 multiprocess==0.70.10 multitasking==0.0.9 murmurhash==1.0.2 music21==5.5.0 natsort==5.5.0 nbconvert==5.6.1 nbformat==5.0.7 networkx==2.4 nibabel==3.0.2 nltk==3.2.5 notebook==5.2.2 np-utils==0.5.12.1 numba==0.48.0 numexpr==2.7.1 numpy==1.18.5 nvidia-ml-py3==7.352.0 oauth2client==4.1.3 oauthlib==3.1.0 okgrade==0.4.3 opencv-contrib-python==4.1.2.30 opencv-python==4.1.2.30 openpyxl==2.5.9 opt-einsum==3.2.1 osqp==0.6.1 packaging==20.4 
palettable==3.3.0 pandas==1.0.5 pandas-datareader==0.8.1 pandas-gbq==0.11.0 pandas-profiling==1.4.1 pandocfilters==1.4.2 parso==0.7.0 pathlib==1.0.1 patsy==0.5.1 pexpect==4.8.0 pickleshare==0.7.5 Pillow==7.0.0 pip-tools==4.5.1 plac==1.1.3 plotly==4.4.1 plotnine==0.6.0 pluggy==0.7.1 portpicker==1.3.1 prefetch-generator==1.0.1 preshed==3.0.2 prettytable==0.7.2 progressbar2==3.38.0 prometheus-client==0.8.0 promise==2.3 prompt-toolkit==1.0.18 protobuf==3.10.0 psutil==5.4.8 psycopg2==2.7.6.1 ptyprocess==0.6.0 py==1.8.2 pyarrow==0.14.1 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycocotools==2.0.1 pycparser==2.20 pydata-google-auth==1.1.0 pydot==1.3.0 pydot-ng==2.0.0 pydotplus==2.0.2 PyDrive==1.3.1 pyemd==0.5.1 pyglet==1.5.0 Pygments==2.1.3 pygobject==3.26.1 pymc3==3.7 PyMeeus==0.3.7 pymongo==3.10.1 pymystem3==0.2.0 PyOpenGL==3.1.5 pyparsing==2.4.7 pyrsistent==0.16.0 pysndfile==1.3.8 PySocks==1.7.1 pystan==2.19.1.1 pytest==3.6.4 python-apt==1.6.5+ubuntu0.3 python-chess==0.23.11 python-dateutil==2.8.1 python-louvain==0.14 python-slugify==4.0.0 python-utils==2.4.0 pytz==2018.9 PyWavelets==1.1.1 PyYAML==3.13 pyzmq==19.0.1 qtconsole==4.7.5 QtPy==1.9.0 regex==2019.12.20 requests==2.23.0 requests-oauthlib==1.3.0 resampy==0.2.2 retrying==1.3.3 rpy2==3.2.7 rsa==4.6 s3fs==0.4.2 s3transfer==0.3.3 scikit-image==0.16.2 scikit-learn==0.22.2.post1 scipy==1.4.1 screen-resolution-extra==0.0.0 scs==2.1.2 seaborn==0.10.1 Send2Trash==1.5.0 setuptools-git==1.2 Shapely==1.7.0 simplegeneric==0.8.1 six==1.15.0 sklearn==0.0 sklearn-pandas==1.8.0 smart-open==2.0.0 snowballstemmer==2.0.0 sortedcontainers==2.2.2 spacy==2.2.4 Sphinx==1.8.5 sphinxcontrib-websupport==1.2.2 SQLAlchemy==1.3.17 sqlparse==0.3.1 srsly==1.0.2 statsmodels==0.10.2 sympy==1.1.1 tables==3.4.4 tabulate==0.8.7 tbb==2020.0.133 tblib==1.6.0 tensorboard==2.2.2 tensorboard-plugin-wit==1.6.0.post3 tensorboardcolab==0.0.22 tensorflow==2.2.0 tensorflow-addons==0.8.3 tensorflow-datasets==2.1.0 tensorflow-estimator==2.2.0 
tensorflow-gcs-config==2.2.0 tensorflow-hub==0.8.0 tensorflow-metadata==0.22.2 tensorflow-privacy==0.2.2 tensorflow-probability==0.10.0 termcolor==1.1.0 terminado==0.8.3 testpath==0.4.4 text-unidecode==1.3 textblob==0.15.3 textgenrnn==1.4.1 Theano==1.0.4 thinc==7.4.0 tifffile==2020.6.3 toolz==0.10.0 torch==1.5.1+cu101 torchsummary==1.5.1 torchtext==0.3.1 torchvision==0.6.1+cu101 tornado==4.5.3 tqdm==4.41.1 traitlets==4.3.3 tweepy==3.6.0 typeguard==2.7.1 typing==3.6.6 typing-extensions==3.6.6 tzlocal==1.5.1 umap-learn==0.4.4 uritemplate==3.0.1 urllib3==1.24.3 vega-datasets==0.8.0 wasabi==0.7.0 wcwidth==0.2.5 webencodings==0.5.1 Werkzeug==1.0.1 widgetsnbextension==3.5.1 wordcloud==1.5.0 wrapt==1.12.1 xarray==0.15.1 xgboost==0.90 xkit==0.0.0 xlrd==1.1.0 xlwt==1.3.0 yellowbrick==0.9.1 zict==2.0.0 zipp==3.1.0

jainayush007 commented 4 years ago

So, these are the differences found (your version vs. mine):

plotly==4.4.1 (yours) vs plotly==4.8.2 (mine)
pyarrow==0.14.1 (yours) vs pyarrow==0.15.1 (mine)

I needed the plotly upgrade to enable plotly as the pandas plotting backend, and the pyarrow upgrade for Koalas.

!pip install -U plotly
!pip install pyarrow==0.15.1

aschonfeld commented 4 years ago

But did downgrading those fix it? The issue seems to be with some sort of dependency. I don't have anything pinned in D-Tale, so I'm kind of at the mercy of people's environments.

Honestly, it seems like some sort of character-encoding problem, which is odd since you're using Google Colab and I should see the same issue too. Which version of Python are you using? I'm using 3.6, I believe.

aschonfeld commented 4 years ago

Is there any chance there's more to the stack trace? I see it says "4 frames" on your screenshot. It would be easier to debug if I could see where the issue originates.

You can also try:

d = dtale.show(show_csv(path=url))
d._main_url

Then see if you can access that link. It might be a Jupyter problem.

jainayush007 commented 4 years ago

This is interesting: I just instantiated my colab notebook and ran the script again. I got a new error this time -

image

I could successfully load the file through Koalas and convert it to PySpark and pandas DataFrames. I was later able to load both the pandas and Koalas DataFrames into D-Tale as well -

image

There is no difference in any libraries between my last and current results of !pip freeze.

aschonfeld commented 4 years ago

The first error, I'm assuming, is because you hadn't loaded the show_csv function into memory yet. Can you try executing the cell with the show_csv definition and then running the code again?

TanushGoel commented 4 years ago

You can always download it in one line via a "wget" shell command. Then use pandas to parse and head the file.

Screen Shot 2020-07-05 at 5 35 28 AM
jainayush007 commented 4 years ago

@TanushGoel - Thanks for the tip! Are there any intrinsic benefits (memory? storage?) to loading with pandas? I believe it loads the file into the Google Colab VM, which shouldn't impact my 15 GB storage limit?

@aschonfeld - You were right! I missed loading the function. Is there a way to avoid loading the function?

image

I am wondering what could have happened earlier, that it didn't load. I checked !pip freeze and all libraries are exactly the same as yesterday.

aschonfeld commented 4 years ago

So the show_csv function I gave you was just so I could show you what code was being executed under the hood of D-Tale. If you use d = dtale.show_csv(path=url) it should do the same thing.

So the only thing I can think of is that something with google colab doesn't like when it tries to return the D-Tale instance directly which was why I told you to store it in a variable d and then pull the url for viewing using _main_url.

That being said, I was able to view it fine in my google colab notebook without storing my instance in a variable. So the only thing I can think of that is causing the issue is that you have spark installed in your notebook and I don't. I know spark does some special stuff to environments using java so there has to be something to that...

I don't think there are any intrinsic benefits to loading with pandas other than the fact that D-Tale is built for pandas data structures (think of my earlier post where I showed the exception from trying to pass a Koalas dataframe directly). I don't have a ton of knowledge about the memory management of Spark, so maybe you'd get some benefit there 🤔

jainayush007 commented 4 years ago

Thanks for your help! Closing the issue.

jainayush007 commented 4 years ago

Seems like d._main_url isn't functional anymore. I probably won't need it since I can see the link is generated, but I wanted to bring it to your attention.

[image: image.png]


jainayush007 commented 4 years ago

And the D-Tale webpage doesn't open with the given link!

[image: image.png]


aschonfeld commented 4 years ago

Hmm, still seems to work for me https://youtu.be/sICRVt2ywFs

jainayush007 commented 4 years ago

I think if you run it twice, it changes to an incorrect URL.

Also, _main_url seems to be not working.
