mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Analysis on issue #36 #71

Closed Aimaanhasan closed 5 years ago

Aimaanhasan commented 5 years ago

Analysis on efficiency and usage of extension arrays in dask

Issue #36

birdsarah commented 5 years ago

Hi @Aimaanhasan - this is a great start. Congrats on getting an analysis PR up. Now the back and forth starts :D.

Some next steps:

  1. I would like to see this run on more than just one parquet file to get a more meaningful understanding of the speed-ups. Can you run your analysis on https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent_value_1000_only.parquet.tar.bz2 or https://public-data.telemetry.mozilla.org/bigcrawl/value_1000_only.parquet.tar.bz2?
  2. The notebook gets a little hard to follow due to the fletcher errors. These are good to keep in the notebook, and useful to see. But I think it would be good to pull your write-up and analysis to the top of the notebook. You can then hyperlink to headers further down, so your write-up links to the sections of code where you've produced certain results.
  3. I think a summary table and perhaps a plot would be helpful too.
  4. Please add installation instructions or a link to the fletcher docs for people trying to run your code.
  5. Think about how and when you're going to use fletcher and how to measure / document performance changes.
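One way to approach the measuring and summarising asked for in the list above is sketched here. It is only an illustration using the Python standard library: the helper names `time_call` and `summary_table` are hypothetical, and the two timed workloads are stand-ins on synthetic data, not the real fletcher or dask operations.

```python
import statistics
import time


def time_call(func, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of func()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def summary_table(results):
    """Format {label: seconds} as text, with speed-up relative to the first row."""
    labels = list(results)
    baseline = results[labels[0]]
    width = max(len(label) for label in labels + ["variant"])
    lines = [f"{'variant':<{width}}  seconds  speedup"]
    for label in labels:
        seconds = results[label]
        speedup = baseline / seconds if seconds > 0 else float("inf")
        lines.append(f"{label:<{width}}  {seconds:7.4f}  {speedup:6.2f}x")
    return "\n".join(lines)


# Hypothetical stand-in workloads on synthetic strings; swap in the real
# operations (for example the same string method on two column dtypes).
urls = ["https://example.com/script.js?id=%d" % i for i in range(50_000)]

results = {
    "list comprehension": time_call(lambda: [u.upper() for u in urls]),
    "map": time_call(lambda: list(map(str.upper, urls))),
}
print(summary_table(results))
```

Taking the median over several repeats makes a single slow run less likely to distort the comparison, and the first row acts as the baseline for the speed-up column.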

It seems like you might be struggling to convert your columns. To say the fletcher docs are limited would be an understatement. I had to dig around in the fletcher codebase to figure this out, but now that I have, here's some pseudocode that might be useful:

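import fletcher as fr
import pyarrow as pa
# Build a fletcher extension dtype backed by Arrow strings
fletcher_string_dtype = fr.FletcherDtype(pa.string())
# Change the column's type instead of constructing new column data
df[col] = df[col].astype(fletcher_string_dtype)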
Aimaanhasan commented 5 years ago

Hi, @birdsarah! Thank you so much for your feedback.

I am facing some issues and want to ask some questions regarding the changes.

  1. The pseudocode you gave is not working for me. A TypeError occurs: "TypeError: _from_sequence() got an unexpected keyword argument 'copy'". I then tried the code below instead:
     `df[col] = fr.FletcherDtype(df[col])`
     This gives the error: "TypeError: Column assignment doesn't support type FletcherDtype". If I convert the dask.DataFrame to a pandas.DataFrame, I am able to convert the columns, but converting the pandas.DataFrame back to a dask.DataFrame then crashes the kernel and forces a restart.

  2. I have rearranged the notebook to make it more readable. I have also added installation instructions and a link to the fletcher docs. Should I commit the changes?

  3. Can you please elaborate on the kind of summary table and plots you have in mind?

birdsarah commented 5 years ago

Part 1

`df[col] = fr.FletcherDtype(df[col])`

Converting the data to pandas and back again will never be a solution. This data does not fit in memory. This is why I gave you code that sets the column type to the fletcher dtype, as opposed to creating a whole new column of data, which is what your code does.

I can't debug your error without a full traceback.
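A full traceback can be captured as text and pasted into a comment. The sketch below uses only the standard `traceback` module. The helper name `run_and_capture` is hypothetical, and the failing call is a deliberately simple stand-in for the real conversion code.

```python
import traceback


def run_and_capture(func):
    """Call func() and return the full traceback text if it raises, else None."""
    try:
        func()
    except Exception:
        # format_exc() returns the same text Python would print for the error.
        return traceback.format_exc()
    return None


# Stand-in failure; replace the lambda with the call that is actually failing.
report = run_and_capture(lambda: int("not a number"))
print(report)
```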

Part 2

Yes, always be committing and pushing.

Part 3

I'd like to see you work on that yourself. Just think about how to present the information you have gathered carefully.

birdsarah commented 5 years ago

I've just been reading this, which gives some context for fletcher, so I thought I'd share. https://www.dataschool.io/future-of-pandas/

The trick with dask vs pandas is to remember that dask ends up being lots of little bits of pandas but we have to let dask manage that itself.

Don't get completely stuck, keep trying things and reaching out.

Aimaanhasan commented 5 years ago

Hello @birdsarah, I've tried many ways to convert the columns of a dask.DataFrame, but each attempt gives me an error.

Approach 1

Used the code below to implement:

    import pyarrow as pa

    fletcher_string_dtype = fr.FletcherDtype(pa.string())
    df[df.columns[0]] = df[df.columns[0]].astype(fletcher_string_dtype)

This gives me the following error:

    TypeError                                 Traceback (most recent call last)
    in
          1 import pyarrow as pa
          2 fletcher_string_dtype = fr.FletcherDtype(pa.string())
    ----> 3 df[df.columns[0]] = df[df.columns[0]].astype(fletcher_string_dtype)
          4
          5

    C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in astype(self, dtype)
       1646             meta = self._meta_nonempty.astype(dtype)
       1647         else:
    -> 1648             meta = self._meta.astype(dtype)
       1649         if hasattr(dtype, 'items'):
       1650             # Pandas < 0.21.0, no `categories` attribute, so unknown

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
        176                 else:
        177                     kwargs[new_arg_name] = new_arg_value
    --> 178             return func(*args, **kwargs)
        179         return wrapper
        180     return _deprecate_kwarg

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
       4999             # else, only a single dtype is given
       5000             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
    -> 5001                                          **kwargs)
       5002             return self._constructor(new_data).__finalize__(self)
       5003

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
       3712
       3713     def astype(self, dtype, **kwargs):
    -> 3714         return self.apply('astype', dtype=dtype, **kwargs)
       3715
       3716     def convert(self, **kwargs):

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
       3579
       3580             kwargs['mgr'] = self
    -> 3581             applied = getattr(b, f)(**kwargs)
       3582             result_blocks = _extend_blocks(applied, result_blocks)
       3583

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
        573     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
        574         return self._astype(dtype, copy=copy, errors=errors, values=values,
    --> 575                             **kwargs)
        576
        577     def _astype(self, dtype, copy=False, errors='raise', values=None,

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
        634
        635         # astype processing
    --> 636         dtype = np.dtype(dtype)
        637         if self.dtype == dtype:
        638             if copy:

    TypeError: data type not understood

Approach 2

Used the code below to implement:

    df[df.columns[0]] = fr.FletcherDtype(df[df.columns[0]])

This gives me the following error:

    TypeError                                 Traceback (most recent call last)
    in
    ----> 1 df[df.columns[0]]=fr.FletcherDtype(df[df.columns[0]])
          2
          3

    C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in __setitem__(self, key, value)
       2499             df = self.assign(**{k: value for k in key})
       2500         else:
    -> 2501             df = self.assign(**{key: value})
       2502
       2503         self.dask = df.dask

    C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in assign(self, **kwargs)
       2694                     callable(v) or pd.api.types.is_scalar(v)):
       2695                 raise TypeError("Column assignment doesn't support type "
    -> 2696                                 "{0}".format(type(v).__name__))
       2697             if callable(v):
       2698                 kwargs[k] = v(self)

    TypeError: Column assignment doesn't support type FletcherDtype

It would be very helpful if you could guide me here. I have searched the docs for a solution without success. However, fletcher arrays work perfectly fine with `pandas.DataFrame`. Converting the `dask.DataFrame` to a `pandas.DataFrame` and then applying fletcher arrays is easier and doesn't give an error. Please help me! Thank you.

birdsarah commented 5 years ago

I'm sorry you're having struggles and it's great that you tried a bunch of options. Unfortunately this issue is about figuring out how to work with fletcher. I feel that if I start guiding further from where you are, I'll just be working on the issue myself, which is not the point. I'm going to close this PR for now.