mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Feature Analysis to find best features and issue #24 #97

Closed · devayanipowar closed this 4 years ago

devayanipowar commented 5 years ago

I was running into a lot of memory issues even with the 10% dataset, so I used CSV files to do the feature analysis. I am still trying to get it working in Python, and I would like to keep working after the deadline so I can get your input on further improving my heuristic. I was trying to use entropy for feature engineering and to predict whether a script would be blocked by an adblocker.

devayanipowar commented 5 years ago

I definitely plan to work on this and get it to a mergeable state, because the question of which feature affects what is actually keeping me awake. I am just trying to find a way to handle my memory issues and to convert the data into vectors so that I can calculate entropy successfully. I have done it with a small dataset in pandas. Are all pandas functions used in a similar way in dask?
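(For reference, a minimal sketch of the kind of per-column entropy calculation described above, in plain pandas; the shannon_entropy helper and the toy script_url data are illustrative, not code from this PR:)

import numpy as np
import pandas as pd

def shannon_entropy(series):
    # probabilities of each distinct value in the column
    probs = series.value_counts(normalize=True)
    # Shannon entropy in bits: -sum(p * log2(p))
    return float(-(probs * np.log2(probs)).sum())

df = pd.DataFrame({"script_url": ["a.js", "a.js", "b.js", "c.js"]})
print(shannon_entropy(df["script_url"]))  # 1.5 bits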

birdsarah commented 5 years ago

If you have clean pandas code, you can often do little more than replace the dataframe at the top with a dask dataframe.
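As a minimal sketch of that swap (the file name is a placeholder):

# pandas version
import pandas as pd
df = pd.read_csv("clean_data.csv")
n = df.script_url.nunique()

# dask version: same operations, but lazy until .compute()
import dask.dataframe as dd
df = dd.read_csv("clean_data.csv")
n = df.script_url.nunique().compute()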

If you do one complete notebook in pandas with just one file and then push it, I can review it for things that are likely to trip you up when switching to dask.

Have you reviewed my dask tips notebook? https://github.com/mozilla/overscripted/blob/master/analyses/issue_34_setup_and_dask_tips.ipynb

If you're having a hard time with memory, the other recommendation I would make is to not use dask.distributed. So you would not set up a "Client", which you will see in all my code.
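(For reference, this is the kind of dask.distributed setup to skip in that case:)

from dask.distributed import Client
client = Client()  # starts a local scheduler and workers; omit this when memory is tight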

Instead you just do

from dask.diagnostics import ProgressBar
import dask.dataframe as dd

df = dd.read_parquet(...)  # path to the parquet files
# normal dask stuff
# when you want to get out a result
with ProgressBar():
    n_scripts = df.script_url.nunique().compute()

There is no persist in this case (which I would guess might have been what was causing you problems). Only try to compute small things, e.g. nunique or a list that you expect to be short. If you are trying to get new derived data that's potentially large, write it out to a new file and then read it back in with dask, as in the sketch below. That way you're not trying to hold it in memory.
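A minimal sketch of that write-out-and-read-back pattern (the derived per-script counts and the file name are assumptions for illustration):

# hypothetical derived data: call counts per script (column name is assumed)
derived = df.groupby("script_url").size().to_frame("n_calls").reset_index()

with ProgressBar():
    # writing to parquet computes the result straight to disk, not into memory
    derived.to_parquet("script_counts.parquet")

# read it back lazily with dask for further work
derived = dd.read_parquet("script_counts.parquet")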

devayanipowar commented 5 years ago

Thank you so much. I will try it this way.

aliamcami commented 4 years ago

Closing this PR due to lack of activity, please feel free to reopen.