Closed Aimaanhasan closed 5 years ago
Hi @Aimaanhasan - this is a great start. Congrats on getting an analysis PR up. Now the back and forth starts :D.
Some next steps:
It seems like you might be struggling to convert your columns. To say the fletcher docs are limited would be an understatement. I had to dig around in the fletcher codebase to figure this out, but now that I have, here's some pseudocode that might be useful:
import fletcher as fr
import pyarrow as pa
fletcher_string_dtype = fr.FletcherDtype(pa.string())
df[col] = df[col].astype(fletcher_string_dtype)
Hi, @birdsarah! Thank you so much for your feedback.
I am facing some issues and want to ask some questions regarding the changes.
The pseudocode you gave is not working for me. A TypeError occurs:
"TypeError: _from_sequence() got an unexpected keyword argument 'copy'"
I also tried the code below:
df[col] = fr.FletcherDtype(df[col])
This gives the error:
TypeError: Column assignment doesn't support type FletcherDtype
If I convert the dask.DataFrame to a pandas.DataFrame, I am able to convert the columns, but converting the pandas.DataFrame back to a dask.DataFrame then crashes the kernel and forces a restart.
I have rearranged the notebook to make it more readable. I have also added instructions and a link to the fletcher docs. Should I commit the changes?
Can you please elaborate more about the kind of summary table and plots?
df[col] = fr.FletcherDtype(df[col])
Converting the data to pandas and back again will never be a solution. This data does not fit in memory. This is why I gave you the code to show you how to set the column type to fletcher dtype as opposed to creating a whole new column of data which is what your code does.
I can't debug your error without a full traceback.
Yes, always be committing and pushing.
I'd like to see you work on that yourself. Just think about how to present the information you have gathered carefully.
I've just been reading this, which gives some context for fletcher, so I thought I'd share. https://www.dataschool.io/future-of-pandas/
The trick with dask vs pandas is to remember that dask ends up being lots of little bits of pandas but we have to let dask manage that itself.
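To make the "lots of little bits of pandas" idea concrete, here is a rough pandas-only sketch, not the fletcher solution itself. It assumes pandas' built-in "string" extension dtype as a stand-in for the fletcher dtype, and it fakes the partitions by slicing a toy frame by hand. The names convert_partition, TARGET_DTYPE and the toy data are made up for illustration. With a real dask.DataFrame you would not slice anything yourself; a per-partition function like this is the kind of thing dask's map_partitions is meant to run for you.

```python
import pandas as pd

# Assumption: pandas' built-in "string" extension dtype stands in for the
# fletcher dtype, so the sketch runs without fletcher installed.
TARGET_DTYPE = "string"


def convert_partition(part: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of one partition with `col` cast to the extension dtype."""
    return part.assign(**{col: part[col].astype(TARGET_DTYPE)})


# Toy data standing in for a dataset that would not fit in memory.
df = pd.DataFrame({"url": ["a.com", "b.org", "c.net", "d.io"], "n": [1, 2, 3, 4]})

# Simulate partitions by hand: two small pandas DataFrames.
partitions = [df.iloc[:2], df.iloc[2:]]

# Each piece is converted on its own; no step needs the whole dataset at once.
converted = [convert_partition(part, "url") for part in partitions]

# Only glued back together here so the toy result can be inspected.
result = pd.concat(converted)
print(result.dtypes)
```

The point of the sketch is that the conversion is written against a single small pandas DataFrame, and the looping over pieces is left to whatever manages the partitions.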
Don't get completely stuck, keep trying things and reaching out.
Hello @birdsarah, I've tried many ways to convert the columns of the dask.DataFrame, but it gives me an error.
I used the code below:
import pyarrow as pa
fletcher_string_dtype = fr.FletcherDtype(pa.string())
df[df.columns[0]] = df[df.columns[0]].astype(fletcher_string_dtype)
This gives me the following error:
TypeError Traceback (most recent call last)
I'm sorry you're having struggles and it's great that you tried a bunch of options. Unfortunately this issue is about figuring out how to work with fletcher. I feel that if I start guiding further from where you are, I'll just be working on the issue myself, which is not the point. I'm going to close this PR for now.
Analysis on efficiency and usage of extension arrays in dask
Issue #36