Accept pandas dataframes for `row_attrs` and `col_attrs`

linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org

BSD 2-Clause "Simplified" License


Accept pandas dataframes for `row_attrs` and `col_attrs` #11

Closed olgabot closed 7 years ago

olgabot commented 7 years ago

All of my gene and cell metadata is stored as pandas dataframes and it's a pain to have to convert those to dictionaries every time for loompy. Can loompy simply accept Pandas dataframes for these attributes?

JobLeonard commented 7 years ago

I'm not familiar with pandas, but looking at the API I guess you would like to have the dataframe column names converted to attribute names, and the matching column to a global attribute?

That seems like a useful enough convenience function.

On the other hand, loompy is currently relatively bare-bones - we're not using pandas in the library, and I think it would be nice if we don't force it on people who don't use it.

What about making some kind of glue- or wrapper-library that adds this functionality, or monkey-patches it in?

JobLeonard commented 7 years ago

@gioelelm, you also use pandas a lot, right?

Would it make sense to create a separate loom-pandas package which is merely a wrapper around loompy that handles conversions back and forth?

gioelelm commented 7 years ago

@JobLeonard I personally prefer to work directly with numpy, especially for big data, to have more control over the efficiency of my matrix operations. However, pandas is a big deal in the pydata community, so it is reasonable to provide a function that bridges the two packages.

However, I don't think it is worth going as far as making another package. A method load_attrs_from_df should do. Maybe even that is overkill: in fact, all it takes for an appropriate conversion from pandas to a col attr dictionary should be a single function (actually method) call.

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0.5, 0.75, 1]}, index=['a', 'b', 'c'])
>>> df.to_dict("list")
{'col1': [1, 2, 3], 'col2': [0.5, 0.75, 1.0]}

I am sure that @olgabot is referring to some more tricky situations, but then the question is how to predict all the possible scenarios since there is no standard pandas format for storing this kind of metadata.
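One such case is easy to illustrate: `to_dict("list")` drops the dataframe index, which is often where cell or gene IDs live. The snippet below is a minimal sketch of a workaround, not anything in loompy. The helper name `df_to_attrs` and the attribute name `CellID` are assumptions made up for the example.

```python
import pandas as pd


def df_to_attrs(df, index_name=None):
    """Hypothetical helper (not part of loompy): turn a DataFrame into an
    attribute dictionary, optionally keeping the index as its own attribute."""
    if index_name is not None:
        # Name the index, then move it into a regular column so that
        # to_dict("list") does not silently drop it.
        df = df.rename_axis(index_name).reset_index()
    return df.to_dict("list")


df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0.5, 0.75, 1]}, index=['a', 'b', 'c'])

# "CellID" is an assumed attribute name, chosen only for illustration.
attrs = df_to_attrs(df, index_name="CellID")
print(attrs)
```

The original dataframe is left untouched, and `reset_index()` raises an error if a column with the chosen name already exists, so an ID column is not overwritten by accident.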

JobLeonard commented 7 years ago

Well, if the "base case" is that simple, we should probably include an example in the documentation - that is, a small section with "integrating with pandas"

slinnarsson commented 7 years ago

A pandas-oriented tutorial would be good! But no special code unless there’s a very compelling case.

Sten

slinnarsson commented 7 years ago

I pushed a fix that does more extensive normalization of inputs during create() and set_attr(). You should now be able to pass list, tuple, np.ndarray, np.matrix or scipy.sparse, and the elements can be any kind of string, string object, or number. All will be normalized to conform to the spec.

You can now directly convert a pandas DataFrame to a row/col dictionary for create(), like @gioelelm suggested (but now it works):

col_attrs = df.to_dict("list")

Let me know if this is good enough.
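For anyone landing here later, the following is a rough sketch of preparing both attribute dictionaries from dataframes before calling create(). The toy matrix, the dataframe contents, and the file name are made up for illustration, and the create() call is left commented out because its exact signature may differ between loompy versions.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example: 2 genes (rows) x 3 cells (columns).
matrix = np.arange(6).reshape(2, 3)
gene_df = pd.DataFrame({"Gene": ["Actb", "Gapdh"]})
cell_df = pd.DataFrame({"CellID": ["a", "b", "c"], "Cluster": [0, 1, 1]})

# One dictionary entry per dataframe column, as suggested in this thread.
row_attrs = gene_df.to_dict("list")
col_attrs = cell_df.to_dict("list")

# Sanity check: row attributes need one value per matrix row,
# column attributes one value per matrix column.
assert all(len(v) == matrix.shape[0] for v in row_attrs.values())
assert all(len(v) == matrix.shape[1] for v in col_attrs.values())

# Assumed call shape; check the loompy documentation for your version:
# loompy.create("example.loom", matrix, row_attrs, col_attrs)
```

Checking the lengths up front gives a clearer error than a failure halfway through writing the file.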