pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: Load data (create Series, Dataframe) in a more functional way. #5902

Closed (tinproject closed this issue 8 years ago)

tinproject commented 10 years ago

I've been trying to find out the process of creating DataFrames in order to try to solve #2305 with minimal memory use. I've made some tests that I put in an IPython notebook: http://nbviewer.ipython.org/gist/tinproject/7d0e0de9475b16910fcf

Currently, when reading from a generator, pandas internally materializes the whole generator in memory as a list, using at least twice the memory needed. I suspect it can sometimes be even more if a later type conversion is needed.

Also, as can be seen in the notebook, the input data collection is read in full several times before the data is loaded into the final data structure.

To read from an iterator without putting the whole of it in memory, the values it yields need to be processed one by one, or at least in a single pass.

Once pandas could read data from an iterator in one pass, it could read from any iterable collection, needing only the count of elements to read. That includes collections with implicit length: sequences (list, tuple, str), sets, mappings, etc., and collections with unknown length: generators, iterators, etc., whose size must be given explicitly.

I think this could yield a great improvement in performance and broaden the accepted input data types (in my tests I saw that a tuple collection is not a valid input type for DataFrame!).

Relates to #2305, #2193, #5898

I want to help solve this, but it's a big task, and I'm lost when it comes to the Blocks part. Is it documented anywhere what structure they have and how Blocks work?

jreback commented 10 years ago

can you give an actual use case of this?

are you trying to create really large frames?

and if you are, then why would you not write to disk-based storage, say HDF, and then read chunks as needed?

tinproject commented 10 years ago

Loading data efficiently from a remote endpoint, a database query for example: you have an iterator (a cursor, a pointer to the data) and you know how many values the query returns. The same goes for an HDF file. An iterator over the elements plus a count of elements (a cursor type) could be considered a universal endpoint to any kind of external library.

The problem originally arose when I tried to load some data on my 4GB machine. I read a big file with a generator and tried to load it directly into pandas, but I couldn't. I solved this by creating an empty numpy.ndarray and loading the data from my generator directly into the array, then passing the numpy.ndarray to pandas.
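
A minimal sketch of that workaround, assuming a generator of floats whose total count is known in advance (the file name, count, and helper function are only illustrative):

import numpy as np
import pandas as pd

def gen_values(path):
    # Illustrative source: yield one float per line of a large text file.
    with open(path) as f:
        for line in f:
            yield float(line)

n = 10_000_000                        # number of values, known up front
arr = np.empty(n, dtype="float64")    # allocate the target array once
for i, value in enumerate(gen_values("big_file.txt")):
    arr[i] = value                    # fill in place, no intermediate list
s = pd.Series(arr)                    # hand the filled ndarray to pandas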

We have better machines these days, but it makes no sense not to use them efficiently.

jreback commented 10 years ago

ok, so you want to read say 1GB of data (say that's what the frame ends up being). But right now you say it takes 2GB to create. ok, probably true. What happens if you magically create it using only 1GB and then perform an operation with it (say df + 1), which creates another 1GB object? You have exactly the same problem.

The way to do this is to operate out-of-core. I didn't mean iterating over HDF per se, but rather:

In chunks, create an intermediate store from your generator (it doesn't have to be HDF, but that's generally the most efficient). Then operate on those chunks.

It doesn't matter how much memory you have; when you have a dataset that is a large fraction of your memory and you want to do actual operations, you are forced to do it this way.

I think you can attack the iterator problem in a very similar way.

Process a chunk of data, form it into a frame, repeat, then concat these at the end. I think a way to do this in pandas would be nice; I also think that doing this in the constructor is not intuitive (well, you could do it internally I suppose, but you would maybe need a more sophisticated way of doing this, including additional arguments: not only the count of the generator, but a place for intermediate storage, etc.)

so you could do something like this:

df = DataFrame(data=GeneratorObject(src, additional_args))...

where the GeneratorObject is a pandas object that could handle the construction via chunks
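
A rough sketch of the chunk-and-concat pattern described above, using plain itertools and pd.concat; the chunk size and the source iterator of records are placeholders:

import itertools
import pandas as pd

def frames_from_records(records, chunksize=100_000):
    # Pull a fixed-size chunk of records, build a frame from it, repeat.
    while True:
        chunk = list(itertools.islice(records, chunksize))
        if not chunk:
            break
        yield pd.DataFrame(chunk)

# records: any iterator of dicts/tuples; only one chunk lives as a Python
# list at a time, and the chunk frames are concatenated at the end.
df = pd.concat(frames_from_records(records), ignore_index=True)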

jreback commented 10 years ago

see #3202 which is a template for doing this (a GeneratorObject could be a form of this actually)

tinproject commented 10 years ago

I'm not thinking of any particular data; the goal is to solve the bug and improve pandas.

You are right, space is always needed for operations, but I'm only talking about loading data. When you load chunked data that is not already in memory, you always need twice the memory: half for the chunk DataFrames, half for the final DataFrame. Currently pandas only loads data from memory to memory (previously allocated, externally or internally). What I'm trying to express is loading data from x directly into memory, x being disk, network, memory, or whatever.

This approach may help load the data straight from the pipeline, without needing to store the whole tank first.

The way pandas loads data internally is column based; to load data efficiently from an element/row-based source (an iterator), that might have to change.

I'll start with the simplest case, a Series without an index: it loads 0-dimensional data (elements) from a 1-dimensional collection. Elements can come in one data type and pandas may store them as another, which can be expressed explicitly (dtype) or inferred.

The data collection could be any of the kinds mentioned earlier, with known or unknown length.

To load the data, we could do the following (a rough code sketch follows the list):

  1. Take an iterator to the data
  2. How many elements are there?

    When reading data from a collection of unknown length, the count must be given explicitly. If it isn't given, take it with len(), provided the collection is sized. Otherwise there is no other solution than reading the whole data into an intermediary list.

  3. What type are the elements? Will they be stored as some other type?

    This can be indicated explicitly with the dtype property. If not, it has to be inferred, which can be done by reading the first element of the collection. Since we are trying to do this in one pass, we assume that the type of the first element is compatible with the rest of the elements.

  4. Allocate the memory needed for the array and store the first element.

    The array is initially empty (np.empty, for example) or NA-filled.

  5. Store the rest of the collection in the data structure.

    The element type can be tracked without storing the whole collection in memory; if some element has a different type, we could raise an exception, treat it as an NA value, or reallocate the data into a new array with the new type before moving on to the next element.

  6. Fit the data structure.

    If the iterator is exhausted before reaching the given count, resize the array to the correct size.
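
A minimal sketch of those steps for the Series case, assuming the caller supplies count and dtype whenever the collection is not sized (illustrative only, not the actual pandas internals):

import numpy as np
import pandas as pd

def series_from_iterable(data, count=None, dtype=None):
    it = iter(data)
    # 2. How many elements? Use len() for sized collections, else require count.
    if count is None:
        count = len(data)                    # TypeError for unsized collections
    # 3. Which type? Infer from the first element if no dtype was given.
    first = next(it)
    if dtype is None:
        dtype = np.asarray([first]).dtype
    # 4. Allocate the array once and store the first element.
    arr = np.empty(count, dtype=dtype)
    arr[0] = first
    # 5. Store the rest; an incompatible element raises on assignment.
    n = 1
    for value in it:
        if n >= count:
            break
        arr[n] = value
        n += 1
    # 6. Fit: trim if the iterator was exhausted before reaching count.
    return pd.Series(arr[:n])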

This procedure can be applied to any kind of data collection. The point is to do the checks and the data loading the other way around: by elements (rows) rather than by columns.

This example can be considered the easiest case; the harder case is when a DataFrame reads from an iterator of mappings.

jreback commented 10 years ago

you are welcome to try to implement a solution

you can't change the API and must pass all current tests

lmk if u have questions

tinproject commented 10 years ago

@jreback Could you point me to any kind of doc/explanations of how internals/Block API works?

jreback commented 10 years ago

No docs. You just have to step thru creation code. Here's a mini-version.

BlockManagers hold blocks that are divided by dtype. A block is a container that holds items, which label the 0th dimension of the block; ref_items are references (e.g. by take) back to where these live in the BlockManager. A block generally has 2 or more dimensions (even in the case of a Series); a SparseBlock is the exception. The 0th dimension of a block is the info_axis (e.g. columns for a Frame; they are 'backwards'). Panels have 3-d blocks. Again, the 0th axis holds the items (the info_axis).

Blocks are created and then merged/consolidated. Very rarely are they actually modified in place. When you see an in-place operation, it is actually creating a new block manager (possibly with the same blocks) and replacing it in the top-level object (that's the ._data attribute).

There are some non-trivial things going on, especially when you have duplicate items, which is what the whole _ref_locs business is about.

You just have to spend some time with it tracing flow.

leifwalsh commented 8 years ago

I am also interested in creating a DataFrame in place from a streaming source (generators producing tuples or dicts). Happy to write some C code to get this done, but I don't know where to start.

jreback commented 8 years ago

@leifwalsh not even really sure what this issue was/is about. I think a concrete proposal of 'have pandas do this' in pseudo code would be the best.

tinproject commented 8 years ago

This issue was about creating a DataFrame from an iterator-like object while avoiding copies and allocating no more memory than needed, ideally only the space of the resulting DataFrame.

Note: I'm not aware of the current state of pandas, and perhaps I'm completely mistaken, but as the OP I'll try to explain what this issue was about (at least two years ago).

Suppose we have a table composed of records (rows) that have a number of fields. This table is accessible through an iterator-like object, but its data is not in Python memory space; the data could originally be a remote connection to an SQL cursor, a CSV file, or whatever...

Because we only have access to the table through an iterator, we cannot access the whole table at one time; we can only read one record at a time. Python iterators don't have a notion of index; the only operation available is 'give me the next value', and the reference to that value (record) is lost if it isn't stored.

Pandas needs to access the whole table to infer the data types of the record's fields, so the current approach is to read the whole iterator into a list; then the data from the table is in Python memory space, and pandas can read the whole table and infer the dtypes before creating the final ndarrays for the DataFrame.

The proposal is a constructor for a DataFrame that pre-allocates the internal ndarrays before reading the data. Because we read from an iterator, it needs a count (the number of records to read from the iterator) and the data types of the record's fields in order to allocate ndarrays of the correct size; then it reads the data into them.

This is the simplest case: a defined count of records and the correct data types. A stricter approach like this could be feasible to implement and would save memory.
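
A minimal sketch of that stricter constructor, assuming the caller provides the record count and a field-name to dtype mapping; the function name and signature are hypothetical, not a pandas API:

import numpy as np
import pandas as pd

def dataframe_from_records(records, count, dtypes):
    # dtypes: e.g. {"id": "int64", "price": "float64"}, known up front.
    columns = {name: np.empty(count, dtype=dt) for name, dt in dtypes.items()}
    n = 0
    for record in records:                   # record is a dict-like row
        if n >= count:
            break
        for name, arr in columns.items():
            arr[n] = record[name]            # each record is processed once, never stored
        n += 1
    # Trim in case the source produced fewer records than expected.
    return pd.DataFrame({name: arr[:n] for name, arr in columns.items()})

Note that pd.DataFrame may still copy while consolidating same-dtype columns into blocks (see the later comments in this thread), but the intermediary list of records is gone.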

A more generalized version, with an unknown record count or unknown field types, is far trickier.

I remember from looking at the pandas code a while ago, trying to solve this, that a DataFrame is loaded in a columnar (Series) fashion; this is the main obstacle to loading the data in a row (record) oriented way, from an iterator, without putting all the data from the iterator in memory.

jreback commented 8 years ago

@tinproject this is not the use case of a pandas DataFrame; dealing with streaming data is a non-trivial problem. In some ways it could be accommodated by pandas, but fundamentally the storage layer needs data materialized in memory.

dask and distributed can accomplish what I think you are looking for (and in fact use pandas under the hood for computation).

tinproject commented 8 years ago

@jreback I'm not talking about streaming data, I'm talking about loading data in a row fashion. When I looked at the code two years ago I realized that it's impossible to load data into pandas in a row-wise way without a complete refactor that would not be easily approved as a PR.

Apart from the preliminary process of inferring data types, which is done field by field traversing the whole column, DataFrames are formed from a group of Series, in other words columnar data structures; in the end this means that to load a table from a file you need twice the memory of the file size.

The alternative to save memory, as pointed out in other related issues, is to load the data into numpy ndarrays and then pass them directly to pandas to form the DataFrame.

I understand that it's hard and involves a big change. Thanks for all the great work on pandas!!!

leifwalsh commented 8 years ago

Suppose you have a streaming data source like

def streaming_source():
    for something in something_else:
        yield {'key1': val1, 'key2': val2}

You can currently have pandas create a dataframe with columns key1 and key2 with something like

rows = list(streaming_source())
df = pd.DataFrame(rows)

or if you like, a bit better for not repeating keys,

cols = {}
for row in streaming_source():
    for key, val in row.items():
        cols.setdefault(key, []).append(val)
df = pd.DataFrame(cols)

However, both of these approaches require building a python list of python objects in memory, inside the interpreter, until you have all the data, then asking pandas to reify the dataframe structure. These objects will be spread all over the heap and be larger than the actual data you want, and while you can do it from the C API you're still calling into libpython rather than just building arrays of doubles.

The idea I have in my mind is something that can take a generator or provide an "append" method which builds up numpy ndarrays that grow in place (with doubling arrays or a list of slabs or something smarter), but do so in the packed, column-oriented format we eventually want them to be in. Then, once built, we would hand them to pandas and pandas wouldn't need to rearrange anything.
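
A rough prototype of that idea in pure numpy, with per-column buffers whose capacity doubles when full; it assumes the column dtypes are known up front, and the class name is made up for illustration (the usage below reuses the streaming_source sketch from earlier):

import numpy as np
import pandas as pd

class ColumnAccumulator:
    """Append dict-shaped rows; each column buffer's capacity doubles when full."""

    def __init__(self, dtypes, initial=1024):
        self._bufs = {k: np.empty(initial, dtype=dt) for k, dt in dtypes.items()}
        self._n = 0

    def append(self, row):
        for key, buf in self._bufs.items():
            if self._n >= len(buf):
                # Doubling keeps appends amortized O(1); np.resize copies once per growth.
                self._bufs[key] = buf = np.resize(buf, 2 * len(buf))
            buf[self._n] = row[key]
        self._n += 1

    def to_dataframe(self):
        # Hand the packed, column-oriented buffers (trimmed to size) to pandas.
        return pd.DataFrame({k: buf[:self._n] for k, buf in self._bufs.items()})

acc = ColumnAccumulator({"key1": "float64", "key2": "float64"})
for row in streaming_source():
    acc.append(row)
df = acc.to_dataframe()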

I did a little bit of digging around numpy, couldn't easily find documentation that describes the memory layout well, and didn't want to spend too much time flopping around aimlessly on my own. If someone can point me to the right documentation I could probably whip up a prototype.

Perhaps this is a job for Arrow? cc @wesm

EDIT: to be clear, I don't think this is actually a pandas feature, I think it's a numpy feature. I came here hoping to find someone with more numpy experience.

wesm commented 8 years ago

@leifwalsh sorry it's taken me a month to reply on this. I think having fast streaming data accumulators that yield DataFrames at the end is a good idea -- we have a lot of code floating around for coercing incoming data (with fixed number of rows) into a DataFrame. There's also the internals of pandas.read_csv which deal with type inference and other matters.

As one example of code that I recently worked on: the converter functions for converting HiveServer2 columns into a pandas-compatible representation. This includes deduplicating string values (since you don't want to create tons of copies of the same string):

https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h

As we've discussed at various times on pandas-dev and elsewhere, NumPy hasn't supported pandas particularly well on these types of matters, particularly for non-numeric data. But in the meantime, it's just about creating the "right" NumPy arrays internally (for example: if you have boolean data with nulls, you will have to return a dtype=object ndarray)

leifwalsh commented 8 years ago

@wesm thanks. I think we're aligned on the problem: do the work to construct numpy arrays from a streaming source without involving a Python list as an intermediary. I don't know how to do this yet but could learn. Need to prioritize and schedule the work. I think it's fine if numpy doesn't support us in this and we just figure out the numpy format and construct it in c++ as long as we can hand the result to numpy. Then we can play with dynamic allocation strategies to our hearts' content.

wesm commented 8 years ago

We would probably want to do something semantically similar to the builder classes in the Apache Arrow codebase:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L141

Create some std::vector-like object that accumulates data in an internal buffer (which grows by a factor of 1.5x or 2x when full / reallocation is required) and then gives ownership of the final buffer to a NumPy array. It doesn't need to be in C++ (actually, probably better right now for it not to be until we start digging in on pandas 2.x -- see the pandas-dev@python.org mailing list).

One important thing to keep in mind: you will want to deduplicate string data as it's coming in (each call to PyString_FromStringAndSize allocates memory, Python has no interning mechanism / global hash table by default). It might also make sense to have an option to convert all strings to categorical.
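
A tiny sketch of that deduplication in Python terms: cache each distinct string so repeated values share a single object (a real builder would keep this hash table on the C/C++ side while accumulating):

# Identical incoming strings map to one stored object, so a column of
# repeated values holds many references to a single allocation.
cache = {}

def dedup(value):
    return cache.setdefault(value, value)

values = [dedup(v) for v in ("NY", "SF", "NY", "NY")]
assert values[0] is values[2]    # same object, not a fresh copy per record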

leifwalsh commented 8 years ago

Yup. Sounds like we're on the same page. I just don't know enough about numpy to know exactly how to structure this data in memory as I receive it. We can dig in soon.

tinproject commented 8 years ago

do the work to construct numpy arrays from a streaming source without involving a Python list as an intermediary.

@leifwalsh you summarize what this is all about far, far better than I did: avoid the intermediary list. Just replace 'numpy arrays' with 'pandas DataFrame' and 'streaming source' with 'iterator', and that was my original wish.

Numpy arrays are fixed size and fixed data type; memory is allocated at creation. If you want to increase the size of a numpy array, the only way is to copy it to a new memory location; dynamic memory allocation is outside numpy's scope. I believe it could be possible to downsize a numpy array in place, depending on the platform.

Because we want to avoid memory copying, the right numpy arrays should be created only once. This implies that we must know some metadata about the streaming source/iterator in order to create the arrays: the number of records (size) and the data types of the record's fields. With that, you can create the numpy arrays to put your data in. To avoid creating an intermediary list, and given that we only have access to the current record, all the data from the current record must be processed in one go, so all the numpy arrays must be created before reading the data from the streaming source. Once you have the data in the np arrays, you can create a DataFrame, telling pandas to use your np arrays and not create new ones.

If you want to work on this, I believe the easiest path is to restrict the moving parts: constrain the streaming source metadata (size and types) to known values, and later try to deal with sources of unknown size or types.

shoyer commented 8 years ago

One unfortunate complication of pandas's current memory model (with the block manager) is that it is only sometimes column oriented -- only if adjacent columns all have different dtypes. If you pass in a dict of numpy.ndarrays giving DataFrame columns of the same dtype, these will be "consolidated" into a single 2D numpy.ndarray in C-contiguous order, which means row oriented! This makes it very difficult to construct DataFrames, even from things that memory-map to numpy arrays, without the risk of unnecessary copies. For what it's worth, this is something that we definitely hope to fix in pandas 2.0.
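
A quick way to observe that consolidation, using the private ._data attribute mentioned earlier in the thread (internals differ across pandas versions, so treat this as illustrative):

import numpy as np
import pandas as pd

a = np.arange(3, dtype="float64")
b = np.arange(3, dtype="float64")
df = pd.DataFrame({"a": a, "b": b})

mgr = df._data                                # private internals; newer versions call this ._mgr
print(len(mgr.blocks))                        # expected: 1 -- both float64 columns share one 2D block
print(np.shares_memory(a, mgr.blocks[0].values))  # expected: False -- the input array was copied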

wesm commented 8 years ago

Invested parties should get involved in the discussion on #13944 -- the hope there is that streaming data collectors could efficiently construct the internals of Series / DataFrame while guaranteeing contiguity and zero-copy construction afterwards (there are plenty of cases where you can hit "unavoidable consolidation" inside the guts of DataFrame right now).