pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Deprecate Series / DataFrame.append #35407

Closed TomAugspurger closed 2 years ago

TomAugspurger commented 4 years ago

I think that we should deprecate Series.append and DataFrame.append. They're making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place. The data for the index and values needs to be copied to create the result.

These are also apparently popular methods. DataFrame.append is around the 10th most visited page in our API docs.

Unless I'm mistaken, users are always better off building up a list of values and passing them to the constructor, or building up a list of NDFrames followed by a single concat.
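As a minimal sketch of the two alternatives mentioned above (the data and column names here are made up for illustration): collect plain values and call the constructor once, or collect pandas objects and call concat once.

```python
import pandas as pd

# Alternative 1: accumulate plain values, then call the constructor once.
values = []
for i in range(3):
    values.append(i * i)
squares = pd.Series(values)

# Alternative 2: accumulate DataFrames, then concatenate once at the end.
pieces = []
for i in range(3):
    pieces.append(pd.DataFrame({"x": [i], "y": [i * i]}))
combined = pd.concat(pieces, ignore_index=True)

print(squares)
print(combined)
```

Either way the data is copied into a pandas object once, rather than once per loop iteration.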

jreback commented 4 years ago

+1 from me (though i will usually be plus on deprecating things generally)

yeah agree here, these are a foot gun

erfannariman commented 4 years ago

+1, it's better to have one method, which is pandas.concat. It's also more flexible, since it takes a list of DataFrames and has the option to concat over axis 0 / axis 1.

shoyer commented 4 years ago

Strong +1 from me!

Just look at all the (bad) answers to this StackOverflow question: https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe

jreback commented 4 years ago

we should also deprecate expansion indexing as well (which is an implicit append)

AlexKirko commented 4 years ago

+1 from me. There is really no reason to have this when we have concat available, especially because, IIRC, append works by calling concat, and I don't think append abstracts away enough to keep it.

achapkowski commented 4 years ago

How do you expand a dataframe by a single row without having to create a whole dataframe then?

TomAugspurger commented 4 years ago

I'd recommend thinking about why you need to expand by a single row. Can those updates be batched before adding to the DataFrame?

If you know the label you want to set it at, then you can use .loc[key] = ... to expand the index without having to create an intermediate. Otherwise you'll need to create a DataFrame and use concat.
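A small sketch of the .loc route, assuming a toy frame with made-up labels. Assigning to a label that does not exist yet enlarges the frame; with a default RangeIndex, len(df) happens to be the next unused label.

```python
import pandas as pd

# Frame with explicit string labels (illustrative data).
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# "r3" is not in the index yet, so this assignment adds a new row.
df.loc["r3"] = [5, 6]

# With a default RangeIndex, len(df2) is the next free integer label.
df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2.loc[len(df2)] = [5, 6]

print(df)
print(df2)
```

This still reallocates the underlying data on every new row, so it is a convenience for occasional use rather than a fast path.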

darindillon commented 4 years ago

Disagree. Appending a single row is useful functionality and very common. Yes, we understand it's inefficient; but as TomAugspurger himself said, this is the 10th most commonly referenced page on the help, so clearly lots of people have this use case of adding a single row to the end. We can tell ourselves we're removing the method to "encourage good design" but people still want this functionality, so they'll just use the workaround of creating a new DataFrame with a single row and concat'ing. That just requires the user to write even more code to still get the exact same performance hit, so how have we made anyone's life better?

taylor-schneider commented 4 years ago

Not being able to add rows to a data structure makes no sense. It's one thing to not add the inplace argument but to deprecate the feature is nuts.

achapkowski commented 4 years ago

@TomAugspurger using df.loc[] requires me to know the length of the DataFrame and write code like this:

df.loc[len(df)] = <new row>

This feels like overly complex syntax for an API that makes data operations simple. Internally df.append or series.append could just do what is shown above, but don't dirty up the user interface.

Why not take a page from lists? The append method is quick because it pre-allocates slots in advance. Modify the internals post DataFrame/Series creation to have 1000 empty hidden rows slotted and ready to receive new information. If/when the slots are filled, the DF/Series would expand outside the view of the user.

TomAugspurger commented 4 years ago

loc requires you to know the label you want to insert it at, not the length.

Why not take a page from lists? The append method is quick because it pre-allocates slots in advance.

You could perhaps suggest that to NumPy. I don't think it would work in practice given the NumPy data model.

achapkowski commented 4 years ago

Is Numpy deprecating the append method? If not, why deprecate it here?

Numpy doc: https://numpy.org/doc/stable/reference/generated/numpy.append.html
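For reference, numpy.append is not an in-place operation either. A short check with an illustrative array shows that it returns a new array and leaves the input untouched:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)  # builds and returns a new array

print(a)                        # the original array is unchanged
print(b)                        # the new array holds the extra element
print(np.shares_memory(a, b))   # the two arrays do not share a buffer
```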

MarcoGorelli commented 3 years ago

Shall we make this happen and get a deprecation warning in for 1.4 so these can be removed in 2.0? If there are no objections, I'll make a PR later (or anyone following along can, that's probably the fastest way to move the conversation forward)

achapkowski commented 3 years ago

@MarcoGorelli my question still stands, why is this being done?

darindillon commented 3 years ago

Yes, why are we doing this? It seems like we're removing a VERY popular feature (the 10th most visited help page according to the OP) just because that feature is slow. But if we remove the feature, people will still want this functionality so they'll just end up implementing it manually anyway, so how are we improving anything by removing this?

jreback commented 3 years ago

there is a ton of discussion pls read in full

this has long been planned as inplace operations make the code base inordinately complex and offer very little benefit

achapkowski commented 3 years ago

@jreback I don't see tons of discussion in this issue, please point me to the discussion so that I might be better informed. What I see is a community asking you not to do this.

MarcoGorelli commented 3 years ago

There's a long discussion here on deprecating inplace: #16529

But if we remove the feature, people will still want this functionality so they'll just end up implementing it manually anyway, so how are we improving anything by removing this?

I'd argue that this is still an improvement, because then it would be clearer to users that this is a slow feature - with the status quo, people are likely to think it's analogous to list.append
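To make the difference concrete, here is a small sketch with toy data: list.append mutates the existing list, while combining DataFrames produces a new object and leaves the inputs as they were.

```python
import pandas as pd

# list.append mutates the list in place, so every alias sees the change.
items = [1, 2]
same_list = items
items.append(3)
print(same_list)

# pd.concat builds a new DataFrame; the original frame keeps its length.
df = pd.DataFrame({"a": [1, 2]})
extra = pd.DataFrame({"a": [3]})
result = pd.concat([df, extra], ignore_index=True)
print(len(df), len(result))
```

Because each combination copies the existing rows, growing a frame one row at a time in a loop does work proportional to the square of the final length.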

What's your use-case for append? What does it do that you can't do with 1-2 lines of code that call concat? If you want to make a case for keeping it, please show a minimal example of where having append is a significant improvement

neinkeinkaffee commented 2 years ago

take

gesoos commented 2 years ago

Any chance we can get a note in the documentation on this?

MarcoGorelli commented 2 years ago

@gesoos agreed, there should probably be a .. deprecated:: note in the docstring - do you want to open a PR for this?

behrenhoff commented 2 years ago

I understand that appending can be inefficient, but we use it in non performance critical code, i.e. I don't care. Appending some artificial data (usually based on the data already in the DF) is a very common use-case for us. And I would like to mention why we are using append all over our code base instead of concat: concat loses all the attributes, while appending works just fine.

import pandas
df1 = pandas.DataFrame({"a": [1]})
df2 = pandas.DataFrame({"a": [2]})
df1.attrs["metadata-xy"] = 42

print(df1.append(df2).attrs)  # keeps the attrs of df1
print(pandas.concat([df1, df2]).attrs)  # no attrs in result

MarcoGorelli commented 2 years ago

Thanks @behrenhoff - maybe this is a case for concat preserving attrs then? Do you want to open a separate issue for that?
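In the meantime, one possible workaround is to copy the metadata over by hand. Whether pd.concat carries attrs through on its own may depend on the pandas version, so this sketch deliberately does not rely on it:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1]})
df2 = pd.DataFrame({"a": [2]})
df1.attrs["metadata-xy"] = 42

result = pd.concat([df1, df2], ignore_index=True)
# Copy the attrs explicitly instead of depending on concat to keep them.
result.attrs = dict(df1.attrs)

print(result.attrs)
```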

achapkowski commented 2 years ago

I don't think you should deprecate append until there is parity with concat.

ndevenish commented 2 years ago

Just had a warning for this appear all over our output. First thing I did was go to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html to find the reasons why and whether the suggested replacement was strictly equivalent.

So this appears to be deprecated in code but not the documentation? I can imagine other people also finding this discrepant.

There's a long discussion here on deprecating inplace: #16529

The word "append" only shows on that page as a link to this issue, so one can be forgiven for not finding it when looking for discussion on this deprecation.

MarcoGorelli commented 2 years ago

So this appears to be deprecated in code but not the documentation? I can imagine other people also finding this discrepant.

There's a PR open to add this to the docstring

ndevenish commented 2 years ago

Awesome. Maybe it'd be worth considering adding "Check deprecations are documented" to the release validation process. Most (but not all) of the 1.4.0 deprecation list have them.

wumpus commented 2 years ago

I wrote code that implements append() using a 2-level self-tuning set of accumulators that runs plenty fast without using too much memory. Of course it uses pd.concat() under the hood, and as you can see I ended up finding lots of small differences between the semantics of deprecated df.append() and pd.concat().

I don't think it's a good idea to force all of your end users to write this kind of code.

https://github.com/wumpus/pandas-appender

dkamacharov19 commented 2 years ago

Disagree. Appending a single row is useful functionality and very common. Yes, we understand it's inefficient; but as TomAugspurger himself said, this is the 10th most commonly referenced page on the help, so clearly lots of people have this use case of adding a single row to the end. We can tell ourselves we're removing the method to "encourage good design" but people still want this functionality, so they'll just use the workaround of creating a new DataFrame with a single row and concat'ing. That just requires the user to write even more code to still get the exact same performance hit, so how have we made anyone's life better?

As a pandas user and a novice coder, I don't understand why this comment is being overlooked even though it has been upvoted the most. The rationale behind this decision seems arbitrary and appears to ignore a significant contingent of the population that might be using the pandas library. I count myself as a user and would urge you to consider your user base that might not have efficiency in mind when using this function. I can assure you that when utilizing pandas and append, a significant portion of the population does not have computational efficiency in mind. If that was a primary concern, Python would likely not be the language of choice, let alone pandas or the append function. There does not appear to be a 1 to 1 replacement when using the concat function as a replacement for append and, as another user has already commented, I don't believe it should be deprecated until that is addressed.

TomAugspurger commented 2 years ago

The replacement is to build up a list or dictionary of records ahead of time, and then pass that to pandas.
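A rough sketch of what that can look like, with made-up field names: either keep a list of per-row dicts, or keep a dict of per-column lists, and hand it to the constructor at the end.

```python
import pandas as pd

incoming = [("a", 1), ("b", 2), ("c", 3)]  # stand-in for data arriving over time

# Option 1: a list of records, one dict per row.
records = []
for name, value in incoming:
    records.append({"name": name, "value": value})
df_from_records = pd.DataFrame(records)

# Option 2: a dict of columns, one list per column, grown in step.
columns = {"name": [], "value": []}
for name, value in incoming:
    columns["name"].append(name)
    columns["value"].append(value)
df_from_columns = pd.DataFrame(columns)

print(df_from_records)
print(df_from_columns)
```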

MarcoGorelli commented 2 years ago

I don't believe it should be deprecated until that is addressed.

It's not been removed yet, for now there's just the warning.

I've opened #45824 for the attrs limitation mentioned above

If people notice other limitations, they can open issues about them, and by the time append will have been removed, they'll have been addressed

dkamacharov19 commented 2 years ago

I understand it is not getting deprecated right away, and I also understand there's a better way to do this. Again, my comment was completely ignored for the sake of making an argument that there are, as the commenter I quoted stated, ways to encourage "good design". Why deprecate a perfectly usable function that is clearly popular with your user base? Not here to debate code with you as I don't code for a living. Just wanted to share an alternative viewpoint to consider. Sometimes the most efficient way is not always the right approach, despite what you might believe will encourage better coding. Why break a function that has widespread usage? Seems counterintuitive to me and again rather arbitrary.

seanbow commented 2 years ago

It took me way too long to figure out that I need to replace

df.append(series)

with

pd.concat([df, series.to_frame().T])

I agree that this functionality is very common and should be included, maybe by a name other than append. It's for a case where I really do need to append just one bar at a time and efficiency isn't very important.

edit: Ok, it's worse. I want to append a dict to the end of a dataframe as a row now and it's going to require some other hacky method when .append() was working just fine.
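For readers hitting the same two cases, here is one possible way to write them with concat. The frame and row contents are illustrative; wrapping the dict in a list makes pandas treat it as a single record.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A Series as a new row: turn it into a one-row frame first.
row_series = pd.Series({"a": 5, "b": 6})
df = pd.concat([df, row_series.to_frame().T], ignore_index=True)

# A dict as a new row: a one-element list of dicts becomes a one-row frame.
row_dict = {"a": 7, "b": 8}
df = pd.concat([df, pd.DataFrame([row_dict])], ignore_index=True)

print(df)
```

Passing ignore_index=True renumbers the rows; leave it out if the existing index labels matter.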

wumpus commented 2 years ago

I'd like to point out again that I have an efficient, scalable implementation of df.append:

https://github.com/wumpus/pandas-appender

CassWindred commented 2 years ago

This is a very frustrating decision. It is extremely common to have to append individual rows to a DataFrame, or individual elements to a Series, and the "intended" way of doing this is much harder to read, requires more lines of code and is far less intuitive. I am not very experienced with pandas, and now each time I need to add something to a DataFrame I pretty much have to look this up, whereas append() is very obvious.

Yes, it may be slower, but for my use case the effect is negligible, and impossible to batch into a single call, at least not without making the code much slower and harder to read.

I've been using pandas for a few bits and pieces over the last couple of years and 90% of the time I am using pandas entirely because it gives a bunch of tools and operations that make it much more painless to work with certain types of data, with easy and intuitive operations to do things that would otherwise take several lines of code to do with raw Python data structures. Very rarely does the efficiency and speed of these operations matter on any human-perceptible scale. I'm using pandas to make programming faster, not the program itself, and avoiding append() makes a very common operation an order of magnitude more painful.

Please reconsider this change, people need to append to DataFrames, and they won't stop doing so after the functionality is removed, they will just write more fragile and unintuitive code to do it instead.

ChayanBansal commented 2 years ago

Is the DatetimeIndex.append method also going to be deprecated?

TomAugspurger commented 2 years ago

Does anyone have any non-trivial examples that are worse off after this deprecation? I'm happy to help write docs on the transition. I have one semi realistic example at https://tomaugspurger.github.io/modern-4-performance.html, where you have a directory of CSV files to concatenate together.

The "bad" way, using append:

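import glob
import pandas as pd

files = glob.glob('weather/*.csv')
columns = ['station', 'date', 'tmpf', 'relh', 'sped', 'mslp',
           'p01i', 'vsby', 'gust_mph', 'skyc1', 'skyc2', 'skyc3']

# init empty DataFrame, like you might for a list
weather = pd.DataFrame(columns=columns)

for fp in files:
    # read_csv takes the column labels via `names`, not `columns`
    city = pd.read_csv(fp, names=columns)
    # append returns a new DataFrame, so the result has to be reassigned
    weather = weather.append(city)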

The "good" way, using concat

files = glob.glob('weather/*.csv')
weather_dfs = [pd.read_csv(fp, names=columns) for fp in files]
weather = pd.concat(weather_dfs)
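A self-contained version of the same idea, using in-memory text in place of files on disk so it can be run anywhere. The two-column layout is a simplified stand-in for the weather data above.

```python
import io
import pandas as pd

# Stand-ins for CSV files; real code would pass paths from glob.glob.
sources = [
    io.StringIO("station,tmpf\nA,50.0\nA,51.5\n"),
    io.StringIO("station,tmpf\nB,40.0\n"),
]

frames = [pd.read_csv(src) for src in sources]   # one frame per source
weather = pd.concat(frames, ignore_index=True)    # a single copy at the end

print(weather)
```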

That's dealing with DataFrame.append(another_dataframe). I gather that some / most of the difficulties expressed here are from workflows where you're appending a dictionary? Anyone able to share a workflow like that?

wumpus commented 2 years ago

I run parameter surveys over millions of combinations, which are represented as a dataframe. The answers come back over 10s of hours, and are appended one by one to an output dataframe.

TomAugspurger commented 2 years ago

@wumpus do you have a minimal example?

wumpus commented 2 years ago

The usecase is hidden in the guts of a middleware package https://github.com/wumpus/paramsurvey -- if you look at the example in the README, you can see what the user sees (results returned as a dataframe)

My code that does df.append() efficiently and scalably is https://github.com/wumpus/pandas-appender

seanbow commented 2 years ago

My use is in collecting financial data online, I get a dict every 5 minutes or so and append it to an existing data frame in memory to do analysis on. Waiting to collect more than one row at a time doesn't make any sense.
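One way this kind of streaming workflow could be handled is sketched below. RowBuffer is a hypothetical helper written for this illustration (it is not part of pandas and not the pandas-appender package linked above): it keeps incoming dicts in a plain list and only folds them into the DataFrame when the frame is actually requested.

```python
import pandas as pd


class RowBuffer:
    """Hypothetical helper: collect dict rows, build the DataFrame on demand."""

    def __init__(self):
        self._frame = pd.DataFrame()
        self._pending = []

    def add(self, row):
        # Cheap: appending to a Python list does not copy the frame.
        self._pending.append(row)

    def frame(self):
        # Fold pending rows into the frame only when it is needed.
        if self._pending:
            new_rows = pd.DataFrame(self._pending)
            if self._frame.empty:
                self._frame = new_rows
            else:
                self._frame = pd.concat(
                    [self._frame, new_rows], ignore_index=True
                )
            self._pending = []
        return self._frame


buf = RowBuffer()
buf.add({"price": 10.0, "volume": 100})
buf.add({"price": 10.5, "volume": 120})
print(buf.frame())

buf.add({"price": 10.2, "volume": 90})
print(len(buf.frame()))
```

If the frame is read after every single row, this does the same copying as a per-row concat; the saving comes from rows that arrive between reads.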

ppatheron commented 2 years ago

I use append in a client application where it is very niche - this application is running on production and now I have to update the code to use concat?

I am not an expert Python Programmer, but this way of using append is really useful in my use case, as I need to add the contents of a dictionary, to another dictionary which has a list. But concat does not work for that! I index the dictionary with the list, and then append the contents of the other dictionary into that list.

What will happen now? When will this deprecation happen? So not cool :/

MarcoGorelli commented 2 years ago

I get a dict every 5 minutes or so and append it to an existing data frame in memory to do analysis on

Isn't that relatively straightforward with concat though?

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6
>>> df.append({'a': 4, 'b': 7}, ignore_index=True)
   a  b
0  1  4
1  2  5
2  3  6
3  4  7
>>> pd.concat([df, pd.DataFrame({'a': 4, 'b': 7}, index=[3])])
   a  b
0  1  4
1  2  5
2  3  6
3  4  7

I need to add the contents of a dictionary, to another dictionary which has a list. But concat does not work for that!

Can you show a minimal reproducible example please?

When will this deprecation happen?

Version 2.0, I believe

So not cool :/

Please be constructive

ppatheron commented 2 years ago

@MarcoGorelli Apologies, not trying to be rude but I am a bit stressed.

So I loop through a DataFrame which contains multiple rows of customer data, which needs to be appended to a JSON/Dictionary object:

import json
from datetime import datetime

# payload_1 is the DataFrame of customer rows (not shown here)
ntwrk_src_data_load = {}

header_ntwrk = {
    "header": {
        "sender": "test",
        "receiver": "test",
        "model": "test",
        "messageVersion": "test",
        "messageId": "test",
        "type": "network",
        "creationDateAndTime": datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f')
    },
    "network": []
}

ntwrk_src_data_load.update(header_ntwrk)

for row in payload_1.itertuples(index=False):
    ntwrk_src_list_1 = {
        "creationDateTime": datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f'),
        "documentStatusCode": "test",
        "documentActionCode": "test",
        "lastUpdateDateTime": datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f'),
        "pickUpLocation": {
            "locationId": row[0]
        },
        "dropOffLocation": {
            "locationId": row[1]
        },
        "transportEquipmentTypeCode": {
            "value": row[2]
        },
        "freightCharacteristics": {
            "transitDuration": {
                "value": row[3]
            },
            "loadingDuration": {
                "value": row[4]
            }
        },
        "sourcingInformation": [
            {
                "sourcingMethod": "",
                "sourcingItem": {
                    "itemId": row[5].lstrip('0')
                },
                "sourcingDetails": {
                    "effectiveFromDate": row[6],
                    "effectiveUpToDate": row[7],
                    "priority": row[8],
                    "sourcingPercentage": row[9],
                    "majorThresholdShipQuantity": {
                        "value": row[10]
                    },
                    "minorThresholdShipQuantity": {
                        "value": row[11]
                    }
                }
            }
        ]

    }

    ntwrk_src_data_load['network'].append(ntwrk_src_list_1)
json_data_1 = json.dumps(ntwrk_src_data_load)

This adds all the contents of the rows which I require to my dictionary, which is then dumped as a JSON format.

I send this JSON file via an API to the client. How would I concat the looped rows into the specific list inside the dictionary as above?

MarcoGorelli commented 2 years ago

I suspect you want something like ntwrk_src_data_load['network'] = pd.concat([ntwrk_src_data_load['network'], pd.DataFrame(ntwrk_src_list_1, index=[0])]) (replace 0 with whatever you want the index of the new row to be), but without a minimal reproducible example (please see here for how to write one) it's hard to say more

ppatheron commented 2 years ago

I've tried what you mentioned, but receive a TypeError:

TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid

I'll try my best with the minimal code:

import pandas as pd
from hdbcli import dbapi
import numpy as np
import hana_ml.dataframe as dataframe
import time
import math
import logging
import threading
from datetime import datetime

connection = dbapi.connect(address='<<IP>>', port='<<PORT>>', user='<<USER>>', password='<<PASSWORD>>')
cursor = connection.cursor()

# This brings in all the required fields from the DB connection above for a specific table
df = pd.read_sql('''SQL STATEMENT''', connection)

ntwrk_src_data_load = {}

header_ntwrk = {
    "header": {
        "sender": "",
        "receiver": "",
        "model": "",
        "messageVersion": "",
        "messageId": "",
        "type": "network",
        "creationDateAndTime": date)
    },
    "network": []
}

The "network" object is the list that I have to populate from the dictionary below.

The contents of the above df is then looped through, and each row is indexed into the JSON/Dictionary structure that the customer requires it in.

ntwrk_src_data_load_1.update(header_ntwrk)

for row in payload_1.itertuples(index=False):
    ntwrk_src_list_1 = {
        "rows_to_be_populated": row[0]
    }

    ntwrk_src_data_load_1['network'].append(ntwrk_src_list_1)
json_data_1 = json.dumps(ntwrk_src_data_load_1)

The "ntwrk_src_list_1" is the object that returns multiple "lists" that have to be inserted into the "ntwrk_src_data_load_1" object. So essentially, each row in the payload has its own structure inside the dictionary/JSON file.

MarcoGorelli commented 2 years ago

That's neither minimal nor reproducible, sorry - if you want support for your specific use-case, please read through and follow this https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports and then try again

ppatheron commented 2 years ago

Let's try again:


import json
import pandas as pd

df = pd.DataFrame({'name' : ['Paul', 'Jessie'],
                  'surname': ['Lessor', 'Spolander'],
                  'address': ['61 Gravel Road', '2 Pointer Streer']})

payload_for_loading = {}

header_for_json = {
    "header": {
        "sender": "System",
        "date": "2022-03-14"
    },
    "client": []
}

This "client" list needs to be populated with the DataFrame created above.

payload_for_loading.update(header_for_json)

for row in df.itertuples(index=False):
    payload_dict = {
        "client_name": row[0],
        "client_surname": row[1],
        "client_address": row[2]
    }

    payload_for_loading['client'].append(payload_dict)
json_payload = json.dumps(payload_for_loading)

This code produces the result I require. However, how would the .append call be replaced with concat?

MarcoGorelli commented 2 years ago

Looks like payload_for_loading['client'] is a list? In which case, your code will continue working as usual

It's DataFrame.append that's being deprecated, not append of the Python built-in list
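To illustrate the distinction with a cut-down version of the example above: the loop calls list.append, which is plain Python and unaffected by the pandas deprecation. As a possible alternative (an assumption about what the payload needs, not something requested in the thread), pandas can also build the same list of dicts in one call with to_dict(orient="records").

```python
import json
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "Jessie"],
                   "surname": ["Lessor", "Spolander"]})

payload = {"header": {"sender": "System"}, "client": []}

# list.append: built-in Python, mutates the list in place, not deprecated.
for row in df.itertuples(index=False):
    payload["client"].append({"client_name": row[0],
                              "client_surname": row[1]})

# Possible alternative: rename the columns, then export one dict per row.
renamed = df.rename(columns={"name": "client_name",
                             "surname": "client_surname"})
clients = renamed.to_dict(orient="records")

print(json.dumps(payload))
print(clients == payload["client"])
```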

ppatheron commented 2 years ago

Perfect - apologies for any confusion, and thank you so much for your assistance. I noticed that my logging is using the DataFrame.append, and not my payload code. I will still have to update my code to use concat but I've already tested that and it's working.