pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Deprecate Series / DataFrame.append #35407

Closed TomAugspurger closed 2 years ago

TomAugspurger commented 4 years ago

I think that we should deprecate Series.append and DataFrame.append. They're making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place. The data for the index and values needs to be copied to create the result.

These are also apparently popular methods. DataFrame.append is around the 10th most visited page in our API docs.

Unless I'm mistaken, users are always better off building up a list of values and passing them to the constructor, or building up a list of NDFrames followed by a single concat.
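
A minimal sketch of both patterns (toy data, illustrative names only):

import pandas as pd

# Pattern 1: build up a list of plain records, then construct the frame once
rows = [{"a": i, "b": i ** 2} for i in range(3)]
df = pd.DataFrame(rows)

# Pattern 2: build up a list of frames, then do a single concat
frames = [pd.DataFrame({"a": [i], "b": [i ** 2]}) for i in range(3)]
result = pd.concat(frames, ignore_index=True)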

wumpus commented 2 years ago

@MarcoGorelli I'm not sure what you think is constructive, but I've mentioned repeatedly that I have a scalable wrapper that preserves df.append semantics, while saving everyone from having to independently write the same code. Am I not being constructive?

MarcoGorelli commented 2 years ago

Hey @wumpus ,

My "please be constructive" was in response to the comment "So not cool :/" and wasn't directed at you

Thanks for your input and for the link to your wrapper

jreback commented 2 years ago

@wumpus that was in reference to another conversation (not you)

your wrapper is not incorporated into pandas / likely won't be in any event

wumpus commented 2 years ago

I agree that I appear to be wasting my time, despite having a solution to the root problem. What am I doing wrong?

jreback commented 2 years ago

@wumpus what you wrote might be fine for your use, but it's not going to be possible to do this lazy type of evaluation in a reliable way in pandas itself

sure, it could be done, but it would lead to a large number of edge cases and a very brittle / complex solution

wumpus commented 2 years ago

My code is fully lazy. I agree that there are probably edge cases -- easy to see because append() and concat() are wildly different.

lazypandabear commented 2 years ago

I wish not to deprecate Series / DataFrame.append. There are scenarios in my code that I could not handle using pd.concat. For example, I created a list of records with missing series, and identifying them required a groupby. I then created a list of those records, iterated over each one, and appended the missing series using df.append. I cannot find a way to do this with pd.concat.

MarcoGorelli commented 2 years ago

Usual response - please provide a minimal reproducible example

rahilbhansali commented 2 years ago

My two cents on concat vs append (since I use it quite extensively in my algotrading platform):

  1. Append has been incredibly useful for me, and I've used it in probably 12-15 places in my codebase. I use dataframes to load price data into memory for fast compute and at times need to append new rows (e.g. orders placed) to a dataframe. Given the size, I use DataFrames almost as a replacement for lists since they're far more nimble.

  2. Append - until now - allowed me to quickly and easily add a dictionary to an existing dataframe. Concat now requires me to create a dataframe with 1 or more rows and then concat it with my existing dataframe, vs. simply adding a dictionary to the existing dataframe.

Concat seems to be a vertical merge of two dataframes (extending rows) vs. merge, which horizontally merges two dataframes (i.e. extends columns based on common keys). If anything, concat intuitively does not suggest appending, so here's what I propose:

  1. Deprecate Append if it's indeed slower (but...)
  2. Allow concat to add dictionaries to the dataframe (along with support for arrays). Also, instead of doing pd.concat - why can't we simply do df.concat([df1, df2]), which adds data from df1 and df2 to df?
  3. Rename Concat to Append - honestly, append is a more intuitive word used across the board - I can choose to append from another df or rows directly.

As for an example, previously I used this:

self.df_balances = self.df_balances.append(trade_date_balance.to_dict(), ignore_index=True)

Now it's replaced with (a little annoying):

new_balances_row_df = pd.DataFrame(trade_date_balance.to_dict(), index=[0])
self.df_balances = pd.concat([self.df_balances, new_balances_row_df], ignore_index=True)

For context - df_balances is a dataframe I maintain to save daily balances during my backtesting engine runs which allows me to compute funds available for investing. As I loop through my backtesting dates, I keep inserting this into the dataframe at the end of the day so I can quickly access it later when needed. Eventually, I output the df into a csv so that I can manually verify there is no calculation or settlement error (from a funds perspective).

I do use .loc to make updates - however, it isn't intuitive because you need to know the index or the label - which honestly doesn't matter when you append - and to my knowledge, .loc doesn't support adding a dictionary.
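
(That said, row insertion via .loc does work with a Series - a minimal sketch, assuming a default RangeIndex so len(df) is a fresh label, with hypothetical column names:)

import pandas as pd

df = pd.DataFrame({"symbol": ["MSFT"], "qty": [5]})  # hypothetical orders frame
row = {"symbol": "AAPL", "qty": 10}
df.loc[len(df)] = pd.Series(row)  # a Series assigned to a new label aligns on column names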

erfannariman commented 2 years ago
  1. Allow concat to add dictionaries to the dataframe (along with support for arrays). Also instead of doing pd.concat - why can't we simply do df.concat([df1, df2]) which adds data from df1 and df2 to df?

I think allowing concat to add dictionaries is a fair point, since it has been mentioned multiple times in this topic. Not sure about df.concat([df1, df2]); it's just as easy to use pd.concat([df, df1, df2]).

rahilbhansali commented 2 years ago

@erfannariman - agreed - it's not hard. But merge also uses the same lingo - df.merge(df1) - and since concat is just a merger of rows from two dfs (in some sense), we might as well stick to the same writing style as merge?

Not a big one - but was a comment for consistency.

marc-moreaux commented 2 years ago

I feel like code readability is so much better with append than with concat. I understand that append is not in-place and that it is less efficient than concat.

Even so: append feels more pythonic to me than concat does.

I often use it with single-row dictionaries, Series, or DataFrames, and I feel that my code is more readable this way... Would it make sense to get new appends like:

MarcoGorelli commented 2 years ago

Would it make sense to get new appends like:

-1 on adding even more methods to the API, and very confident that there'd be broad consensus on this among pandas devs

Examples of how to do these, though, would be good candidates for the docs Tom said he'd help write

I understand that append is not in-place and that it is less efficient than concat.

If you're just appending a single row, there shouldn't be much difference in efficiency. If you're appending multiple, then that's where append encourages inefficient code, which is why it's been deprecated. Here's an example from the awesome library ArviZ where the append deprecation "forced" them to write better code: https://github.com/arviz-devs/arviz/pull/1973/files

achapkowski commented 2 years ago

@MarcoGorelli so the question is: is the deprecation being reconsidered?

MarcoGorelli commented 2 years ago

No, what makes you think that?

There can be docs to help the transition (which you'd be welcome to help out with, see the contributing guide if you're interested)

achapkowski commented 2 years ago

@MarcoGorelli clearly the community is saying this is bad. What will it take to stop this?

MarcoGorelli commented 2 years ago

What will it take to stop this?

I'd suggest starting with a minimal reproducible example indicating why you think append needs to stay

achapkowski commented 2 years ago

Does this meet the needs of a simple sample, @MarcoGorelli?

append is simple to understand; everyone knows list.append. pd.concat is more like list.extend. Though for pushing lots of data, extend is better on a list; for one row, append is fine. The dev team is pushing everyone to the extend-like method on a list.

What we have and should stay:

import pandas as pd
data = [{'simple' : 'example 1'}, {'simple' : 'example 2'}, {'simple' : 'example 3'}]
pd.DataFrame(data).append({'simple' : "example 4"}, ignore_index=True)

Now let's append with concat:

Example 1: Error

df = pd.DataFrame(data)
pd.concat([df, {'simple' : "example 4"}])
Traceback (most recent call last):
  Python Shell, prompt 18, line 1
    # Used internally for debug sandbox under external interpreter
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-dev\Lib\site-packages\pandas\core\reshape\concat.py", line 295, in concat
    sort=sort,
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-dev\Lib\site-packages\pandas\core\reshape\concat.py", line 370, in __init__
    raise TypeError(msg)
builtins.TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid

Example 2: Error

df = pd.DataFrame(data)
df.concat([{'simple' : "example 4"}]) # method doesn't exist

Example 3: I need to create a whole new dataframe for 1 row

df = pd.DataFrame(data)
df1 = pd.DataFrame(data=[{'simple' : 'example 4'}])
pd.concat([df, df1]) # no error finally

Output:

pd.concat([df, df1])
      simple
0  example 1
1  example 2
2  example 3
0  example 4

A bit of a note on example 3: pd.concat is a function within pandas, not on the object, whereas append is right on the DataFrame. We have the overhead of creating a dataframe for 1 row. This seems like overkill. Plus, now I have to reset my index with concat.

So if I were to break it down:

  1. append exists on the dataframe and is a common function used throughout the python ecosystem
  2. the pd.concat method doesn't exist on the DataFrame, which means a user has to search for the function.
  3. Adding a single row requires you to create a dataframe; users cannot just push a dictionary.
  4. Both methods have a place in the API; keep them both and instruct users when one is better than the other.
  5. pd.concat makes users manage the indexes themselves; append automatically increments to the next index.
MarcoGorelli commented 2 years ago

You can do

>>> pd.concat([pd.DataFrame(data), pd.DataFrame({'simple': 'example 4'}, index=[len(data)])])
      simple
0  example 1
1  example 2
2  example 3
3  example 4

which doesn't seem more complicated than using append

>>> pd.DataFrame(data).append({'simple' : "example 4"}, ignore_index=True)
<stdin>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
      simple
0  example 1
1  example 2
2  example 3
3  example 4

append is simple to understand; everyone knows list.append

Yes, that's exactly the issue - to quote the original post: "They're making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place. The data for the index and values needs to be copied to create the result."

phofl commented 2 years ago

We have overhead for 1 row creating a dataframe. This seems like overkill. Plus now I have to reset my index with concat.

  1. Creating a DataFrame is exactly what happens under the hood -> no overhead with concat
  2. You can simply set ignore_index=True for concat, no need to call reset_index (see the example below)
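
With the df and df1 from the example above:

>>> pd.concat([df, df1], ignore_index=True)
      simple
0  example 1
1  example 2
2  example 3
3  example 4
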
achapkowski commented 2 years ago

I wasn't looking for solutions for my example... I knew this is what would happen...

Look, I know there are ways around this, but why not just make append do what the concat method does in the background? Keep both: you keep your functionality, and the community gets to keep a well-known function name.

It seems like a solid ask and a fair compromise.

MarcoGorelli commented 2 years ago

but why not just make append do what the concat method does in the background?

It already does:

https://github.com/pandas-dev/pandas/blob/3b163de02f666a2342e18468cba7d6c286f526bf/pandas/core/frame.py#L9253-L9310

The issue isn't for when you're appending a single row, but for when you're appending many (e.g. in a loop) - in that case, having append encourages bad and inefficient code

Example:

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(range(10_000))
   ...: dfs = [df] * 100

In [4]: %%timeit
   ...: df_result = dfs[0]
   ...: for df in dfs[1:]:
   ...:    df_result = df_result.append(df)
   ...: 
1.39 s ± 51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %%timeit
   ...: df_result = pd.concat(dfs)
   ...: 
   ...: 
3.6 ms ± 76.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
achapkowski commented 2 years ago

Then just state the purpose of the method in the docs. You are shoehorning all use cases into one method when two methods are fine.

martin-martin commented 2 years ago

I haven't seen the impact on chaining-style pandas code mentioned in the discussion above (maybe it's discussed elsewhere?), so here's what I'm wondering:

Writing Chaining Style pandas Code

Deprecating pandas.DataFrame.append() will remove a seemingly intuitive possibility to add a row (or rows) to a data frame while writing pandas code in a chained style:

fruits = pd.DataFrame(
    {
    "name": ["apple", "pear", "avocado"],
    "image": ["🍏", "🍐", "🥑"]
    }
)

veggies = pd.DataFrame(
    {
    "name": ["tomato", "carrot", "avocado"],
    "image": ["🍅", "🥕", "🥑"]
    }
)

both_fruit_and_vegetable = (
    fruits
    .append({"name": "tomato", "image": "🍅"}, ignore_index=True)  # Forgot the tomato is a fruit, too!
    .merge(veggies)
    # ... Add other chained operations
)

print(both_fruit_and_vegetable)

# OUTPUT:
#
#       name image
# 0  avocado     🥑
# 1   tomato     🍅

I'm not sure how often you'd want to add rows to a data frame like this, and I understand you could achieve the same using pandas.DataFrame.merge(), e.g. in this minimal example:

both_fruit_and_vegetable = (
    fruits
    .merge(pd.DataFrame({"name": ["tomato"], "image": ["🍅"]}), how="outer")
    .merge(veggies)
)

I'm showing the merge functionality as an example method also because pandas has the instance-level method pandas.DataFrame.merge() as a wrapper for the lower-level pandas.merge().

I thought that this wrapper exists to make chained-style pandas possible for merge operations (and at least a few others think so too), but please correct me if I'm wrong.

Alternatives to Using .append() for Chaining Syntax

So I'm wondering whether there's a suggested alternative for adding a row to a data frame when writing chained style pandas code.

Is the solution to use pandas.DataFrame.merge() with appropriate parameters, or will non-SQL-wizards run into unexpected join behavior that's harder to wrap your head around than a seemingly more straightforward append/concat style concatenation?

Or could it be useful to add an instance-level pandas.DataFrame.concat() method that uses pandas.concat() internally, but opens up the opportunity to chain the operation to other operations using a familiar syntax?

Thanks for your thoughts and work!

MarcoGorelli commented 2 years ago

First of all, that's a great example, thanks!

Though can't concat fit into the chain?

In [7]: both_fruit_and_vegetable = (
   ...:     pd.concat([fruits, pd.DataFrame({'name': ['tomato'], 'image': ["🍅"]})], ignore_index=True)
   ...:     .merge(veggies)
   ...:     # ... Add other chained operations
   ...: )

In [8]: both_fruit_and_vegetable
Out[8]: 
      name image
0  avocado     🥑
1   tomato     🍅
martin-martin commented 2 years ago

Lol, thanks 😋

Your example works in this specific case, where .append() is the first thing I do. But it doesn't work when I'd want to concat somewhere lower down in the chain, e.g.:

both_fruit_and_vegetable = (
    fruits
    .merge(veggies)    
    .append(pd.DataFrame({'name': ['tomato'], 'image': ["🍅"]}), ignore_index=True)
 )

I can't chain pd.concat() onto a previous chain link, which is possible with df.append()

shoyer commented 2 years ago

You can also use .pipe() for method chaining with arbitrary functions.

MarcoGorelli commented 2 years ago

Sure but you can still fit concat into the chain:

both_fruit_and_vegetable = pd.concat(
    [fruits.merge(veggies), pd.DataFrame({"name": ["tomato"], "image": ["🍅"]})],
    ignore_index=True,
)

Or indeed, as suggested above:

fruits.merge(veggies).pipe(
    lambda df: pd.concat(
        [df, pd.DataFrame({"name": ["tomato"], "image": ["🍅"]})], ignore_index=True
    )
)

If you just need to append a single row, then such workarounds should be fine. If you need to append many rows inside a loop, then not having append will at least not encourage inefficient code

dkamacharov19 commented 2 years ago

Sure but you can still fit concat into the chain: [...] If you just need to append a single row, then such workarounds should be fine. If you need to append many rows inside a loop, then not having append will at least not encourage inefficient code

Why are the goalposts constantly being moved here? You requested examples, and they have been provided, as demonstrated here. And the answer is to use a workaround - why, exactly? If append works as intended, shouldn't that be the goal? I have pretty much accepted that the powers that be are not going to listen to feedback, as you have convinced yourselves that a problem that doesn't need fixing - or arguably doesn't even exist - needs to be addressed. The solution to bad code is not to remove a tool that has been misused. I'm simply trying to point out that your reasons are misguided, despite your good intentions.

wumpus commented 2 years ago

pd.concat with a single row at a time is the performance problem.

And as a reminder, I have a demonstration of a high-performance append.

MarcoGorelli commented 2 years ago

DataFrame.append makes an analogy to list.append, but it's a poor analogy and it encourages inefficient code.

The purpose of asking for minimal reproducible examples was to see if anyone had a use-case for which there wasn't a simple workaround.

You're all being listened to, I've read every post in this thread. The arguments for keeping append seem to be:

None of these strike me as strong enough reasons to keep append:

And as a reminder, I have a demonstration of a high-performance append.

You've already advertised your package here 3 times, please stop

wumpus commented 2 years ago

I was hoping to successfully talk to "the powers that be" about this change. Looking at the repo owners I see that you are the person I wanted to talk to! Glad I was able to get my code in front of you for a review.

behrenhoff commented 2 years ago

Phew, so many new messages to this topic.

First of all, for me there are two points: it is such a common function that it breaks A LOT of code. This is really bad even if the append pattern is a bad one. Does it really hurt so much to keep it? It costs developers a lot of time to remove all the append calls. I love backward compatibility and I think breaking it for no good reason other than "we want to force developers to do it differently" is a very bad idea.

In our code base, we finally managed to remove all append calls, usually replacing the whole function with better code. When we started with Pandas and didn't know how to work efficiently with it, a common pattern was "manual groupbys", i.e. looping over df.some_column.unique(), applying a selection like df_group = df[df.some_column == value], doing the calculation on the group, and appending to a result. Very bad indeed. My whole point is that this doesn't improve at all when you only replace the append call with concat. Rewriting these loops to collect the DFs in a list and call concat at the end gets around the deprecation but doesn't fix the overall style of the function. And sometimes it is even more difficult to fix old code where an experienced pandas developer would only think "WTF". So fixing all these things is a lot of work for no good reason (the old WTF code was tested and working correctly).
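
(For illustration, a toy sketch of that antipattern next to the idiomatic version, with made-up column names:)

import pandas as pd

df = pd.DataFrame({"some_column": ["x", "x", "y"], "value": [1, 2, 3]})

# the "manual groupby" antipattern: select each group, compute, append
result = pd.DataFrame()
for key in df["some_column"].unique():
    df_group = df[df["some_column"] == key]
    result = result.append(
        {"some_column": key, "mean": df_group["value"].mean()}, ignore_index=True
    )  # deprecated, and copies the whole result on every iteration

# the idiomatic version: one groupby, no appends at all
result = df.groupby("some_column")["value"].mean().reset_index(name="mean")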

@MarcoGorelli wrote:

Thanks @behrenhoff - maybe this is a case for concat preserving attrs then? Do you want to open a separate issue for that?

Actually, I am in favor of getting rid of as many attrs in our code base as possible. I don't like them at all; they were getting used all over the place, so testing became difficult (when every function expects 10 different attrs to exist, you are in hell and your functions become less reusable). Therefore we got rid of a lot of attrs. And concat discourages attrs. But yeah, that was another bit of work. So my code base is now free of attrs and free of append. Work done.

Look, I understand getting rid of some old functions is sometimes a good idea but I really really don't like removing such a common function.

None of these strike me as strong enough reasons to keep append:

the workaround above are simple enough and also legible

Simple enough? That's only true if you don't fix the whole thing. If you just replace every append call with a concat call, you win absolutely nothing.

plenty of people do care about pandas performance

So? I don't understand this argument. df.groupby(col).apply is slow as well and not removed. Also: does append affect other functions? Is concat for two dfs faster than append? No? Only if you do multiple appends? But then you need to modify your algorithm (for example, collecting separate DFs in a list). Are there really cases where append is the problem? My point is: when you replace it with concat, it won't have an impact on performance unless you change the whole logic. I DO care about performance in Pandas as well - but ONLY in the areas that affect me. Building/appending to a DF is not on that list at all. If you do care about the append aspect, use a better solution for that purpose. (A bit of whataboutism: a lot of groupby functions are slow as hell when there are many groups; that's where I care.)

I debated even clicking the reply button, since the deprecation is already in and this post won't change anything - but I feel really strongly about "keeping compatibility". I want to be able to update pandas without worrying too much.

By the way: how does concat improve this code:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = total_df.append(df).drop_duplicates()

Yes, it is easy to replace:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = pd.concat([total_df, df]).drop_duplicates()

But the performance gain is 0.

Note that this doesn't work (too much RAM usage) - so you cannot blindly rewrite all df.append to use a list and concat at the end:

dfs = []
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    dfs.append(df)
total_df = pd.concat(dfs).drop_duplicates()

Note that append is orders of magnitude faster than read_csv in this example. No performance impact at all. Just work to remove the append calls. (and yes, our real code uses a slightly smarter algorithm)

Having seen the examples in this thread, I still would not call append a strong code smell in all cases. It's a question of priorities - compatibility vs. trying to enforce a better style. Especially as a new Pandas user, you want to append to your toy DF. This should - in my opinion - be an easy task. The append is only a performance problem if you do it over and over again, not in the general case where you only append one DF to another. That's a very big difference.

So at the end a TLDR:

wumpus commented 2 years ago

There's a standard database algorithm to speed up appending single rows at a time to a database; that's what pandas-appender uses. It relieves Pandas users from having to make smart changes.

In 2010 I had a 30 petabyte homegrown NoSQL database using this algorithm at my search engine startup.

jreback commented 2 years ago

@behrenhoff

By the way: how does concat improve this code:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = total_df.append(df).drop_duplicates()

Yes, it is easy to replace:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = pd.concat([total_df, df]).drop_duplicates()

this is exactly the reason append is super problematic - we have an entire doc note (that I guess no one reads) explaining that you are doing an exponential copy here (no kidding you run out of RAM)

so you have proved the point of why append is a terrible idea - it's not about readability, but about how easy it is to fall into traps that are non-obvious at first glance

wumpus commented 2 years ago

If only there was a well-known algorithm which was not an exponential copy.

behrenhoff commented 2 years ago

this is exactly the reason append is super problematic - we have an entire doc note (that I guess no one reads) explaining that you are doing an exponential copy here (no kidding you run out of RAM)

You did not read, or did not understand, what I was saying. The version with append is the one that WORKS; the one with concat at the end runs into memory issues (because the small drop_duplicates in the loop is what fixes the problem, and it cannot be moved out).

And yes, you can be smarter, for example ((file1 + file2).drop_dups + (file3 + file4).drop_dups).drop_dups or similar - where + can be concat or append, it doesn't matter. I was just proving the point that the suggested way - "collect all DFs in a list and concat them all at the end" - does not always work.

MarcoGorelli commented 2 years ago

Thanks @behrenhoff , that's a nice example - though can't you still batch the concats? Say, read 10 files at a time, concat them, drop duplicates, repeat...
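
Something like this minimal sketch, say (assuming a hypothetical batch size of 10):

from glob import glob

import pandas as pd

total_df = pd.DataFrame()
files = sorted(glob("*.csv"))
for i in range(0, len(files), 10):
    batch = [pd.read_csv(f) for f in files[i:i + 10]]
    # one concat per batch keeps memory bounded, as in the original loop,
    # while doing far fewer copies than one concat per file
    total_df = pd.concat([total_df, *batch]).drop_duplicates()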


This seems like a perfect summary of the issue anyway:

it's not about readability, but about how easy it is to fall into traps that are non-obvious at first glance


At some point we should lock the issue - this is taking a lot of attention away from a lot of people. There have been off-topic comments, no compelling use-case for keeping DataFrame.append, and strong agreement among pandas devs (especially those who have been around the longest)

behrenhoff commented 2 years ago

Say, read 10 files at a time, concat them, drop duplicates, repeat...

Yes, that would work. So would 1 million other solutions. In practice, I could even exploit more about the date ordering inside the files (all the files here have a rather long overlapping history, but newer files can overwrite (fix) data in older files, so it is of course a drop_dups with a subset and keep=last). My point is: this is a non-issue because the operation is done once per 6 months or so; the daily operation just adds exactly one file. There is no point in optimizing this further as long as it works. That is the whole point I was trying to make. You force people to optimize / change code where the old code just works and there is no need to modify it. And the real gains in this example are not in append vs concat, but in exploiting knowledge of the input files and reading them in a different order or in groups.

Note that I am not saying this is a use case that can only be done with append. I am saying that removing a common feature is unnecessary work imposed on many people, and that you don't get performance gains for free by only replacing append with concat (you need to do more).

Anyway, end of discussion for me. I already did the work and got rid of all my appends.

I just fear that many people will not upgrade if their code breaks. You are also making it harder for new users. append is a good and common English word; concat is not - at least I can't find it in a dictionary (there is concatenate, but it is a word that a lot fewer people know - this might not be a problem for native English speakers though). I would always search for "append", not for "concat", if I didn't know the proper function name.

PolarNick239 commented 2 years ago

Hi, here is a minimal reproducer that was totally broken:

Before:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(a.append(row))
# Output:
#    A    B
#0  1  2.0
#0  3  NaN

After:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(pd.concat([a, row]))
# Output:
#     A    B    0
#0  1.0  2.0  NaN
#A  NaN  NaN  3.0

Also, please note that if you add a deprecation warning to such a popular method - one that is used widely and called many times per second - the message will be spammed a lot, leading to much bigger overhead than you have with the allocations and memory copying. So it is beneficial to print such a message only on the first call.
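
(For what it's worth, the standard library's warnings filters can already limit or silence the repetition while migrating - a minimal sketch:)

import warnings

# show each FutureWarning only once, or use "ignore" to silence them
# entirely while code is being migrated off append
warnings.filterwarnings("once", category=FutureWarning)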

phofl commented 2 years ago

What are you trying to do? It would be way more efficient to call

pd.concat([a, b], ignore_index=True)

Edit: Or was it on purpose to put A into the Index instead of as a column?

PolarNick239 commented 2 years ago

I know - this is just an illustration. I was iterating over rows and, if a row was OK, adding it to another table. I believe there are much better ways via masking and concatenation that take such masks into account, but I wanted to keep the code as simple as possible.

phofl commented 2 years ago

Thanks for your response. It is important for us to see use cases that cannot be done more efficiently in another way. You are right - checking data can be done way more efficiently via masking and then concatenating the result.

PolarNick239 commented 2 years ago

How can I concat such a row to another table a (with a superset of the row's column names) in that case?

MarcoGorelli commented 2 years ago

with

pd.concat([a, row.to_frame().T], ignore_index=True)
phofl commented 2 years ago

You can simply do:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": [3, 4]})

result = pd.concat([a, b.loc[b["A"] > 3]], ignore_index=True)

Just change the "greater than 3" to a condition that suits your needs. This avoids iterating over the rows. If you have to iterate for some reason, you can use the example from @MarcoGorelli

PolarNick239 commented 2 years ago

Not every condition and not all logic can be expressed readably in such a single-line expression.

For people who, like me, just want to get rid of the warnings:

import pandas as pd

def pandas_append(df, row, ignore_index=False):
    # drop-in stand-in for the deprecated df.append(row, ignore_index=...)
    if isinstance(row, pd.DataFrame):
        result = pd.concat([df, row], ignore_index=ignore_index)
    elif isinstance(row, pd.Series):
        # a Series is treated as a single row, as append did
        result = pd.concat([df, row.to_frame().T], ignore_index=ignore_index)
    elif isinstance(row, dict):
        result = pd.concat(
            [df, pd.DataFrame(row, index=[0], columns=df.columns)],
            ignore_index=ignore_index,
        )
    else:
        raise RuntimeError("pandas_append: unsupported row type - {}".format(type(row)))
    return result
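
Usage, for example:

df = pd.DataFrame({"A": [1], "B": [2]})
df = pandas_append(df, {"A": 3, "B": 4}, ignore_index=True)
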
wstomv commented 2 years ago

Here is a use case for DataFrame.append that I think makes sense, and for which it took me way too long to figure out how to replace it with pandas.concat. (Do note that I am not a seasoned pandas user.)

I have a data frame with numeric values, such as

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

and I append a single row with all the column sums

totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)

Simple enough. Here are the values of df, totals, and df_append

>>> df
   A  B
0  1  2
1  3  4

>>> totals
A    4
B    6
Name: totals, dtype: int64

>>> df_append
        A  B
0       1  2
1       3  4
totals  4  6

Now, using pd.concat naively:

df_concat_bad = pd.concat([df, totals])

which produces

>>> df_concat_bad
     A    B    0
0  1.0  2.0  NaN
1  3.0  4.0  NaN
A  NaN  NaN  4.0
B  NaN  NaN  6.0

Apparently, with df.append the Series object got interpreted as a row, but with pd.concat it got interpreted as a column. You cannot fix this with something like axis=1, because that would add the totals as a column.

Fortunately, the implementation of DataFrame.append is quoted in a comment above, and from it one can glean the solution:

df_concat_good = pd.concat([df, totals.to_frame().T])

which yields the desired

>>> df_concat_good
        A  B
0       1  2
1       3  4
totals  4  6

I think users need to be aware of such subtleties. I also posted this on StackOverflow.

MarcoGorelli commented 2 years ago

This was brought up in https://github.com/pandas-dev/pandas/issues/35407#issuecomment-1092892819 , and some other comments in this thread, and would/should be part of the transition docs (see https://github.com/pandas-dev/pandas/issues/46825)

javiertognarelli commented 2 years ago

Worst idea I've seen - why complicate something so easy? I think it's better to have more options/ways to do something than just one strict way. DataFrame.append() made it very easy for newbies to add data to a dataframe.