mailing list - Githubissues

rdp commented 10 years ago

Hey, I didn't see an easy way to 'ask a question' or 'suggest a feature' might be nice to have a mailing list of some type for that purpose created. Cheers! -roger- [the feedback was "dang, I really wanted it to at least be an option to automatically create new partitions, what we want to have is basically oracle's interval partitioning but for postgres"] Thanks!

keithf4 commented 10 years ago

Github's Issues section is where you can ask questions or suggest features. It's been working well for me so far and keeps things centralized.

Not sure I follow your feedback. This extension does automatically manage the creation of new partitions for time & serial intervals.

rdp commented 10 years ago

OK you may perhaps want to mention in the readme something like "feedback? post use github issues" or something like that :)

re: my other feedback (thanks for your responses BTW)

the confusion was from this page-- http://pgxn.org/dist/pg_partman/doc/pg_partman.html (I guess from the pg_partman.md file...)

"If you attempt to insert data into a partition set that contains data for a partition that does not exist, or is not covered by the partitioning trigger, that data will be placed into the set's parent table."

Then the README suggested the use of run_maintenance on a crontab [my initial response--does this extension do dynamic creation at all then?]

Between the two of them it left me wondering if dynamic partitions were supported out of the box or not. I'd be happy to try and clarify and add a pull request, if that would help. Thanks!

keithf4 commented 10 years ago

If you read beyond that sentence you quoted, a reason is given for why I do not have it automatically creating partitions outside the covered range. For the majority of people doing interval partitioning, data is being ingested at a predictable rate for a known period of time, so it avoids a bigger problem if unpredicted data is suddenly ingested. Of course there are exceptions to this, but I'm designing for the common case and provide capabilities to monitor for data coming in outside of the cases I cover.

Scheduling calls to run_maintenance() is required for time based partitioning and optional for serial based partitioning. I'm trying to see if I can make it optional for time-based as well, but it's trickier than it was for serial. PostgreSQL does not have any sort of internal scheduling system, so it does require running it via a third-party scheduler at this time (cron, task manager, etc). So this is an issue for any scheduled task in Postgres, not just with this extension. PostgreSQL 9.3+ has a new feature called background workers that can allow an internal scheduled process to run and I do plan on adding that in the near future (it's a feature request issue in github).

keithf4 commented 10 years ago

Also, if you see issue https://github.com/keithf4/pg_partman/issues/19 you can see that creating new partitions automatically via the trigger can create contention and other issues for tables with a very high ingestion rate. So that's why run_maintenance() was made optional for serial partitioning.

keithf4 commented 10 years ago

Just checking to see if I was able to answer your questions.

rdp commented 10 years ago

I was kind of hoping for a clarification in the README and/or docs so that other newbie beginners like myself don't get confused into thinking it can't provide auto new partitions, FWIW. Cheers1

keithf4 commented 10 years ago

I'll see what I can add. Hopefully will have it automatic for newer versions of PostgreSQL soon. :)

rdp commented 10 years ago

OK I might submit another pull request of a few more documentation suggestions, feel free to not merge it again :)

Since this is the "mailing list" thread... a few more questions/comments:)

If I had one feature request it would be an added option "always create new partitions" when using id-dynamic (never default back to inserting in the master). or at least an option to raise an exception if something would be heading back to the master table, for those of us paranoid about accidentally getting too full in the MASTER. Similar to how oracle interval partitioning behaves [in this instance, we're trying to migrate to Postgres from Oracle...]

The fear I have here is that if my p_control column (for inserts) were to somehow accidentally "skip" from being early in one child table, to being early in (an uncreated child table) then the new child create block would be missed from then on. I know it'd be rare, just for peace of mind.

That's about it. I did notice that when you insert into it, it returns this:

@pg_partman: Query returned successfully: 0 rows affected, 11 ms execution time.

(which affects batch updates since they expect this value to be equal to the number of rows inserted). I had to go to some weird shenanigans to get this working in a different example: http://stackoverflow.com/a/26123701/32453

OK and a few questions that will show my newbieness...

When you have id-dynamic, does it avoid the possibility of two simultaneous queries creating "the new partitions" and one of them failing from some kind of race condition?

Using id-dynamic which generates a trigger like this: CREATE OR REPLACE FUNCTION public.streamreftotape_part_trig_func() ... It does a check_name_length check once per insert, isn't that a bit often [if it doesn't affect performance then I guess it doesn't matter much anyhow, just wondering].

Thanks and cheers! -roger-

keithf4 commented 10 years ago

When using triggers to create new tables on demand based on the data of the row, you can run into race conditions. So, I will not be adding any options to create partitions as needed at this time. Using RULES also has more problems than using trigger, which is why I went with triggers. This blog post explains in more detail the race conditions & problems with RULES https://securelist.com/blog/research/66988/tyupkin-manipulating-atm-machines-with-malware/

Other databases that have partitioning built in (MySQL, Oracle, etc) are handling the creation of new tables at a much lower level than triggers would run at so they can avoid race conditions like that blog post mentions. I would imagine when partitioning is built into PostgreSQL it will as well. This is why I did the premake option. If you anticipate future data being inserted, then you can just set the premake value higher to account for it. The only issue here is with static partitioning, you're adding more IF conditions to the trigger that may have to be checked. A premake value down around 10-20 would probably be fine. Getting up around 100 or more, you'd probably start seeing performance issues when you insert data that is further down the IF condition chain. This is when a dynamic trigger may be better, but then you run into the inherent performance loss of doing EXECUTE statements vs straight SQL.

If you check the "Maintenance" functions section of the documentation file, you will see there is a function check_parents() that you can use to monitor for any data getting inserted into parent tables. I'm likely not going to add any additional code to prevent data going into the parent, especially code that would throw an exception. I'd rather have the data inserted to the parent and not lost. It's easier to clean up unwanted data than try and retrieve data that may have been lost because an insert failed. If you get data inserted outside of the time period that the trigger is covering (for static), or for a child table that does not exist (for dynamic), it will just go into the parent. Use the above function to monitor for that occurring and you should be able to catch it before it causes a problem. If there's no data in any parent tables, it's a very fast function to run. If the table that data goes into does not exist, then the partitiondata* functions & python scripts will create the table that's needed. This will not affect the automatic creation of new tables by run_maintenance(). If a future table is created earlier than anticipated, then run_maintenance() will see it already exists and just skip creating it. But it will set the trigger function appropriately.

Yes, the issue of not getting a row count returned on insert is a problem. I've not found a way around it other than a trick that allows the insert to go to the parent, does another insert to the child then deletes the row from the parent. But this causes 3 DML statements for every one insert, so the performance issues can be significant. The view method could also work (like in your stackoverflow link), but that is harder to program in a general manner for something like pg_partman and I haven't had the time to test it out more. I don't think this will really be resolved well until partitioning is built into PostgreSQL.

As I said before, see https://github.com/keithf4/pg_partman/issues/19 for what happens when multiple queries are trying to create new tables in id-dynamic. It doesn't cause error or duplications, it causes lock contention.

Yes, I have to do the check_name_length() on every insert because, if postgres is truncating the name of your tables, it's the only way I can reliably determine what that child table name is supposed to be. It shouldn't affect performance that much, but this is yet another downside of using dynamic triggers for partitioning. I'll actually double-check and see if I can make check_name_length a STATIC or IMMUTABLE function, which can help performance. I'd highly recommend sticking with static partitioning if the majority of your data ingestion is going to be current data. If data going into the parent is only an accident, you can monitor for it and deal with it at that time and keep the performance benefit of static triggers. Program for the normal case and plan for handling the outliers. I believe I've provided the means to do that.

Doing trigger based partitioning with table inheritance is no where near as good as partitioning built into other RDBMS systems and I don't ever expect it to be. And there are caveats like those above that have to be dealt with. There are most certainly other methods than the way I did it, but I chose to go with the methods that create the least risk of data loss.

Hope that helps. FYI, I will be closing this thread whenever I get the next version released. I'm leaving it open as a reminder that I want to review the documentation to address your concerns and hopefully make things clearer.

rdp commented 10 years ago

Thanks for your responses. I'm learning a lot for sure.

keithf4 commented 9 years ago

I've updated the documentation in 1.8.0 so hopefully things are clearer.

rdp commented 9 years ago

thank you.

On Thu, Jan 15, 2015 at 12:27 PM, Keith Fiske notifications@github.com wrote:

I've updated the documentation in 1.8.0 so hopefully things are clearer.

— Reply to this email directly or view it on GitHub https://github.com/keithf4/pg_partman/issues/34#issuecomment-70145580.

pgpartman / pg_partman

mailing list #34