I am trying to understand what you guys are doing and I think I can make this suggestion, reasonably, based on what I am seeing now.
It appears that you all build up the filings table by pulling part of the data out of the FILER_FILINGS_CD table. You only use filings for 3 forms, the 450, the 460 and the 497. And then you mark the early amendment filings with the is_duplicate flag. But then you also set the is_duplicate flag for the rows where the PERIOD_ID does not match to your table for the periods.
You do not delete the rows marked with is_duplicate but I do not see in other code where you filter these rows out of results. But I may just be not seeing it. Regardless, it appears you are keeping the early amendment rows in the same table as the others.
I wanted to suggest an approach I was working on, which is to partition all tables that contain a FILING_ID by several criteria. The first criteria was the early amendment. So, for example, I created a "expn_lows" table next to my "expn" table and moved all early amendment data to that table so that those rows would not appear in the "expn" table.
First, I add a amend_id_high column to the filer_filings table and run SQL like this on all of the tables that contain a FILING_ID column and as AMEND_ID column:
--------------
update filer_filings f1,
(select filing_id, max(amend_id) as max
from cvr_campaign_disclosure group by filing_id) as a1
set f1.amend_id_high = a1.max
where f1.filing_id = a1.filing_id and
(f1.amend_id_high is NULL or f1.amend_id_high < a1.max)
--------------
Query OK, 190135 rows affected (53.69 sec)
Rows matched: 190135 Changed: 190135 Warnings: 0
--------------
update filer_filings f1,
(select filing_id, max(amend_id) as max
from cvr_e530 group by filing_id) as a1
set f1.amend_id_high = a1.max
where f1.filing_id = a1.filing_id and
(f1.amend_id_high is NULL or f1.amend_id_high < a1.max)
--------------
Query OK, 187 rows affected (0.68 sec)
Rows matched: 187 Changed: 187 Warnings: 0
I think having the high amend_id value associated clearly with the filer_filings row is useful and will be more useful when other forms get analysed.
Just FYI, I then partition each table into years, so I have "expn_2000", "expn_2001", ..., "expn_2015" and "expn_else" (for everything < 2000 and > 2015). And then I was going to start partitioning the tables by form_id. This would end up creating a lot of tables, but on the other hand, it would be pretty obvious what the "F460_EXPN_2014" table was supposed to represent, where it would be used and it could make it easier for apps to access the data more quickly, assuming they would search within a year. This might be good, yes?
I am not asking for someone else to do work based on this suggestion. But I would like to know whether you all think that this might be a reasonable thing to consider.
I am trying to understand what you guys are doing and I think I can make this suggestion, reasonably, based on what I am seeing now.
It appears that you all build up the filings table by pulling part of the data out of the FILER_FILINGS_CD table. You only use filings for 3 forms, the 450, the 460 and the 497. And then you mark the early amendment filings with the is_duplicate flag. But then you also set the is_duplicate flag for the rows where the PERIOD_ID does not match to your table for the periods.
You do not delete the rows marked with is_duplicate but I do not see in other code where you filter these rows out of results. But I may just be not seeing it. Regardless, it appears you are keeping the early amendment rows in the same table as the others.
I wanted to suggest an approach I was working on, which is to partition all tables that contain a FILING_ID by several criteria. The first criteria was the early amendment. So, for example, I created a "expn_lows" table next to my "expn" table and moved all early amendment data to that table so that those rows would not appear in the "expn" table.
First, I add a amend_id_high column to the filer_filings table and run SQL like this on all of the tables that contain a FILING_ID column and as AMEND_ID column:
I think having the high amend_id value associated clearly with the filer_filings row is useful and will be more useful when other forms get analysed.
Just FYI, I then partition each table into years, so I have "expn_2000", "expn_2001", ..., "expn_2015" and "expn_else" (for everything < 2000 and > 2015). And then I was going to start partitioning the tables by form_id. This would end up creating a lot of tables, but on the other hand, it would be pretty obvious what the "F460_EXPN_2014" table was supposed to represent, where it would be used and it could make it easier for apps to access the data more quickly, assuming they would search within a year. This might be good, yes?
I am not asking for someone else to do work based on this suggestion. But I would like to know whether you all think that this might be a reasonable thing to consider.