jeffsmohan opened this issue 3 months ago
@TylerHendrickson I'd love a quick review of the implementation plan before moving this to ready, thanks!
@TylerHendrickson @jeffsmohan Just want to make sure that `fiscal_year` should indeed be `text` and not `string`?
@masimons Good question! TL;DR: My bad, I think we should go with a `date` column instead.
Deeper explanation: Given that we don't control and thus can't 100% trust the data coming in from this data source, we've generally used `text` over `varchar(n)` so we can accommodate whatever strange data might crop up. (And FWIW, it looks like the performance is basically equivalent between them in Postgres.)
Given that this field represents a year, you could also make an argument for using an `integer` or `date` (constrained to the first day of the year) type. Before attempting that, you'd want to make sure there aren't any textual representations sneaking in (e.g., `<FiscalYear>2024 or 2025</FiscalYear>`). I ran a quick analysis, and at least to date the values are all normal years:

```shell
> grep '<FiscalYear>' ~/Downloads/GrantsDBExtract20240709v2.xml | sed -E 's/.*<FiscalYear>(.*)<\/FiscalYear>.*/\1/' | sort | uniq -c
  25 2016
  10 2017
  17 2018
  19 2019
  47 2020
  73 2021
  82 2022
 102 2023
 122 2024
 102 2025
  12 2026
```
Even so, we'd want our system to be robust to a rogue non-year value eventually slipping in. So this becomes a design question about where in our flow we want to do the "error handling" of turning an XML value (i.e., text) into a date: we could do it upstream in grants-ingest, in GOST when we insert the data into postgres, or leave it as text and let the frontend handle it, for example.
Given that we're storing other date-like values in postgres as `date` or `timestamp`, I think it's worth continuing that convention here. We can add some validation when we're converting the data for insertion. (I've also seen people recommend adding a constraint in the database itself to ensure the month and day are always Jan 1, but that feels like overkill to me. Just make it a date, and users can assume the month and day portions are meaningless.)
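A minimal sketch of that insertion-time validation (a hypothetical helper, not the actual GOST code): it accepts only a plain four-digit year and maps it to Jan 1 of that year, returning `null` for anything else so a rogue value can fail loudly instead of being silently coerced.

```javascript
// Hypothetical helper: convert a raw <FiscalYear> string to an ISO date
// string, or null if the value isn't a plain four-digit year.
function fiscalYearToDate(raw) {
  const trimmed = String(raw).trim();
  if (!/^\d{4}$/.test(trimmed)) {
    // Reject e.g. "2024 or 2025" so the caller can log/alert on it
    return null;
  }
  // Store as Jan 1 of the year; month and day are meaningless by convention
  return `${trimmed}-01-01`;
}
```

For example, `fiscalYearToDate('2024')` yields `'2024-01-01'`, while `fiscalYearToDate('2024 or 2025')` yields `null`.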
Just adding my 2¢ on a few things being discussed here:

- `text` is preferable to `varchar(n)` in nearly all cases – as @jeffsmohan noted, performance is pretty much identical (it's all the same variable-length array data type), so `varchar(n)` only really differs in the length constraint. Given that changing the length of a `varchar(n)` requires a full-column replacement, I think it's usually more convenient to enforce size limits on `text` column values using a check constraint if you need to do that at the DB level.
- On where to do validation: I think it belongs in the `grants-ingest` pipeline. Certainly there could be reasons to deviate here, and downstream consumers (which is currently just `usdr-gost`/Finder) might need their own sets of additional data validations, but one significant reason the `grants-ingest` pipeline exists is to ensure that grant data exists in a predictable, well-defined format.
- The `grants-ingest` pipeline has fairly comprehensive monitoring and alerting infrastructure for catching unexpected/edge-case problems with data quality, so that we can make informed decisions about dealing with those issues as they come up.

Wanted to leave a comment that the `opportunity_status` on the `grants` table currently tracks posted, closed, and archived statuses, and it may make sense to use this field instead of creating an `is_forecasted` flag.
Also, `close_date_explanation` on the `grants` table seems to map to `estimated_synopsis_close_date_explanation`, so we may be able to reuse this field as well.
### Why is this issue important?

We want forecasted grants available in our dataset (alongside active grants) so we can integrate them into the product for our users.

### Current State

The `grants` table in the postgres database currently has columns to represent active grants. Meanwhile, the forecasted grants we want to flow through the grants-ingest pipeline have a different set of fields. (Many are shared between the two data types, but each has a few unique ones.)

The following is a visual-diff output showing the field name differences between the `OpportunitySynopsisDetail_1_0` (left column) and `OpportunityForecastDetail_1_0` (right column) data models:

[visual diff of `OpportunitySynopsisDetail_1_0` vs. `OpportunityForecastDetail_1_0` field names]

### Expected State
We want the `grants` table to be expanded with the columns it will need to track forecasted grants alongside active grants in the same table. We plan to reuse some of the fields that already exist for posted grants. Therefore, the new fields should be:

- `forecast_creation_date`
- `award_date`
- `fiscal_year`
- `grantor_contact_name`
- `grantor_contact_phone_number`
### Implementation Plan

Write a knex database migration to do the following:

- Update the `grants` table to add the needed columns (all nullable):
  - `forecast_creation_date` (date)
  - `award_date` (date)
  - `fiscal_year` (date, always Jan 1 for the given year)
  - `grantor_contact_name` (text)
  - `grantor_contact_phone_number` (text)

### Deprecated
We want the `grants` table to be expanded with the columns it will need to track forecasted grants alongside active grants in the same table:

- `estimated_award_date`
- `estimated_project_start_date`
- `estimated_synopsis_close_date`
- `estimated_synopsis_close_date_explanation`
- `estimated_synopsis_post_date`
- `fiscal_year`
- `grantor_contact_name`
- `grantor_contact_phone_number`
- `is_forecasted` (this is not in the grants.gov data models, but we're including it for easy filtering by state)

#### Implementation Details
Write a knex database migration to do the following:

- Update the `grants` table to add the needed columns (all initially nullable):
  - `estimated_award_date` (date)
  - `estimated_project_start_date` (date)
  - `estimated_synopsis_close_date` (date)
  - `estimated_synopsis_close_date_explanation` (text)
  - `estimated_synopsis_post_date` (date)
  - `fiscal_year` (date, always Jan 1 for the given year)
  - `grantor_contact_name` (text)
  - `grantor_contact_phone_number` (text)
  - `is_forecasted` (boolean)
- Migrate data in the `grants` table: set `is_forecasted` to `false` for all existing rows
- Update the `is_forecasted` column to set a not-nullable constraint

Note that all current columns are already nullable except for `grant_id` and `created_at`. While this may not be ideal, it does mean we don't have to migrate any columns to be nullable for this migration.
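For reference, the knex migration described in the current (non-deprecated) Implementation Plan above might look roughly like the following sketch. The file name is illustrative, and the exact column builders would need to match the project's migration conventions:

```javascript
// e.g. migrations/20240801000000_add_forecast_columns_to_grants.js
// (file name is hypothetical; knex migration files are timestamped)

exports.up = (knex) => knex.schema.alterTable('grants', (table) => {
  // All new columns are nullable, per the plan above
  table.date('forecast_creation_date');
  table.date('award_date');
  table.date('fiscal_year'); // always Jan 1 for the given year, by convention
  table.text('grantor_contact_name');
  table.text('grantor_contact_phone_number');
});

exports.down = (knex) => knex.schema.alterTable('grants', (table) => {
  table.dropColumns(
    'forecast_creation_date',
    'award_date',
    'fiscal_year',
    'grantor_contact_name',
    'grantor_contact_phone_number',
  );
});
```

Since all the new columns are nullable and no backfill is needed, this plan avoids the multi-step add/backfill/constrain dance that the deprecated `is_forecasted` column would have required.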