[Forecasted Pipeline] Update grants table in GOST

jeffsmohan commented 3 months ago

Why is this issue important?

We want forecasted grants available in our dataset (alongside active grants) so we can integrate them into the product for our users.

Current State

The grants table in the postgres database currently has columns to represent active grants. Meanwhile, the forecasted grants we want to flow through the grants-ingest pipeline have a different set of fields. (Many are shared between the two data types, but each has a few unique ones.)

The following is a visual-diff output showing the field name differences between the OpportunitySynopsisDetail_1_0 (left column) and OpportunityForecastDetail_1_0 (right column) data models:

AdditionalInformationOnEligibility                              AdditionalInformationOnEligibility
AdditionalInformationText                                       AdditionalInformationText
AdditionalInformationURL                                        AdditionalInformationURL
AgencyCode                                                      AgencyCode
AgencyName                                                      AgencyName
ArchiveDate                                                     ArchiveDate
AwardCeiling                                                    AwardCeiling
AwardFloor                                                      AwardFloor
CategoryExplanation                                             CategoryExplanation
CategoryOfFundingActivity                                       CategoryOfFundingActivity
CFDANumbers                                                     CFDANumbers
CloseDate                                                     <
CloseDateExplanation                                          <
CostSharingOrMatchingRequirement                                CostSharingOrMatchingRequirement
Description                                                     Description
EligibleApplicants                                              EligibleApplicants
                                                              > EstimatedAwardDate
                                                              > EstimatedProjectStartDate
                                                              > EstimatedSynopsisCloseDate
                                                              > EstimatedSynopsisCloseDateExplanation
                                                              > EstimatedSynopsisPostDate
EstimatedTotalProgramFunding                                    EstimatedTotalProgramFunding
ExpectedNumberOfAwards                                          ExpectedNumberOfAwards
                                                              > FiscalYear
FundingInstrumentType                                           FundingInstrumentType
GrantorContactEmail                                             GrantorContactEmail
GrantorContactEmailDescription                                  GrantorContactEmailDescription
GrantorContactText                                            | GrantorContactName
                                                              > GrantorContactPhoneNumber
LastUpdatedDate                                                 LastUpdatedDate
OpportunityCategory                                             OpportunityCategory
OpportunityCategoryExplanation                                  OpportunityCategoryExplanation
OpportunityID                                                   OpportunityID
OpportunityNumber                                               OpportunityNumber
OpportunityTitle                                                OpportunityTitle
PostDate                                                        PostDate
Version                                                         Version

Expected State

We want the grants table to be expanded with the columns it will need to track forecasted grants alongside active grants in the same table. We plan to reuse some of these fields that already exist for posted grants. Therefore, the new fields should be:

forecast_creation_date
award_date
fiscal_year
grantor_contact_name
grantor_contact_phone_number

Implementation Plan

Write a knex database migration to do the following:

Update the grants table to add the needed columns (all nullable):

forecast_creation_date (date)
award_date (date)
fiscal_year (date, always Jan 1 for the given year)
grantor_contact_name (text)
grantor_contact_phone_number (text)

Deprecated:

We want the grants table to be expanded with the columns it will need to track forecasted grants alongside active grants in the same table:

estimated_award_date
estimated_project_start_date
estimated_synopsis_close_date
estimated_synopsis_close_date_explanation
estimated_synopsis_post_date
fiscal_year
grantor_contact_name
grantor_contact_phone_number
is_forecasted (this is not in the grants.gov data models, but we're including it for easy filtering by state)

Implementation Details:

Write a knex database migration to do the following:

Update the grants table to add the needed columns (all initially nullable):
- estimated_award_date (date)
- estimated_project_start_date (date)
- estimated_synopsis_close_date (date)
- estimated_synopsis_close_date_explanation (text)
- estimated_synopsis_post_date (date)
- fiscal_year (date, always Jan 1 for the given year)
- grantor_contact_name (text)
- grantor_contact_phone_number (text)
- is_forecasted (boolean)
Migrate data in the grants table:
- Set is_forecasted to false for all existing rows
Update the is_forecasted column to set a not nullable constraint

Note that all current columns are already nullable except for grant_id and created_at. While this may not be ideal, it does mean we don't have migrate any columns to be nullable for this migration.

jeffsmohan commented 3 months ago

@TylerHendrickson I'd love a quick review of the implementation plan before moving this to ready, thanks!

masimons commented 2 months ago

@TylerHendrickson @jeffsmohan Just want to make sure that fiscal_year should indeed be text and not string?

jeffsmohan commented 2 months ago

@masimons Good question! TLDR: My bad, I think we should go with a date column instead.

Deeper explanation: Given that we don't control and thus can't 100% trust the data coming in from this data source, we've generally used text over varchar(n) so we can accommodate whatever strange data might crop up. (And fwiw, it looks like the performance is basically equivalent between them in postgres.)

Given that this field represents a year, you could also make an argument for using an integer or date (constrained to the first day of the year) type. Before attempting that, you'd want to make sure there aren't any textual representations sneaking in (e.g., <FiscalYear>2024 or 2025</FiscalYear>). I ran a quick analysis, and at least to-date the values are all normal years:

> grep '<FiscalYear>' ~/Downloads/GrantsDBExtract20240709v2.xml | sed -E 's/.*<FiscalYear>(.*)<\/FiscalYear>.*/\1/' | sort | uniq -c
  25 2016
  10 2017
  17 2018
  19 2019
  47 2020
  73 2021
  82 2022
 102 2023
 122 2024
 102 2025
  12 2026

Even so, we'd want our system to be robust to eventually a rogue non-year value slipping in. So this becomes design question about where in our flow we want to do the "error handling" of turning an XML value (i.e., text) into a date — we could do it upstream in grants-ingest, in GOST when we insert the data into postgres, or leave it as text and let the frontend handle it, for example.

Given that we're storing other date-like values in postgres as date or timestamp, I think it's worth continuing that convention here. We can add some validation when we're converting the data for insertion. (I've also seen people recommend adding a constraint in the database itself to ensure the month and day are always Jan 1, but that feels like overkill to me. Just make it a date, and users can assume the month and day portions are meaningless.)

TylerHendrickson commented 2 months ago

Just adding my 2¢ on a few things being discussed here:

At least in Postgres, text is preferable to varchar(n) in nearly all cases – as @jeffsmohan noted, performance is pretty much identical (it's all the same variable-length array data type), so varchar(n) only really differs in the length constraint. Given that changing the length of a varchar(n) requires a full-column replacement, I think it's usually more convenient to enforce size limits on text column values using a check constraint if you need to do that at the DB level.
In general, my preference would be to ensure data quality within the grants-ingest pipeline. Certainly there could be reasons to deviate here, and downstream consumers (which is currently just usdr-gost/Finder) might need their own sets of additional data validations, but one significant reason the grants-ingest pipeline exists is to ensure that grant data exists in a predictable, well-defined format.
- As a practical note, the grants-ingest pipeline has fairly comprehensive monitoring and alerting infrastructure for catching unexpected/edge-case problems with data quality so that we can make informed decisions about dealing with those issues as they come up.

masimons commented 2 months ago

Wanted to leave a comment that the opportunity_status on the grants table currently tracks posted, closed, and archived statuses, and it may make sense to use this field instead of creating a is_forecasted flag.

Also, close_date_explanation on the grants table seems to map to estimated_synopsis_close_date_explanation, so we may be able to reuse this field as well.

usdigitalresponse / usdr-gost