sfcpc / housing-dashboard

[Schemaless][Ramp up exercise] Add a new source to the schemaless table #173

Open ilakyapal opened 4 years ago

ilakyapal commented 4 years ago

In general, a new source needs to be added to the schemaless table whenever there is a new dataset with housing-related data that our scripts need to take into account. Sources that our scripts currently consume to generate the schemaless table include the Planning pipeline dataset, the PTS dataset, and the MOHCD pipeline dataset (see the README for the full list).

This exercise walks through how to add a new source to the schemaless table generation. It should not require much code, hopefully just a few lines in the right places :)

At this link, you will find a spreadsheet with data (that I mocked) for a new "TestSource". For the purposes of this exercise, pretend that this is some dataset that you need to incorporate in the schemaless table generation.

Here are the steps you'll need to go through:

  1. Download the dataset. Go to this link, and make sure to download a CSV version of the "dataset". Store the downloaded file somewhere in the data/ directory.
  2. Add a new source class to "schemaless/sources.py". Take a look at the "TCO" class for a simple example to follow. You will want your class to include the same key fields (like "NAME", "FK", "FIELDS", etc.). You can leave the "DATA_SF" and "DATA_SF_DOWNLOAD" fields empty for now, since this is just a dummy test source.
  3. Make sure to update "source_map" at the bottom of "schemaless/sources.py" to account for the new source class you created.
  4. Update "main" at the bottom of "schemaless/create_schemaless.py" by adding an argument to the parser so the CSV file for the new TestSource can be passed into the script. You will also want to update the run function that is used here so that it takes in the new file.
  5. Update "schemaless/create_uuid_map.py" so that records in the TestSource are assigned the same UUID as the PRJ records that they correspond to. To do this, add a new subclass of "RecordGraphBuilderHelper" for the TestSource, and override the "process" function. You can take a look at "PermitAddendaSummaryHelper" for a very simple example to follow. (Hint: The one difference is that you will want to use the "PlanningHelper" and its "find_by_id" function within your implementation of the "process" function, unlike "PermitAddendaSummaryHelper", which uses the "PTSHelper".)
  6. Update self.helpers in RecordGraphBuilder's init function to take into account the class you created in step 5.
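The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the repository's actual code: the field values, the `--test_source_file` flag, the `PlanningHelperStub` class, the `prj_id` and `record_id` column names, and the `process(row, links)` signature are all assumptions made up for this example. In the real code the helper would subclass "RecordGraphBuilderHelper" and use the real "PlanningHelper", so copy the actual shapes from "TCO" and "PermitAddendaSummaryHelper".

```python
import argparse


class TestSource:
    """Step 2: hypothetical source class for the mocked dataset.

    The attribute names come from the instructions; every value here is
    an assumption, so mirror the real TCO class instead.
    """
    NAME = "test_source"
    FK = "record_id"  # assumed name of the column identifying a record
    FIELDS = {"record_id": "record_id", "status": "status"}  # assumed columns
    DATA_SF = ""           # left empty: this is a dummy source
    DATA_SF_DOWNLOAD = ""  # left empty: this is a dummy source


# Step 3: register the class so it can be looked up by name.
source_map = {TestSource.NAME: TestSource}


def build_parser():
    """Step 4: accept the TestSource CSV path on the command line."""
    parser = argparse.ArgumentParser()
    # The flag name is hypothetical.
    parser.add_argument("--test_source_file", default="",
                        help="Path to the TestSource CSV")
    return parser


class PlanningHelperStub:
    """Stand-in for PlanningHelper; only find_by_id is modelled."""

    def __init__(self, prj_ids):
        self._prj_ids = set(prj_ids)

    def find_by_id(self, record_id):
        # Return the matching PRJ id, or None when there is no such record.
        return record_id if record_id in self._prj_ids else None


class TestSourceHelper:
    """Steps 5 and 6: link each TestSource row to the PRJ record it references."""

    def __init__(self, planning_helper):
        self.planning_helper = planning_helper

    def process(self, row, links):
        # Look up the parent PRJ record and record the link if it exists.
        parent = self.planning_helper.find_by_id(row["prj_id"])
        if parent is not None:
            links[row["record_id"]] = parent
```

In this sketch, a row whose `prj_id` matches a known PRJ record gets linked, and a row with an unknown `prj_id` is left unlinked rather than raising an error. Whether the real helpers behave that way is something to confirm against the existing code.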

Congratulations! You have now set up the scripts to account for the new source. Now let's test it out. To do so:

  1. Update gen-test-data.sh so that our test schemaless/uuid map files incorporate data from TestSource. Add an argument here passing in the CSV file that you stored in the data/ directory earlier. (Since the TestSource CSV is small, we don't need to create a subset of it for testing purposes, like we do for huge sources such as planning.)
  2. Run gen-test-data.sh, so that the schemaless and uuid map files get regenerated with the TestSource data.
  3. Sanity check that records from your TestSource were successfully incorporated in the schemaless/uuid map files by taking a look at testdata/uuid-map-one.csv (this should include rows for each of the records in the TestSource).
  4. Update the tests so they pass! Update test_just_dump within schemaless/test_create_schemaless.py to account for the new data source. Add a test to schemaless/test_create_uuid_map.py to verify that your TestSource records are linked to the appropriate PRJ records. You can follow this example.
  5. Run your tests using pytest and make sure they pass!
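The sanity check in step 3 can also be done with a few lines of Python instead of reading the file by eye. The sketch below is only an illustration: the column names `record_id` and `fk` are assumptions, so replace them with whatever headers the TestSource CSV and the generated uuid map actually use.

```python
import csv
import io


def missing_from_uuid_map(source_csv, uuid_map_csv,
                          source_id_col="record_id", map_id_col="fk"):
    """Return the TestSource IDs that have no row in the uuid map.

    Both arguments are open file-like objects; the column names are
    hypothetical defaults.
    """
    source_ids = [row[source_id_col] for row in csv.DictReader(source_csv)]
    mapped_ids = {row[map_id_col] for row in csv.DictReader(uuid_map_csv)}
    return [record_id for record_id in source_ids if record_id not in mapped_ids]


# In-memory demo data so the example runs without any files on disk.
demo_source = io.StringIO("record_id,status\nT1,open\nT2,closed\n")
demo_uuid_map = io.StringIO("uuid,fk\nabc,T1\n")
print(missing_from_uuid_map(demo_source, demo_uuid_map))  # prints ['T2']
```

To run it against real output, open the downloaded TestSource CSV and testdata/uuid-map-one.csv with `open(path, newline="")` and pass the two file objects in. An empty list means every TestSource record made it into the uuid map.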

You're done!

shermanpeng17 commented 4 years ago

Hi, I wanted to say thanks for writing these instructions. I was able to create a new source in the schemaless table.

ilakyapal commented 4 years ago

No problem! I'm glad they were useful!