Who Owns Massachusetts Processing and Deduplication

This repository deduplicates property owners in Massachusetts using the MassGIS standardized assessors' parcel dataset and legal entity data sourced from OpenCorporates under their 'public-benefit project' program. The process builds on Hangen and O'Brien's methods (2024), which are themselves similar (though not identical) to methods used by Henry Gomory (2021) and the Anti-Eviction Mapping Project's Evictorbook (see, e.g., McElroy and Amir-Ghassemi 2021). It also builds on Eric's experience leading development of a tool called TenantPower with Mutual Aid Medford and Somerville in 2020, which used the dedupe Python package in a manner similar to Immergluck et al. (2020).

While we share large parts of their approach (i.e., relying on community detection on company-officer relationships, following cosine-similarity deduplication of names), we believe that our results are more robust for several reasons. Inspired, in part, by Preis (2024), we expend a great deal of effort on address standardization so that we can use addresses themselves as network entities (prior approaches, with the exception of Preis, have just concatenated addresses and names prior to deduplication). This is a substantial change: "similar" addresses, by whatever measure, can still be very different addresses. By relying on standardized unique addresses, we believe that we are substantially reducing our false positive rate.

Community detection, based on both network analysis and cosine similarity, is accomplished using the igraph package's implementation of the fast greedy modularity optimization algorithm. Cosine similarity is calculated using the quanteda package.
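As a rough illustration of the network step, the sketch below runs igraph's fast greedy algorithm on a toy graph in which companies are linked to officers and to standardized addresses. The edge list, node labels, and object names are invented for this example and are not taken from the repository's code.

```r
library(igraph)

# Toy edge list (hypothetical data): each row links a company to an officer
# or to a standardized address, so addresses act as network entities.
edges <- data.frame(
  from = c("ACME REALTY LLC", "ACME HOLDINGS LLC",
           "ACME REALTY LLC", "ACME HOLDINGS LLC",
           "BEACON TRUST", "BEACON TRUST"),
  to   = c("officer: J DOE", "officer: J DOE",
           "address: 1 MAIN ST BOSTON", "address: 1 MAIN ST BOSTON",
           "officer: R ROE", "address: 9 ELM ST SALEM"),
  stringsAsFactors = FALSE
)

# Build an undirected graph; simplify() drops duplicate edges and loops,
# which the fast greedy algorithm does not accept.
g <- simplify(graph_from_data_frame(edges, directed = FALSE))

# Fast greedy modularity optimization.
comm <- cluster_fast_greedy(g)

# Named integer vector: node name -> community id.
memb <- setNames(as.integer(membership(comm)), V(g)$name)

# Node names grouped by detected community.
split(names(memb), memb)
```

In this toy graph the two Acme companies share both an officer and an address, so they would be expected to land in one community, separate from the Beacon entity.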

While the full process requires that you source OpenCorporates data, you can run the cosine-similarity-based deduplication process using only the assessors tables. (See the documentation for the OC_PATH configuration variable.)
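To give a sense of what cosine-similarity matching on names involves, here is a minimal base-R sketch. It is a stand-in for the quanteda-based calculation, not the repository's implementation: the character-bigram representation, the normalization, and the helper names char_ngrams() and cosine_matrix() are all assumptions made for illustration. The 0.85 threshold mirrors the documented COSINE_THRESH default.

```r
# Count character n-grams in one name after light normalization
# (uppercase, letters and digits only). Hypothetical helper.
char_ngrams <- function(x, n = 2) {
  x <- toupper(gsub("[^A-Za-z0-9]", "", x))
  if (nchar(x) < n) return(table(x))
  starts <- seq_len(nchar(x) - n + 1)
  table(substring(x, starts, starts + n - 1))
}

# Pairwise cosine similarity between names, based on n-gram counts.
cosine_matrix <- function(owner_names, n = 2) {
  grams <- lapply(owner_names, char_ngrams, n = n)
  vocab <- unique(unlist(lapply(grams, names)))
  m <- matrix(0, nrow = length(owner_names), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(grams)) {
    m[i, names(grams[[i]])] <- as.numeric(grams[[i]])
  }
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

owner_names <- c("ACME REALTY LLC", "ACME REALTY, LLC", "BEACON HILL TRUST")
COSINE_THRESH <- 0.85

sim <- cosine_matrix(owner_names)

# Keep each unordered pair once (upper triangle) if it meets the threshold.
pairs <- which(sim >= COSINE_THRESH & upper.tri(sim), arr.ind = TRUE)
matches <- data.frame(
  a = owner_names[pairs[, "row"]],
  b = owner_names[pairs[, "col"]]
)
matches
```

Lowering the threshold in a sketch like this admits pairs of less closely related strings, which is the trade-off the COSINE_THRESH and INDS_THRESH settings control.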

Getting Started

Data Dictionary

Please consult the data dictionary for field definitions.


This library's dependencies are managed using renv. To install necessary dependencies, simply install renv and run renv::restore(). If you are using Windows, you'll probably have to install the Rtools bundle appropriate for your version of R.


The repository uses an instance of PostgreSQL with the PostGIS extension as its primary data store. You'll need to set up a PostGIS instance on either localhost or a server.

Setting up .Renviron

The scripts expect to find your PostgreSQL credentials, host, port, etc. in an .Renviron file with the following environment variables defined:

# Or whatever your port
# Will likely need to be "require" for remote.

Optionally, you can use the PUSH_DBS configuration parameter to specify a different database you'd like to point subroutine results to, allowing you to separate, for example, a development environment from a production environment. If you'd like to make use of this parameter, you'll need to pass a string value to the appropriate named elements in PUSH_DBS (see section 'Configuration (config.R)' below), or to load_results("yourstring") and define...

If you modify your .Renviron mid-RStudio session, you can simply run readRenviron('.Renviron') to reload.

.Renviron is in .gitignore to ensure that you don't commit your credentials.
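The prefix convention documented for PUSH_DBS (an empty string maps to "DB_NAME", while a string such as "yourstring" maps to "YOURSTRING_DB_NAME") can be sketched as follows. The helper db_env_name() is hypothetical and only illustrates the naming rule; it is not a function from this repository.

```r
# Hypothetical helper: build the environment variable name for a given
# prefix. An empty prefix means the unprefixed variable is used.
db_env_name <- function(prefix = "", key = "DB_NAME") {
  if (!nzchar(prefix)) key else paste0(toupper(prefix), "_", key)
}

db_env_name("")            # unprefixed variable name
db_env_name("yourstring")  # prefixed, uppercased variable name

# Look up the value; returns NA if the variable is not defined in .Renviron.
Sys.getenv(db_env_name("yourstring"), unset = NA)
```

Because the prefix is uppercased before lookup, it can be passed in either case, but the variables in .Renviron themselves must be all uppercase.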

Loading Results (load_results.R)

If you just want to read the results without triggering the deduplication process, you can start a new R script, source load_results.R, and run a one-liner like so...

load_results("your_db_prefix", load_boundaries=TRUE, summarize=TRUE)

This will load companies, munis, officers, owners, sites, sites_to_owners, parcels_point, metacorps_cosine and metacorps_network into your R environment. If load_boundaries is TRUE, it will also return munis, zips, tracts, and block_groups.

Please consult the data dictionary for field definitions.

If summarize is TRUE, it will return a number of summary fields for officers, metacorps_cosine, and metacorps_network that are useful for diagnosing cases of over-inclusion in the network analysis. These appear in the data dictionary as well.

This requires that you have .Renviron set up with appropriate prefixes (see 'Setting up .Renviron', above).

Note that statewide results are very large tables, so loading them may take 5-10 minutes depending on your network connection and on whether you're reading from a local or remote database.

Running the Process (run.R)

This is a very time-consuming process, even for small subsets, because the companies and officers tables are large and must be processed in full for reliable results even when the spatial subset is small. On a 2021 Apple M1 Max chip with 64 GB of memory, a full statewide run takes a little under 13 hours.

We provide an omnibus manage_run() function in run.R. It does preflight testing and triggers three sequences: a data ingestion sequence (load_read_write_all(), see R/loaders.R), a data processing sequence (proc_all(), see R/processors.R) and a deduplication sequence (dedupe_all(), see R/deduplicators.R).

We recommend running from the terminal using...

Rscript run.R

This is because when the process is run interactively (i.e., in an RStudio environment), intermediate results are stored in an output object, which has memory costs. You can then read the results with load_results.R, as described above.

If the process is run interactively, it automatically outputs results to objects in your environment (including intermediate results if RETURN_INTERMEDIATE is TRUE in config.R). It also writes results to .csv and .Rda files in /results, but never reads them back: the PostgreSQL database is the only output location from which our scripts read data.

Configuration (config.R)

We expose a large number of configuration variables in config.R, which is sourced in run.R. In order...

| Variable | Description |
| --- | --- |
| COMPLETE_RUN | Default: FALSE. A little helper that overrides values such that ROUTINES=list(load = TRUE, proc = TRUE, dedupe = TRUE), REFRESH=TRUE, MUNI_IDS=NULL, and COMPANY_TEST=FALSE. This ensures a fresh, statewide run on complete datasets, not subsets. |
| REFRESH | Default: TRUE. If TRUE, datasets will be reingested regardless of whether results already exist in the database. |
| PUSH_DBS | Default: list(load = "", proc = "", dedupe = ""). Named list with string values. If "", looks for .Renviron database connection parameters of the format "DB_NAME". If a string is passed, looks for parameters of the format "YOURSTRING_DB_NAME", where YOURSTRING can be passed in upper or lower case, though the parameters themselves must be all uppercase. Note that whatever dedupe is set to is treated as "production", meaning that select intermediate tables from previous subroutines are pushed there as well. Requires that you set .Renviron parameters (see section 'Setting Up .Renviron' above). |
| ROUTINES | Default: list(load = TRUE, proc = TRUE, dedupe = TRUE). Allows the user to run individual subroutines (i.e., load, process, deduplicate). The subroutines are not totally independent, but each will run in a simplified manner when it is set to FALSE here, returning only results needed by subsequent subroutines. |
| MUNI_IDS | Default: c(274, 49, 35). If NULL, runs the process for all municipalities in Massachusetts. If "hns", runs the process for Healthy Neighborhoods Study municipalities (minus Everett, because they don't make owner names consistently available). If "mapc", runs the process for all municipalities in the MAPC region. Otherwise, a vector of numbers or strings, which must match the municipality IDs used by the state. (Consult muni_ids.csv for these.) If numbers, they will be 0-padded. |
| MOST_RECENT | Default: FALSE. If TRUE (and the complete vintages MassGIS collection is being used), reads the most recent vintage for each municipality. If FALSE, attempts to determine which vintage has the largest number of municipalities reporting, selecting that year where possible (and selecting the most recent where a given municipality did not report in that year). |
| COMPANY_TEST_COUNT | Default: 50000. The OpenCorporates datasets are big. For that reason, during development it's useful to read in test subsets. This is the number of companies to read in when COMPANY_TEST is TRUE. |
| COMPANY_TEST | Default: TRUE. If TRUE, reads in only COMPANY_TEST_COUNT companies and any officers associated with those companies. (Usually on the order of 4x the number of companies.) |
| RETURN_INTERMEDIATE | Default: TRUE. If TRUE, run() returns intermediate tables. Otherwise, loads only the tables yielded by the last subroutine requested into the R environment while writing all tables to the appropriate databases. (I.e., if ROUTINES is list(load=TRUE, proc=TRUE, dedupe=FALSE) and RETURN_INTERMEDIATE is TRUE, it will load tables yielded by proc_all() into the R environment.) |
| COSINE_THRESH | Default: 0.85. The minimum cosine similarity treated as a match. Lower numbers yield matches on less closely related strings. |
| INDS_THRESH | Default: 0.95. The minimum cosine similarity treated as a match for non-institutional owners. Lower numbers yield matches on less closely related strings. This should generally be higher than COSINE_THRESH because there are so many more duplicative names. Note that this is address-bounded, so even close matches will not appear as the same unless there is a shared address. |
| ZIP_INT_THRESH | Default: 1. One of our address-parsing tricks is to use ZIP codes that fall entirely within a single MA municipality to fill missing cities, and MA municipalities that fall entirely within a ZIP code to fill missing ZIP codes. This adjusts how close to 'entirely' these need to be. Note that a value of 1 introduces substantial computational efficiencies because we can simply use a spatial join with a sf::st_contains_properly predicate rather than the much more expensive intersection. (It also means, unfortunately, that there are none of the second case: no municipalities fall entirely within ZIP codes without some fuzziness.) |
| QUIET | Default: FALSE. If TRUE, suppresses log messages. Logs are written to a datetime-stamped file in /logs. |
| CRS | Default: 2249. EPSG code for the coordinate reference system of spatial outputs and almost any spatial analysis in the workflow. 2249 is NAD83 / Massachusetts Mainland in US feet. (The 'almost' is because ZIPs are processed nationwide using NAD 83 / Conus Albers, AKA EPSG 5070. We don't expose this.) |
| DATA_PATH | Default: "data". This is the folder where input datasets (i.e., OpenCorporates data and MassGIS parcel databases) are located. Do not change unless you also plan on moving luc_crosswalk.csv and muni_ids.csv. |
| RESULTS_PATH | Default: "results". This is the folder where resulting .csv and .Rda files will be written. Note that tables will always be written to the PostGIS database, so this is for backup/uncredentialed result transfer only. |
| OC_PATH | Default: "2024-04-12". Either the name of the folder (within /data) that contains the OpenCorporates bulk data or NULL. Scripts depend on companies.csv and officers.csv. If NULL, a simplified cosine-similarity deduplication routine will run, returning a simpler set of tables. |
| GDB_PATH | Default: "L3_AGGREGATE_FGDB_20240703". This is either a folder (within /data) containing all the vintages of the MassGIS parcel data or a single most recent vintage geodatabase (in /data). |
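The override behavior described for COMPLETE_RUN can be pictured with a small sketch. The list-based config object and the apply_complete_run() helper are assumptions made for illustration; config.R itself may define these values differently (for example, as top-level variables).

```r
# Hypothetical config held as a list; values chosen to show the override.
config <- list(
  COMPLETE_RUN = TRUE,
  ROUTINES     = list(load = TRUE, proc = FALSE, dedupe = FALSE),
  REFRESH      = FALSE,
  MUNI_IDS     = c(274, 49, 35),
  COMPANY_TEST = TRUE
)

# Hypothetical helper: when COMPLETE_RUN is TRUE, force the settings that
# the documentation says it overrides; otherwise leave the config untouched.
apply_complete_run <- function(cfg) {
  if (isTRUE(cfg$COMPLETE_RUN)) {
    cfg$ROUTINES     <- list(load = TRUE, proc = TRUE, dedupe = TRUE)
    cfg$REFRESH      <- TRUE
    cfg["MUNI_IDS"]  <- list(NULL)  # keep the element, set its value to NULL
    cfg$COMPANY_TEST <- FALSE
  }
  cfg
}

effective <- apply_complete_run(config)
str(effective)
```

Assigning list(NULL) with single brackets keeps MUNI_IDS present in the list with a NULL value, whereas cfg$MUNI_IDS <- NULL would remove the element entirely.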


Required External Data

Successful execution of all features of this software requires that you source the following datasets:

Additional Data Sources

In addition, the script pulls in data from a range of sources to enrich our datasets. All of these are ingested from API and web sources by the script, so there is no need to source them independently.


This work received grant support from the Conservation Law Foundation and was developed under the auspices of the Healthy Neighborhoods Study in the Department of Urban Studies and Planning at MIT with input from the Metropolitan Area Planning Council. OpenCorporates has also been a supportive data partner.
