ncopenpass / CampaignFinanceDataPipeline

Data Pipeline for NC Campaign Finance Dashboard
Apache License 2.0

Add exploratory notebook using NLP to extract names #13

Closed janash closed 3 years ago

janash commented 3 years ago

This PR adds a notebook that demonstrates using truecase and spaCy to perform natural language processing on committee names. spaCy can extract "named entities" and recognize the names of people. The notebook first applies truecasing to each committee name (because spaCy will not work on the all-caps names), then uses spaCy to extract the names. Some errors do occur, but it works nicely for many of the committees.
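For anyone reading without opening the notebook, here is a minimal sketch of the two steps described above (truecase first, then spaCy entity extraction). It is not copied from the notebook. The function names `load_default_pipeline` and `extract_person_names`, the `en_core_web_sm` model, and the sample committee name are assumptions for illustration. It also assumes truecase and spaCy are installed and the model has been downloaded; truecase may additionally need NLTK tokenizer data.

```python
from typing import Callable, List


def load_default_pipeline():
    """Return (truecase_fn, nlp) built from truecase and spaCy.

    Imports are deferred so the extraction function below can be used
    (and tested) without either library installed.
    Assumption: the small English model `en_core_web_sm` is available;
    the notebook may use a different model.
    """
    import spacy
    import truecase

    return truecase.get_true_case, spacy.load("en_core_web_sm")


def extract_person_names(
    committee_name: str,
    truecase_fn: Callable[[str], str],
    nlp: Callable,
) -> List[str]:
    """Restore casing on an all-caps name, then keep PERSON entities."""
    # Step 1: truecasing, so the entity recognizer sees normal capitalization.
    cased = truecase_fn(committee_name)
    # Step 2: run the NLP pipeline and keep entities labeled as people.
    doc = nlp(cased)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


# Example usage (hypothetical committee name; results depend on the model):
#   truecase_fn, nlp = load_default_pipeline()
#   extract_person_names("COMMITTEE TO ELECT JANE DOE", truecase_fn, nlp)
```

Passing the truecasing function and the pipeline in as arguments keeps the model load to a single call and makes it easy to swap in a different model or compare against the regular-expression approach on the same names.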

I wanted to offer this as a potential supplement to @jumptable's solution of using regular expressions, and also to keep it in mind as an approach for future problems. I've borrowed @jumptable's approach for creating an environment and downloading dependencies (the Makefile and .txt file are copied from his pull request).