untoldone / bloomapi

Create APIs out of public datasources
https://www.bloomapi.com/documentation/public-data
MIT License
89 stars, 29 forks

Investigate parsing and mapping of taxonomy and affiliation identifiers #30

Closed: marks closed this issue 9 years ago

marks commented 11 years ago

Needed per discussions with others and the following post on the docgraph mailing list:

I was hoping someone already figured out a good way to separate out the dentist and nurses out of the NPPES data by Taxonomy Code in order to get a more manageable data set to hopefully get more insight on the doctor data.
via Ross Parks parkselectricalservice@gmail.com

marks commented 11 years ago

Some resources:

rossparks commented 11 years ago

Here is a start but the Taxonomy Codes need to be broken out. (http://www.nucc.org/index.php?option=com_content&view=article&id=107&Itemid=132)

I have found that the first two digits are a good start. Specialty Allopathic & Osteopathic Physicians, for instance, I would normally filter out, but if you dive deeper you see that these are not necessarily the type of doctors you would think they are.

untoldone commented 11 years ago

@rossparks What do you mean by broken out? Do you mean categorized? If the type, classification and specialization fields were included for each taxonomy code in bloomapi would that be enough or would you want more details?

rossparks commented 11 years ago

Well, it seems that the first two digits specify the "specialty", so that would separate out the dentists from the doctors, and the first letter in the taxonomy code helps determine, for instance, what type of family practice doctor they are.

An example: the taxonomy codes provided in the data are 101Y0, 101YA, 101YM and 101YP.

  1. The first 10 tells us the type of doc: 10 = Behavioral Health & Social Service Providers.
  2. The next 1 or Y, or some combination, tells us the classification of doc: 1Y = Counselor.
  3. The next letter tells us the specialty: A = Addiction, M = Mental Health, P = Pastoral, another P = Professional, S = School, 0 = no specialization.

The taxonomy codes in the public data seem to be only 5 characters long, whereas the full_Code is 10 characters long. That may be a problem when assigning the correct taxonomy description to the docs.
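The positional reading described above can be sketched in a few lines of JavaScript. This is only an illustration of the informal scheme in this comment (two characters of type, two of classification, one of specialization), not the NUCC specification; `splitTaxonomyPrefix` is a hypothetical helper name.

```javascript
// Hypothetical helper: splits a taxonomy prefix by the positions described
// in this comment. This is an informal reading, not an official NUCC rule.
function splitTaxonomyPrefix(code) {
  if (typeof code !== 'string' || code.length < 5) {
    throw new Error('expected at least 5 characters, got: ' + code);
  }
  return {
    type: code.slice(0, 2),           // e.g. "10"
    classification: code.slice(2, 4), // e.g. "1Y"
    specialization: code.slice(4, 5)  // e.g. "A", or "0" for none
  };
}

// The four example codes from the comment above.
const exampleCodes = ['101Y0', '101YA', '101YM', '101YP'];
const parsed = exampleCodes.map(splitTaxonomyPrefix);
console.log(parsed);
```

Grouping on `type` alone would be enough to separate broad provider groups; the other two fields only matter when finer filtering is wanted.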

I'll make edits to this post later... busy afternoon.

rossparks commented 11 years ago

Sorry, to answer your question: yes, I think that would be enough if you had Type, Classification and Specialty.

marks commented 11 years ago

It can wait until our call next week, @untoldone , but I am trying to think how best to get this started.

  1. My first inclination to just get the data returned w/ the rest of the data per NPI is to make a db table with the contents of the CSV from http://www.nucc.org/index.php?option=com_content&view=article&id=107&Itemid=132 (or another source of the data.. whatever is most truthy). This requires no parsing of the taxonomy code and just bringing in more DB data at time of JSON rendering.
  2. The complexity/confusion, to me, begins when we want to be able to QUERY by this. At this point, we would want to be able to search by a taxonomy code, and I guess we could just search the 15 taxonomy_code columns (i.e. pseudo SQL: select all npis where a taxonomy_code begins with 10 for Behavioral Health & Social Service Providers, or select all npis where the second two characters in the taxonomy_code are 1Y for Counselors)? Seems messy, but it seems pretty clear your goal was to create an API off of the single table and not de-normalize the data, only the JSON representation (I can see pros and cons for both ways).

Mark
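To make the pseudo SQL in point 2 concrete, here is a minimal sketch that builds a parameterized query over the 15 columns. The table name `npis`, the column names `taxonomy_code_01` through `taxonomy_code_15`, and the function name are all assumptions for illustration; the `$1` placeholder is PostgreSQL-style, and the `{ text, values }` shape is simply one convenient way to carry the query and its parameter together.

```javascript
// Hypothetical query builder. Table and column names are assumed, not taken
// from the actual bloomapi schema.
function buildTaxonomyQuery(fragment, offset = 0, columnCount = 15) {
  // Taxonomy codes are alphanumeric; rejecting anything else keeps LIKE
  // wildcards (% and _) out of the user-supplied fragment.
  if (!/^[0-9A-Za-z]+$/.test(fragment)) {
    throw new Error('fragment must be alphanumeric: ' + fragment);
  }
  const clauses = [];
  for (let i = 1; i <= columnCount; i++) {
    const column = 'taxonomy_code_' + String(i).padStart(2, '0');
    clauses.push(column + ' LIKE $1');
  }
  return {
    text: 'SELECT npi FROM npis WHERE ' + clauses.join(' OR '),
    // "_" matches exactly one character, so offset skips leading positions.
    values: ['_'.repeat(offset) + fragment + '%']
  };
}

// Codes beginning with "10", and codes whose 3rd and 4th characters are "1Y".
console.log(buildTaxonomyQuery('10'));
console.log(buildTaxonomyQuery('1Y', 2));
```

A leading-wildcard pattern such as `__1Y%` generally cannot use an ordinary B-tree index, which is part of why this approach feels messy at scale.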

untoldone commented 11 years ago

@marks Yeah -- your first inclination was my first as well with the same problem of complexity and confusion.

Another direction is to create a new table with taxonomy codes, join it 15 times (once per taxonomy_code column), and use a WHERE to find matches (e.g. WHERE taxonomy_classification_01 = 'Behavioral Health & Social Service Providers' OR taxonomy_classification_02 = ...), but this could lead to some overly complex code.

Yet another possible solution is to import the taxonomy datasource you reference above before importing the NPI and then do the 'join' while piping the NPI to postgres during insert/ bulk load. I think this is technically the easiest solution, but probably isn't a valid technique for all datasources we'd want with the NPI.
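A rough sketch of that load-time 'join': the taxonomy rows are indexed in memory first, and each NPI record is enriched as it streams past on its way to the database. All field names (`code`, `type`, `classification`, `specialization`, `taxonomy_codes`) and the sample values are made up for illustration and do not reflect the real NUCC or NPPES column names.

```javascript
// Build an in-memory lookup from taxonomy code to its descriptive fields.
function buildTaxonomyIndex(taxonomyRows) {
  const index = new Map();
  for (const row of taxonomyRows) {
    index.set(row.code, row);
  }
  return index;
}

// Return a copy of the provider with descriptions attached for every
// taxonomy code that has a match; unknown codes are skipped.
function enrichProvider(provider, index) {
  const taxonomies = [];
  for (const code of provider.taxonomy_codes) {
    const match = index.get(code);
    if (match) {
      taxonomies.push({
        code: code,
        type: match.type,
        classification: match.classification,
        specialization: match.specialization
      });
    }
  }
  return Object.assign({}, provider, { taxonomies: taxonomies });
}

// Placeholder data only; these are not real codes or NPIs.
const taxonomyIndex = buildTaxonomyIndex([
  { code: 'CODE_A', type: 'Example Type', classification: 'Example Class', specialization: 'Example Specialty' }
]);
const enriched = enrichProvider({ npi: '0000000000', taxonomy_codes: ['CODE_A', 'CODE_B'] }, taxonomyIndex);
console.log(enriched.taxonomies);
```

The trade-off is the one noted above: the lookup has to fit in memory and be loaded before the NPI file, which may not hold for every datasource.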

I'm actually not tied to de-normalization on query vs. in the db at data-load time -- that just seemed the simplest solution at the time. I think depending on where bloomAPI goes in the future, this decision may be worth revisiting.

Let's chat more on the phone.

marks commented 11 years ago

@untoldone - are you able to convert the NUCC file to ASCII or UTF-8 so that node can read it? No matter what I try, I seem to be getting blank strings in node, and an "iconv: nucc_taxonomy.csv:2:275: cannot convert" error when I resort to converting the encoding before parsing it. The issue appears to be 'smart quotes'.
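If smart quotes do turn out to be the culprit, one possible workaround is to normalize them to plain ASCII quotes once the text is in memory. This sketch assumes the file has already been decoded into a JavaScript string; it does not address which encoding the NUCC file actually uses, and `normalizeSmartQuotes` is a hypothetical helper name.

```javascript
// Replace curly single and double quotes with their ASCII equivalents.
function normalizeSmartQuotes(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'")  // left/right single quotation marks
    .replace(/[\u201C\u201D]/g, '"'); // left/right double quotation marks
}

const sample = '\u201CCounselor\u201D and \u2018Addiction\u2019';
console.log(normalizeSmartQuotes(sample));
```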

untoldone commented 11 years ago

@marks Like loading the file to a string? Also, to make sure we are looking at the same thing: my taxonomy file came from curl -O http://nucc.org/images/stories/CSV/nucc_taxonomy_131.csv on my Mac. I had success with the following, but haven't put it into a CSV parser or anything:

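// Read the whole NUCC CSV into a string, decoding it as UTF-8.
var fs = require('fs'),
    contents = fs.readFileSync('./nucc_taxonomy_131.csv', {encoding: 'utf8'});

// Print the raw contents to inspect what was read.
console.log(contents);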

Want to send me a gchat or something if you meant something else?

marks commented 11 years ago

Yeah, I was doing something wrong, clearly. Thanks for the quick response

arthurjohnston commented 11 years ago

For flexibility, you probably want to store the provider-to-taxonomy mappings in a separate table, then add a third table where you store the categorizations you care about. So in the end, the join for non-dentists would be: ... join taxonomy_mapping tm on tm.npi = provider.npi join categorization c on c.taxonomy_id = tm.taxonomy_id where c.category_id != 122

where 122 is the code for dentists
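To show what that three-table join returns, here is a small in-memory emulation in JavaScript. The table and column names follow the comment above; the rows themselves are placeholder data, and `npisOutsideCategory` is a hypothetical name.

```javascript
// Emulates: provider JOIN taxonomy_mapping JOIN categorization
//           WHERE category_id != excludedCategoryId (with DISTINCT on npi).
function npisOutsideCategory(providers, taxonomyMapping, categorization, excludedCategoryId) {
  const result = new Set();
  for (const provider of providers) {
    for (const tm of taxonomyMapping) {
      if (tm.npi !== provider.npi) continue;
      for (const c of categorization) {
        if (c.taxonomy_id === tm.taxonomy_id && c.category_id !== excludedCategoryId) {
          result.add(provider.npi);
        }
      }
    }
  }
  return Array.from(result);
}

// Placeholder rows: provider 1 maps only to category 122, provider 2 only to
// another category, and provider 3 to both.
const providers = [{ npi: '1' }, { npi: '2' }, { npi: '3' }];
const taxonomyMapping = [
  { npi: '1', taxonomy_id: 't1' },
  { npi: '2', taxonomy_id: 't2' },
  { npi: '3', taxonomy_id: 't1' },
  { npi: '3', taxonomy_id: 't2' }
];
const categorization = [
  { taxonomy_id: 't1', category_id: 122 },
  { taxonomy_id: 't2', category_id: 7 }
];
const nonDentists = npisOutsideCategory(providers, taxonomyMapping, categorization, 122);
console.log(nonDentists);
```

Note that a provider mapped to both an excluded and a non-excluded category still appears in the result, because the filter applies per joined row rather than per provider.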

untoldone commented 9 years ago

Taxonomy codes can now be queried at http://www.bloomapi.com/api/search/nucc.hcpt. Classifications of codes can be queried at http://www.bloomapi.com/api/search/nucc.hcpt_classifications.

I will be pushing out documentation for this with the next release.