untoldone / bloomapi

Create APIs out of public datasources
https://www.bloomapi.com/documentation/public-data
MIT License

Wildcards in queries? #47

Closed tiembo closed 9 years ago

tiembo commented 10 years ago

Is it currently possible to query using wildcards? For example, searching zip code for '943*' would return practices whose zip codes start with 943.

untoldone commented 10 years ago

This isn't currently possible -- but it wouldn't be too difficult to add. The code to translate wildcards to SQL would be pretty easy, but we'd also want to play with indexes to see if a GIN index (or another) would do the trick for perf in Postgres.

What's your scenario? I'm guessing it absolutely makes sense to add support for this, but I'd like to better understand.
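As a rough illustration of the "switch wildcards to SQL" step mentioned above, here is a minimal sketch in Python. It is not BloomAPI's implementation: the function name, the `providers` table, the `practice_zip` column, and the `%s` placeholder style are all assumptions made for the example.

```python
def wildcard_to_like(term: str) -> str:
    """Translate a user-facing wildcard term (using *) into a SQL LIKE pattern."""
    # Escape characters that already have a meaning in LIKE, then map * to %.
    escaped = (
        term.replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    return escaped.replace("*", "%")


# The pattern is passed as a bound parameter, never interpolated into the SQL.
# Table name, column name, and placeholder style are hypothetical.
SQL = "SELECT npi FROM providers WHERE practice_zip LIKE %s ESCAPE '\\'"
params = (wildcard_to_like("943*"),)
print(params)
```

Escaping `%` and `_` first keeps a literal percent sign or underscore in the user's input from acting as an unintended wildcard.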

tiembo commented 10 years ago

Having wildcards for first and last name would be helpful for when the provider's name is long or difficult to spell. I guess the ultimate for searches would include misspellings ("Denis" for a Dr. Dennis) too...
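To make the misspelling case concrete ("Denis" for "Dennis"), one common approach is to compare names by edit distance. The sketch below is only an illustration of that idea in plain Python; the function names and the one-edit threshold are assumptions, and nothing in this thread says BloomAPI works this way.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_close_match(query: str, name: str, max_edits: int = 1) -> bool:
    """Case-insensitive check that two names differ by at most max_edits edits."""
    return edit_distance(query.lower(), name.lower()) <= max_edits


print(is_close_match("Denis", "Dennis"))
```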

untoldone commented 10 years ago

I see where you're going with this. Simple wildcard searching shouldn't be too hard -- but I'm less familiar with other types of full-text search in Postgres. I might have time to work on the former (wildcard) in the not too distant future. Also, if you'd like to give it a shot sooner, I could give you a quick crash course on where to look yourself.

boroth commented 10 years ago

Adding an additional scenario:

If querying for a 5-digit zip code, the API won't return values that have 9 digits. (e.g. if I search for "75081", it won't return a result that has a zip code of "750815850")

Edit: I've just gotten started using the API, and haven't done any contributing yet, but may have time to look into this if needed.

anatolyg commented 10 years ago

FWIW, I implemented a comprehensive search like this in Elasticsearch. PG is great for queries, but once you start doing wildcards and AND/LIKE clauses, the performance on non-iron suffers. Elasticsearch is a good alternative, and provides better searching with relative ease using the jdbc-river integration.

untoldone commented 10 years ago

Great timing here @anatolyg @boroth -- I'm about to start working on BloomAPI again and will be addressing search (likely with elasticsearch).

@boroth In the short term though, if this is a blocker for you using BloomAPI -- let me know and I can take a closer look at a quick interim fix for this.

boroth commented 10 years ago

I don't think it's an immediate blocker (we'll have options to search by name/phone/fax as well), but it's definitely a nice to have. What's the timeline for completing the new elasticsearch work?

untoldone commented 10 years ago

Honestly not sure at the moment -- I'm aiming to get through some other issues first + do some long term planning in the next few days and can let you know when I have a better idea.

anatolyg commented 10 years ago

If you want to check out how I modeled this in es, I can share both the mapping as well as the script to create the index. This can be automated so that the index creation/update happens as bloom API is updated


untoldone commented 10 years ago

Sure -- please do, I'd love to see how you've accomplished it.


untoldone commented 10 years ago

@boroth If you're interested, want to send me a note (http://about.me/mwasser)? I'm scoping out my next release and it would be helpful to better understand how you're using/ planning to use bloomapi.

boroth commented 10 years ago

Will do.

boroth commented 10 years ago

Basically I’m just using it for an easy way to uniquely identify healthcare providers. We’d like to use Bloom API in two different ways. First, we’re using it in our signup process as it’s an easy way to pre-populate form data with an organization’s information (basically the user just searches for their own organization/individual and we link that npi number to their account - along with some other validation). Secondly, we’d like to use it so that our users can search for non-users using the NPI database to find their contact information (for interactions outside of our application).

untoldone commented 10 years ago

@anatolyg looks like there's several different convos happening here -- but was digging through open issues and found the mapping you were talking about at https://github.com/untoldone/bloomapi/pull/55 -- thanks!

boroth commented 10 years ago

Sorry I missed your call last week, I've been out of town camping and haven't had my mobile on as often as usual.

Basically I don't have any problems with the current API except for the fact that it doesn't support partial matches. Also, it'd be nice if we could search multiple fields at once, without having to match on all fields (mostly when searching for Organizations, because they can have multiple names - e.g. Austin OMS vs Austin Oral and Maxillofacial Surgery).

I've been looking at http://docnpi.com/, and they have a similar setup to how I want to use the BloomAPI. I got some pretty neat ideas from them on how to detect relationships between individuals/organizations by using phone numbers and addresses to determine what entities exist at the same physical locations (I can identify these types of relationships with BloomAPI already, which is nice).

I haven't had many problems when searching for individuals, because it's more likely that a user can correctly enter the First/Last name of the person they're looking for, whereas a lot of organizations go by other names that may or may not be referenced in the NPI database. I may end up just pushing our users to search for individuals, and then in building the relationships they may be able to identify the correct organization as well.

All-in-all, I'm happy with what BloomAPI is doing at the moment. As I've learned more about the NPI data in general, I've gotten a better grasp on the best ways to query for data through Bloom, although the only thing I'm still missing is partial matching.

Just got back into working with it, and the other functionality I'm looking forward to is the possibility of being able to do an "OR" based search (i.e. matching any of the parameters provided, rather than all parameters).
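If search does move to Elasticsearch, as suggested earlier in the thread, an "OR" search of this kind is usually expressed as a `bool` query with `should` clauses. The sketch below only builds the request body; the helper name is made up, and the field names are assumptions that would depend on how the index is actually mapped.

```python
import json


def any_field_query(criteria: dict) -> dict:
    """Build a query body that matches documents where ANY criterion matches."""
    clauses = [{"match": {field: value}} for field, value in criteria.items()]
    # minimum_should_match=1 means at least one clause must match (OR),
    # rather than every clause (AND).
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


body = any_field_query({
    "business_name": "Austin OMS",  # hypothetical field names
    "last_name": "Austin",
})
print(json.dumps(body, indent=2))
```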

untoldone commented 10 years ago

Thanks for the detailed response.

I've recently gotten a bunch of time to spend on BloomAPI and making it solve NPI for developers so I'd be happy to try to focus on any scenarios you'd be using. I'll definitely prioritize partial matching + more complex querying (e.g. for matching some fields -- again, probably via ElasticSearch) in the coming weeks. If you have a specific timeline you're trying to get this stuff done in, I can try to make this stuff work within it.

Is there anything else that would lower barriers to usage for you? e.g. A client library in a specific lang.

A few follow-up questions if you have a second:


boroth commented 10 years ago

The fields I'd like to use to search are npi, business_name, last_name, first_name, business_address.phone/fax/zip, practice_address.phone/fax/zip. Let me give you a quick rundown of how we're planning to use the api:

In regards to your other questions:

I'm using AngularJS for our client-side app, and I have used an existing autocomplete module to do autocomplete with AJAX in other areas of our app already, so I should be able to implement that with BloomAPI fairly easily if we decide we want autocomplete (it's not high on the priority list though).

At the moment, we're only focused on using the NPI database to find organization structures and associates of individuals, but I know I've heard talks of eventually getting into tracking insurances as well so that might be something we're interested in down the line (not even really considering it yet though).

I think I can get pretty much everything done I need with BloomAPI as it is now - even when I'm making 4 queries to try and find matches the total request time stays below 1.5 seconds (and when you're dealing with healthcare software - that's not bad). With that being said, I think our goal is to have our search/identification functionality in place within the next 2-3 weeks, with probably an extra week on top of that for last minute additions, etc.

untoldone commented 10 years ago

Thanks -- that's all really helpful.

One last quick question: are the 4 queries you described at the end of your note all related to the same lookup? If so, what do the 4 queries look like (/what are they)? If it looks like a common enough pattern, maybe there's some simplifications I can make to the API to consolidate them?

Let me know if you need any other help while implementing! Would love to hear how it goes.

Michael

On Wed, Oct 22, 2014 at 7:18 AM, Bo Roth notifications@github.com wrote:

The fields I'd like to use to search are npi, business_name, last_name, first_name, business_address.phone/fax/zip, practice_address.phone/fax/zip. Let me give you a quick rundown of how we're planning to use the api:

  • User searches for an individual/organization based on name or fax number (we've found these are the most likely pieces of information for our users to have)
    • As of right now I have to have distinct input fields for org name vs individual name
  • After selecting an entity from the NPI results, we'll make a 2nd query that attempts to find potential "associates" of that entity
    • We can accomplish this by searching for all other NPI results that share a phone/fax number
    • At the moment, I have to make 4 queries to the API to get those matches (this is why I would like the more complex querying)


boroth commented 10 years ago

Yes, the 4 queries I'm making are an attempt to find associated NPI entries. Once I have an entry selected (let's call it selectedEntry), I run the following queries:

Ideally I'd like to run cross-queries as well:

but for the time being it seems like 4 gets the job done (NOTE: querying multiple times obviously gives me some duplicate results, so I have to pick out the unique entries after I've completed the queries)

I'm just trying to get a unique list of other NPI entries that share a phone/fax number. Whether or not checking business phones against practice phones is a good idea, I'm not so sure about. I worry about cases like Wal-Mart, for example, who probably has thousands of entities in the NPI database, and it's possible that large numbers of them may share the same "business"/corporate phone number. I just haven't started thinking about organizations of that scale yet (and don't know how we want to deal with them in our application yet either), so it's something that I'm sure I'll feel more comfortable about after working with the NPI dataset some more.
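The four same-field lookups plus the de-duplication step described here can be sketched roughly as follows. This assumes a caller-supplied `search(field, value)` function that returns matching entries as dicts with an `npi` key; that function, and the flat field names, are hypothetical stand-ins rather than BloomAPI's actual client interface.

```python
from typing import Callable, Iterable

# The four contact fields compared against the same field on other entries.
CONTACT_FIELDS = (
    "practice_address_phone",
    "practice_address_fax",
    "business_address_phone",
    "business_address_fax",
)


def find_associates(
    selected: dict,
    search: Callable[[str, str], Iterable[dict]],
) -> list:
    """Run one lookup per contact field and merge the results by NPI."""
    seen = {selected["npi"]}  # never report the selected entry itself
    associates = []
    for field in CONTACT_FIELDS:
        value = selected.get(field)
        if not value:
            continue  # skip fields the selected entry leaves blank
        for entry in search(field, value):
            if entry["npi"] not in seen:
                seen.add(entry["npi"])
                associates.append(entry)
    return associates
```

Tracking seen NPIs while the results come in avoids a separate pass to pick out unique entries afterwards. The cross-queries would need a second loop pairing each field with its counterpart.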

anatolyg commented 10 years ago

Bo, also look at the authorized official, this person has a phone number as well, and I’ve been able to link that phone number to large hospitals, useful for aggregating departments. I have this use case as well, and found elastic search to be basically superior. In the second example, where you want to do cross queries, that translates into a query_string query in ES, like so:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

{ "query_string" : { "fields" : ["practice_address_phone", "business_address_phone", "practice_address_fax", "business_address_fax", "authorized_person_phone"], "query" : "2125551212" } }

As far as the Wal-Mart use case goes, it's an interesting one. I think there are about 8150 Walgreens pharmacies in the NPI dataset. At least for these, the BUSINESS phone numbers reflect the headquarters, while the PRACTICE numbers reflect the location. I found this to be a good rule of thumb.

A
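For anyone building the `query_string` body above programmatically, a small Python sketch follows. The field list is copied from the example in this comment; the helper name and the choice to strip the input down to digits are assumptions for illustration.

```python
import json

PHONE_FIELDS = [
    "practice_address_phone",
    "business_address_phone",
    "practice_address_fax",
    "business_address_fax",
    "authorized_person_phone",
]


def phone_lookup_query(number: str) -> dict:
    """Build a query_string body that looks for one number in every phone field."""
    # Keep digits only, so "(212) 555-1212" and "2125551212" behave the same
    # and no punctuation is interpreted as query_string syntax.
    digits = "".join(ch for ch in number if ch.isdigit())
    return {"query": {"query_string": {"fields": PHONE_FIELDS, "query": digits}}}


print(json.dumps(phone_lookup_query("(212) 555-1212")))
```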

On Oct 22, 2014, at 9:28 AM, Bo Roth notifications@github.com wrote:

Yes, the 4 queries I'm making are an attempt to find associated NPI entries. Once I have an entry selected (let's call it selectedEntry), I run the following queries:

  • practice_address_phone = selectedEntry.practice_address_phone
  • practice_address_fax = selectedEntry.practice_address_fax
  • business_address_phone = selectedEntry.business_address_phone
  • business_address_fax = selectedEntry.business_address_fax

Ideally I'd like to run cross-queries as well:

  • practice_address_phone = selectedEntry.business_address_phone
  • business_address_phone = selectedEntry.practice_address_phone
  • practice_address_fax = selectedEntry.business_address_fax
  • business_address_fax = selectedEntry.practice_address_fax

but for the time being it seems like 4 gets the job done (NOTE: querying multiple times obviously gives me some duplicate results, so I have to pick out the unique entries after I've completed the queries)


boroth commented 10 years ago

I hadn't thought about using the authorized official number as a way to link them, and I'll probably add that as an additional query for the moment.

I'm already pulling unique addresses out of the practice addresses in order to pre-populate some location-based data in our forms, so that pretty much confirms what I was thinking - thanks!

@anatolyg are you just using your own personal installation of bloom with ES integrated? Or do you have the NPI database downloaded on your own with ES completely separate from bloom?

anatolyg commented 10 years ago

I am hosting the BloomAPI internally. It’s a data source within my system. We host our own bloom but do not use its API — I found that ES is faster than PG for most cases, and allows for more detailed search queries. We have our own search API that uses an ES index that’s made up of Bloom and a number of other datasources, merged together into a single searchable index. Here’s an example of the JSON document in this index. Notice a few things:

  1. we have all the taxonomies in a single data structure, and not in separate fields. This lets us do much better multi-speciality searching using “TERMS” query in ES
  2. same for other_ids (we don’t have a huge use case for these yet, and I found them to be somewhat unreliable for matching to medicare datasets)
  3. broke out the PRIMARY taxonomy into its own field. This is a personal preference. We could have gone w/ a more complex data structure for taxonomies, but decided to make it simpler

    {
      "_index":"providers_v5",
      "_type":"provider",
      "_id":"e390db71-0c64-4821-8cf1-5f4b74d50614",
      "_score":1,
      "provider_last_name_legal_name":null,
      "provider_first_name":null,
      "provider_organization_name_legal_business_name":"WALGREEN CO.",
      "provider_other_organization_name":"WALGREEN #10548",
      "parent_organization_lbn":"WALGREEN CO",
      "last_update_date":"2014-02-03T00:00:00.000Z",
      "provider_enumeration_date":"2007-12-05T00:00:00.000Z",
      "npi_deactivation_date":null,
      "npi_reactivation_date":null,
      "entity_type_code":2,
      "is_organization_subpart":"Y",
      "is_sole_proprietor":null,
      "provider_gender_code":null,
      "provider_first_line_business_practice_location_address":"549 HOOSICK ST",
      "provider_second_line_business_practice_location_address":null,
      "provider_business_practice_location_address_city_name":"TROY",
      "provider_business_practice_location_address_state_name":"NY",
      "provider_business_practice_location_address_postal_code":"121802105",
      "provider_business_practice_location_address_telephone_number":"5182745080",
      "provider_business_practice_location_address_fax_number":null,
      "provider_first_line_business_mailing_address":"1901 E VOORHEES ST",
      "provider_second_line_business_mailing_address":"M/S 720",
      "provider_business_mailing_address_city_name":"DANVILLE",
      "provider_business_mailing_address_state_name":"IL",
      "provider_business_mailing_address_postal_code":"618344509",
      "provider_business_mailing_address_fax_number":"2177092344",
      "provider_business_mailing_address_telephone_number":"2177092386",
      "authorized_official_last_name":"CRAWFORD",
      "authorized_official_first_name":"KERMIT",
      "authorized_official_title_or_position":"PRESIDENT",
      "authorized_official_telephone_number":"8473153154",
      "primary_taxonomy":"333600000X",
      "source_ids":{
        "openpayments_teaching_hospital_id":null,
        "npi":1740462357,
        "openpayments_physician_id":null,
        "cms_provider_num":null
      },
      "counts":{
        "prescriptions":0,
        "charges":{
          "phys_procedures":6,
          "inpatient_procedures":0,
          "outpatient_procedures":0
        },
        "referrals":5,
        "openpayments":{
          "physician_general_payments":0,
          "physician_ownership":0,
          "physician_research_payments":0,
          "hospital_research_payments":0,
          "hospital_general_payments":0
        }
      },
      "taxonomy":[
        "0",
        "332B00000X",
        "333600000X",
        "3336C0003X"
      ],
      "taxonomy_group":"0",
      "other_ids":[
        {
          "id":"0282936152",
          "issuer":null,
          "state":"NY",
          "type":"07"
        },
        {
          "id":"PHC049",
          "issuer":null,
          "state":null,
          "type":"08"
        },
        {
          "id":"",
          "issuer":"",
          "state":"",
          "type":""
        },
        {
          "id":"02960185",
          "issuer":null,
          "state":"NY",
          "type":"05"
        },
        {
          "id":"3357022",
          "issuer":"NCPDP",
          "state":null,
          "type":"01"
        },
        {
          "id":"P00400633",
          "issuer":null,
          "state":null,
          "type":"08"
        }
      ]
    }

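Keeping every taxonomy code in one array field is what makes the multi-specialty "TERMS" search mentioned in point 1 straightforward. A hedged sketch of such a request body, plus a plain-Python check that mirrors its any-of semantics, is shown below; the helper names are made up, and the `taxonomy` field name is taken from the document above.

```python
def taxonomy_terms_query(codes) -> dict:
    """Build a terms query matching providers that carry ANY of the given codes."""
    return {"query": {"terms": {"taxonomy": sorted(set(codes))}}}


def has_any_taxonomy(document: dict, codes) -> bool:
    """Local equivalent of the terms query, for reasoning about what it matches."""
    return bool(set(document.get("taxonomy", [])) & set(codes))


doc = {"taxonomy": ["0", "332B00000X", "333600000X", "3336C0003X"]}
print(taxonomy_terms_query(["333600000X", "332B00000X"]))
print(has_any_taxonomy(doc, ["333600000X"]))
```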
boroth commented 10 years ago

I'm going to make an optimistic assumption that the 502 bad gateway I'm getting on http://www.bloomapi.com/ means you're currently working on the api? :)

untoldone commented 10 years ago

Thanks for reporting this! It wasn't supposed to be down :). Initial investigation shows it went down about 2 hours ago (10:39am PST). Opened #57 to fix the issue that caused the crash and turned #40 into a bug to ensure this type of crash doesn't occur in the future.

boroth commented 10 years ago

Awesome, I figured it wasn’t intentional :D=

boroth commented 9 years ago

Hey fellas, just wondering if elasticsearch (or wildcard queries) has made any progress? I just got back into the NPI world to setup a query tool, so figured I'd check in and see where things are at or if I can be of help anywhere.

I don't have any experience with Elastic Search, but would love to install it and give it a shot on my local installation of BloomAPI.

untoldone commented 9 years ago

Getting ready to push out new code that uses ElasticSearch on the backend rather than just PostgreSQL (it's actually already running at www.bloomapi.com). While this doesn't directly expose wildcard searches, it makes them very, very easy to add.

Before adding a feature I won't be able to take back once released (since it will become a feature people start to depend on), I also have a second branch that adds 'zip5' and 'zip_plus4' fields to search on. Would this be enough to solve your problem? This code is ready to go right now but just not deployed yet.
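To show how 'zip5' and 'zip_plus4' fields would address the 5-digit vs 9-digit zip mismatch raised earlier in the thread, here is a minimal sketch of the split. The field names come from the comment above; the function itself and its validation rules are assumptions, not the code on that branch.

```python
def split_zip(postal_code: str) -> dict:
    """Split a 5- or 9-digit US postal code into zip5 and zip_plus4 parts."""
    digits = "".join(ch for ch in postal_code if ch.isdigit())
    if len(digits) not in (5, 9):
        raise ValueError(f"expected 5 or 9 digits, got {postal_code!r}")
    # zip_plus4 is None when only the 5-digit part is present.
    return {"zip5": digits[:5], "zip_plus4": digits[5:] or None}


print(split_zip("750815850"))
```

With the parts stored separately, a search for "75081" can be an exact match on `zip5` and still find the record stored as "750815850".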

boroth commented 9 years ago

I'm not really using zip fields at the moment, we're primarily going to try and use phone/fax numbers as our main search, with name as a backup (using wildcards for trying to match org and individual names with the same term).

We're struggling a bit to decide on the best (easiest) field or attribute to ask our users for in order to find the individual or organization they're trying to locate. Ideally we'd like to keep it to a single field on the frontend of our application, but we could also perform some sort of logic/matching in order to format the api request before sending (similar to what you're doing in the test app for BloomAPI).
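One way to keep a single input field on the frontend is to inspect the term and decide which fields to query before calling the API. The sketch below is only an illustration under simple assumptions: ten digits could be an NPI or a phone/fax number, anything else is treated as a name. The function name is hypothetical and the field names follow the list given earlier in the thread.

```python
import re


def build_search_params(term: str) -> list:
    """Return candidate (field, value) pairs to try for one free-text input."""
    digits = re.sub(r"\D", "", term)
    has_letters = any(ch.isalpha() for ch in term)
    if not has_letters and len(digits) == 10:
        # Ten digits are ambiguous (NPI vs phone/fax), so try each field.
        return [
            ("npi", digits),
            ("practice_address.phone", digits),
            ("practice_address.fax", digits),
            ("business_address.phone", digits),
            ("business_address.fax", digits),
        ]
    # Anything else is treated as a name; org and individual names share one term.
    name = term.strip()
    return [("business_name", name), ("last_name", name)]


print(build_search_params("(512) 555-0100"))
```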

untoldone commented 9 years ago

Do you have an idea of what the best case scenario is separate from BloomAPI?

So if I were to try to write that myself, I'd probably do a few different things:

In terms of implementing the above scenarios, I'd definitely consider using ElasticSearch directly (either in your own BloomAPI deployment or otherwise -- code to be published really soon) as I'm not sure the current API implementation will get you 100% there. See things such as: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/partial-matching.html and http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_query_time_search_as_you_type.html (great for autocomplete)

If you want help getting this scenario running, I'd be happy to help a bit more to make sure it's doable. I'd add the above to the API, but I'm worried people would abuse it in the public API, or that it would be costly to run without charging, so I'd want to put more thought into it before adding the functionality.

boroth commented 9 years ago

That's exactly what I was thinking. Didn't mean to imply that you should add it to the API, just thinking out loud.

Should I just install ElasticSearch from scratch and set it up with the current version of BloomAPI, or is ElasticSearch going to be part of the install after you push out the new updates?

anatolyg commented 9 years ago

The way I did it was to create a view, called search, which is a materialized view of BloomAPI data (plus some more stuff, but it doesn't matter). Then I used a stock Elasticsearch install w/ JDBC River to pull data from that view into ES. The key is to use the format described in the river so that you have nested objects and all that good stuff, so it's easy to search by various options (terms are great for taxonomies, wildcards/fuzzy/query for names, match for zip code). Here are the instructions: https://github.com/jprante/elasticsearch-river-jdbc

untoldone commented 9 years ago

@boroth New code about to be published -- this will import docs into elasticsearch for you from a normalized version of the NPI data.

untoldone commented 9 years ago

Just opened a new bug to track the issue described here. Hopefully that bug does a better job capturing what people have been asking for as I'm not entirely convinced just adding wildcards is the right answer long term. Let me know if this seems incorrect or inaccurate.

@boroth Just published the new code that uses ElasticSearch. Tried to document it in the deploy section of the documentation page + the contribute pages currently online.

boroth commented 9 years ago

Sounds good. Being able to customize queries in ElasticSearch should make it easier for any schmuck like myself to write in his own wildcards, if necessary :P