untoldone / bloomapi

Create APIs out of public datasources
https://www.bloomapi.com/documentation/public-data
MIT License
89 stars 29 forks source link

Ensures that an unreadable character doesn't abort the entire import #55

Closed anatolyg closed 9 years ago

anatolyg commented 10 years ago

This happened with a weekly import in April. There was a single row with an invalid character, and the entire import would fail because of this single row. I simply commented out the failure and instead logged it to console.

untoldone commented 10 years ago

@anatolyg Thanks for the patch. I'm currently working on a relatively large rewrite and will make sure this gets rolled in to that. Do you know which weekly diff had this line error in it for me to download and test with?

anatolyg commented 10 years ago

@untoldone I'll find it, it was I believe the March 19th update, but can't see it listed anymore. Btw, if you're open to it, I have some thoughts about the API. I was thinking Elasticsearch would be a good way to go for the search part of the API, especially if you are thinking of adding additional data sources. It has a lot of great functionality for faceted searching and other data aggregations that are useful here. I was thinking of adding at least the newly released CMS data.

If you're redoing the import code, I'd love to take use it / test it for the CMS upload. Let me know!

untoldone commented 10 years ago

@anatolyg I've definitely had interest in building in ElasticSearch in the short term. Would love to hear more about how you'd use it in applications/ directly etc.

The new CMS medicare payment data was one of a number of data sources that should be really easy to import with the new version currently in the works.

anatolyg commented 10 years ago

@untoldone great! As far as ES - one thing I wanted to leverage immediately is its faceted search capability. Taxonomies, geo-location of contacts and locations, identifiers (CMS, etc) contained within the data could be used as facets to provide a greater ability to drill down into the data. My use case for bloomapi is to exploit the facets to allow to much more rich search capabilities.

I tend to think of my APIs as two sides - read and write, where read is ES, and write is whatever (PG, mongodb, etc), with a river going to ES. This way, you can keep the readAPI consistent with the way the data is used without worrying about how you write it.

For import, I think leveraging the river functionality is probably a good first run at it. As a river, you can define a query that will export data from PG and import it directly into ES.

Here's my first stab at the index:

{
    "settings":{
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        },
        "analysis":{
            "analyzer":{
                "autocomplete":{
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
                },
                "post_title": {
                    "filter": ["standard", "lowercase", "asciifolding"],
                    "type": "custom",
                    "tokenizer": "standard"
                }
            },
            "filter":{
                "ngram":{
                    "type":"edgeNGram",
                    "min_gram":2,
                    "max_gram":15
                }
            }
        }
    },
    "mappings" : {
        "npi" : {
            "properties" : {
                "npi" : { "type" : "long"},
                "replacement_npi": {"type":"long"},
                "entity_type": {"type": "integer"},
                "ein": {"type":"string"},
                "deactivation_reason_code": {"type":"string"},

                "enumerated": {"type":"date", "format":"MM/dd/YYYY"},
                "updated": {"type":"date", "format":"MM/dd/YYYY"},
                "deactivated":{"type":"date", "format":"MM/dd/YYYY"},
                "reactivated":{"type":"date", "format":"MM/dd/YYYY"},

                "sole_proprietor": {"type":"boolean"},
                "gender": {"type":"string"},
                "parent_organization": {
                    "type":"object",
                    "properties": {
                        "legal_business_name": {"type":"string"},
                        "tin": {"type":"string"}
                    }
                },
                "persons": {
                    "type":"object",
                    "properties": {
                        "primary": {
                            "type": "object",
                            "properties": {
                                "first_name": {
                                    "type":"multi_field",
                                    "path":"just_name",
                                    "fields":{
                                        "first_name":{
                                            "type":"string"
                                        },
                                        "post_title":{
                                            "type":"string",
                                            "analyzer":"post_title",
                                            "similarity":"BM25"
                                        },
                                        "autocomplete":{
                                            "analyzer":"autocomplete",
                                            "type":"string"
                                        }
                                    }
                                },
                                "last_name": {
                                    "type":"multi_field",
                                    "path":"just_name",
                                    "fields":{
                                        "last_name":{
                                            "type":"string"
                                        },
                                        "post_title":{
                                            "type":"string",
                                            "analyzer":"post_title",
                                            "similarity":"BM25"
                                        },
                                        "autocomplete":{
                                            "analyzer":"autocomplete",
                                            "type":"string"
                                        }
                                    }
                                },
                                "middle_name": {"type":"string"},
                                "prefix": {"type":"string"},
                                "suffix": {"type":"string"},
                                "credential": {"type":"string"},
                                "post_title": {"type":"string"},
                                "telephone": {"type":"string"}
                            }
                        },
                        "other": {
                            "type": "object",
                            "properties": {
                                "first_name": {
                                    "type":"multi_field",
                                    "path":"just_name",
                                    "fields":{
                                        "first_name":{
                                            "type":"string"
                                        },
                                        "post_title":{
                                            "type":"string",
                                            "analyzer":"post_title",
                                            "similarity":"BM25"
                                        },
                                        "autocomplete":{
                                            "analyzer":"autocomplete",
                                            "type":"string"
                                        }
                                    }
                                },
                                "last_name": {
                                    "type":"multi_field",
                                    "path":"just_name",
                                    "fields":{
                                        "last_name":{
                                            "type":"string"
                                        },
                                        "post_title":{
                                            "type":"string",
                                            "analyzer":"post_title",
                                            "similarity":"BM25"
                                        },
                                        "autocomplete":{
                                            "analyzer":"autocomplete",
                                            "type":"string"
                                        }
                                    }
                                },
                                "middle_name": {"type":"string"},
                                "prefix": {"type":"string"},
                                "suffix": {"type":"string"},
                                "credential": {"type":"string"},
                                "post_title": {"type":"string"},
                                "telephone": {"type":"string"}
                            }
                        },
                        "authorized_official": {
                            "type": "object",
                            "properties": {
                                "first_name": {
                                    "type":"multi_field",
                                    "path":"just_name",
                                    "fields":{
                                        "first_name":{
                                            "type":"string"
                                        },
                                        "post_title":{
                                            "type":"string",
                                            "analyzer":"post_title",
                                            "similarity":"BM25"
                                        },
                                        "autocomplete":{
                                            "analyzer":"autocomplete",
                                            "type":"string"
                                        }
                                    }
                                },
                                "last_name": {
                                    "type":"multi_field",
                                    "path":"just_name",
                                    "fields":{
                                        "last_name":{
                                            "type":"string"
                                        },
                                        "post_title":{
                                            "type":"string",
                                            "analyzer":"post_title",
                                            "similarity":"BM25"
                                        },
                                        "autocomplete":{
                                            "analyzer":"autocomplete",
                                            "type":"string"
                                        }
                                    }
                                },
                                "middle_name": {"type":"string"},
                                "prefix": {"type":"string"},
                                "suffix": {"type":"string"},
                                "credential": {"type":"string"},
                                "post_title": {"type":"string"},
                                "telephone": {"type":"string"}
                            }
                        }
                    }
                },
                "contact": {
                    "type": "object",
                    "properties": {
                        "mailing": {
                            "type": "object",
                            "properties": {
                                "address": {
                                    "type": "object",
                                    "properties": {
                                        "street1": {"type":"string"},
                                        "street2": {"type":"string"},
                                        "city": {"type":"string"},
                                        "state": {"type":"string"},
                                        "postal_code": {"type":"string"},
                                        "country": {"type":"string"}
                                    }
                                },
                                "telephone": {"type":"string"},
                                "fax": {"type":"string"}
                            }
                        },
                        "practice": {
                            "type": "object",
                            "properties": {
                                "address": {
                                    "type": "object",
                                    "properties": {
                                        "street1": {"type":"string"},
                                        "street2": {"type":"string"},
                                        "city": {"type":"string"},
                                        "state": {"type":"string"},
                                        "postal_code": {"type":"string"},
                                        "country": {"type":"string"}
                                    }
                                },
                                "telephone": {"type":"string"},
                                "fax": {"type":"string"}
                            }
                        }
                    }
                },
                "taxonomies": {
                    "type": "nested",
                    "include_in_root": "true",
                    "properties": {
                        "code": {"type":"string"},
                        "group": {"type":"string"},
                        "state_code": {"type":"string"},
                        "primary": {"type":"boolean"}
                    }
                },
                "provider_identifiers": {
                    "type": "nested",
                    "include_in_root": "true",
                    "properties": {
                        "id": {"type":"string"},
                        "type_code": {"type":"string"},
                        "state": {"type":"string"},
                        "issuer": {"type":"string"}
                    }
                }
            }
        }
    }
}

For other data, I am eager to see what you've got cooking. Specifically, one thing that you might want to consider is a way to define "rivers" automatically in the configuration for the data import. This way you can set up the DB import and automatically get the index from it.

untoldone commented 10 years ago

@anatolyg awesome stuff! -- I think it would be really cool to get to a point where this type of import definition can be generated where possible and manually created where needed. I've gotten started on a schema to describe tables + import/ update tasks on the vnext branch. For example: https://github.com/untoldone/bloomapi/blob/vnext/sources/npi/index.yml would define the import + update of the NPI and https://github.com/untoldone/bloomapi/blob/vnext/sources/hptc/index.yml would import taxonomies. I'm also looking to continue this 'source' (or 'river') implementation into visual presentation + search tools as well on the website.

anatolyg commented 10 years ago

@untoldone I didn't realize you've been pushing your changes. I'll check it out and start poring over. I agree re: import definition auto-generation. I think a good step towards that is to use the already-established pattern of mustache templates or yml files to create these indexes (or whatever the server demands). Setting up the automated rivers could be done via an executable bootstrap shell script for developer ergonomics.

As far as Postgres - are you still planning on that being the system of record? How do you want to handle relationships between various sources? For NPI it's easy, but a lot of CMS data is keyed on the "provider number" or "provider_id" which CAN BE but are not always in NPI "Other Identifiers" fields. As of right now, those are not indexed.

One awesome thing about ES is that there is ability to query across indexes, so creating an easy to use search UI to do that should be possible.

untoldone commented 10 years ago

Yep -- its a work in progress and while all the code I'm checking in generally works -- it's all a draft. Totally on the same page with you in regard to the dev tool. I've been planning on creating a new command line tool that can be used to do all data manipulation. e.g. bloomapi create:npi or bloomapi update:something etc.

I have been planing on keeping relational dbs as the system of record. The plan was to have elastic search populated via the db in some standard way.

With the new changes, bloomapi will also be normalizing data on import and will have code to join tables automatically for the api results -- this will make cross-source joining a simpler task.

anatolyg commented 10 years ago

Auto-import into ES via Rivers seems to make sense to me. Each source can define its own river with its own index and even mappings for the index.

As far as joining - have you considered outsourcing the read API to Elasticsearch wholesale? In other words, the write part of the system stays as is (Create tables/drop tables, import data). But the read part of the API becomes a layer over ES with some metadata regarding the key/id linking each source.

Super rough idea: GET /api/sources --> gets you all sources GET /api/npi/sources --> gets you all sources that can be queried via NPI GET /api/npi,cms?q=npi:10120912 --> gets you the NPI record for this NPI, plus any records in CMS matching the NPI (need to ensure that CMS NPI field is indexed and named same as in the NPI index)

untoldone commented 10 years ago

Not a bad idea. I haven't gotten to that part of the refactor yet and definitely am inexperienced with ES, but will try to learn more and review.

anatolyg commented 10 years ago

How do you feel about JSHint and Gulp/Grunt/Make?

untoldone commented 10 years ago

Heh -- JSHint has been on my todo list and I'm not averse to using Gulp/Grunt.

anatolyg commented 10 years ago

@untoldone maybe I can take that on? what do you think?

untoldone commented 10 years ago

Sounds awesome -- want to branch off of vnext?

anatolyg commented 10 years ago

yep, I've already set it up as an upstream, so I am gtg.

untoldone commented 9 years ago

Thanks for the patch, but with the rewrite in Go, this no longer makes sense to merge