hackla-engage / engage-backend

Apache License 2.0

Create flow to index new agenda items to Elasticsearch index #174

Open TeddyCr opened 5 years ago

TeddyCr commented 5 years ago

Context

Engage is developing search functionality to improve the user experience. We'll use Elasticsearch to let users narrow down agenda items to a specific topic.

Links

Dependencies

To-Dos

* A backend endpoint should be added to the backend.engage.town/api/.... part of the site. The structure of the JSON data passed to the Elasticsearch API should be as follows:

{
    "mappings": 
        {
            "properties":
                {
                    "date": 
                        {
                            "type": "date"
                        },
                    "title": 
                        {
                            "type": "text",
                            "index": true,
                            "index_phrases": true
                        },
                    "recommendations": 
                        {
                            "type": "text",
                            "index": true,
                            "index_phrases": true
                        },
                    "body": 
                        {
                            "type": "text",
                            "index": true,
                            "index_phrases": true
                        },
                    "department": 
                        {
                            "type": "keyword"
                        },
                    "sponsors": 
                        {
                            "type": "text"
                        },
                    "tags": 
                        {
                            "type": "keyword"
                        }

                }
        }
}
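
The mapping above can also be built and sanity-checked in Python before it is sent to Elasticsearch. This is only a minimal sketch: the function name build_agenda_mapping is hypothetical, and the call that actually creates the index is left out because the thread does not say which client or endpoint will be used.

```python
import json

# Text fields that should support phrase search, per the mapping above.
PHRASE_FIELDS = ("title", "recommendations", "body")


def build_agenda_mapping():
    """Return the index-creation body for agenda items as a plain dict.

    The function name is hypothetical; the field names and types come
    from the mapping proposed in this issue.
    """
    properties = {
        "date": {"type": "date"},
        "department": {"type": "keyword"},
        "sponsors": {"type": "text"},
        "tags": {"type": "keyword"},
    }
    for field in PHRASE_FIELDS:
        properties[field] = {
            "type": "text",
            "index": True,
            "index_phrases": True,
        }
    return {"mappings": {"properties": properties}}


if __name__ == "__main__":
    # Serializing through json guarantees the payload is valid JSON.
    print(json.dumps(build_agenda_mapping(), indent=4))
```

Building the body as a dict and serializing it with json.dumps avoids hand-written JSON syntax slips such as a misplaced quote around a key.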

New to the Project?

Check out our product documentation repo.

eselkin commented 5 years ago

I'd also consider adding a redundant tags array (array of strings). It may be better if we use the elasticsearch indices as source for analysis and updating of content rather than the PostgreSQL.

eselkin commented 5 years ago

Also, date should be a date object, not a string.

eselkin commented 5 years ago

Tags should be a keyword datatype, and title, department, recommendations, and body should be text. However, title, recommendations, and body should also have index and index_phrases set to true.

A mapping might look like:

{
    "mappings": 
        {
            "properties":
                {
                    "date": 
                        {
                            "type": "date"
                        },
                    "title": 
                        {
                            "type": "text",
                            "index": true,
                            "index_phrases": true
                        },
                    "recommendations": 
                        {
                            "type": "text",
                            "index": true,
                            "index_phrases": true
                        },
                    "body": 
                        {
                            "type": "text",
                            "index": true,
                            "index_phrases": true
                        },
                    "department": 
                        {
                            "type": "text"
                        },
                    "sponsors": 
                        {
                            "type": "text"
                        },
                    "tags": 
                        {
                            "type": "keyword"
                        }

                }
        }
}

Values for tags can be inserted as an array, and each element will be indexed as a keyword.
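
To illustrate the two points about date and tags, here is a rough sketch of how an agenda item could be turned into a document for this mapping. The helper name to_es_document and its parameters are made up for the example. JSON has no date type, so the sketch sends an ISO 8601 string, which a date-typed field accepts under the default date format.

```python
from datetime import date


def to_es_document(title, meeting_date, tags, body="", recommendations=""):
    """Build a document matching the mapping above (hypothetical helper).

    meeting_date is a datetime.date; it is serialized as an ISO 8601
    string so the "date" field is parsed as a date, not free text.
    tags is any iterable of strings; it is sent as a JSON array and
    each element is indexed as its own keyword.
    """
    return {
        "title": title,
        "date": meeting_date.isoformat(),
        "tags": list(tags),
        "body": body,
        "recommendations": recommendations,
    }


if __name__ == "__main__":
    doc = to_es_document(
        "Example agenda item",
        date(2019, 8, 13),
        ["housing", "parking"],
    )
    print(doc)
```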

TeddyCr commented 5 years ago

What do you think about making department a keyword type as well?

"It may be better if we use the elasticsearch indices as source for analysis and updating of content rather than the PostgreSQL"

Do you mean fetching agenda items on the backend directly from Elasticsearch as opposed to Postgres?

eselkin commented 5 years ago

I'm not sure what I mean about the analysis yet, because we don't do any for the tagging yet, but when we do, we'd like to pull from a richer source like elasticsearch.

department is an indexed text field, it probably doesn't need the reference architecture of a keyword... but you can do either.
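
One way to see the practical difference between the two choices is in how a search request would be written. The sketch below assumes department is mapped as keyword; build_search_query and its parameters are hypothetical names. With a text mapping and the standard analyzer, the stored tokens are lowercased and split, so an exact term filter on a value like "Public Works" would likely not match and a match query would be needed instead.

```python
def build_search_query(phrase, department=None, tags=()):
    """Build a search body for agenda items (hypothetical helper).

    - match_phrase runs against the analyzed "title" text field.
    - term filters do exact, unanalyzed matching, which suits
      keyword fields such as "tags" (and "department" if it is
      mapped as keyword, as assumed here).
    """
    filters = []
    if department:
        filters.append({"term": {"department": department}})
    for tag in tags:
        filters.append({"term": {"tags": tag}})
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"title": phrase}}],
                "filter": filters,
            }
        }
    }


if __name__ == "__main__":
    print(build_search_query("bike lanes", "Public Works", ["transportation"]))
```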

eselkin commented 5 years ago

Actually, I was rethinking the organization of the mappings (if we choose to do something like what Bonnie was suggesting with user identification of tags and some predefined users aliasing and tailoring the tags).

TeddyCr commented 4 years ago

The agenda item ingestion flow should be added to the engage-scrapper library as a separate module (esutils.py), as opposed to the celery task.
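
As a starting point, a module like esutils.py could expose a helper that turns scraped agenda items into a request body for the Elasticsearch bulk API. This is only a sketch under assumptions: the function name build_bulk_body, the "id" key on each item, and the index name used in the example are not from this thread, and sending the body to the cluster is left out.

```python
import json


def build_bulk_body(index_name, items):
    """Return a newline-delimited JSON body for the _bulk endpoint.

    Each item is a dict with a hypothetical "id" key plus the fields
    from the mapping. The bulk format is one action line followed by
    one source line per document, and the body must end in a newline.
    """
    lines = []
    for item in items:
        action = {"index": {"_index": index_name, "_id": item["id"]}}
        source = {key: value for key, value in item.items() if key != "id"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"id": 1, "title": "Example item", "tags": ["housing"]}]
    print(build_bulk_body("agenda_items", sample), end="")
```

Keeping the payload construction separate from the network call would let the scraper module be tested without a running cluster.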