nasa-jpl-memex / GeoParser

Extract and visualize locations from any file
Apache License 2.0

Elasticsearch integration #57

Closed aalavandhan closed 4 years ago

aalavandhan commented 8 years ago

I have a set of Elasticsearch documents with lat/long information.

What is the best strategy to connect to Elasticsearch and get the UI working?

MBoustani commented 8 years ago

@nithinkrishna, the current version of GeoParser does not support ES. Do you have your data in Solr by any chance?

aalavandhan commented 8 years ago

No, but I would like to add Elasticsearch integration. How would I go about doing this?

smadha commented 8 years ago

Hi @nithinkrishna ,

  1. Add a check for Elasticsearch at this line: #L257
  2. Create Elasticsearch queries analogous to the Solr ones at the two lines below (a rough sketch follows this list).
    • This one gets the total number of docs in your ES index: #L274
    • This one fetches docs in batches from your ES index: #L297

It would be great if you could add this to GeoParser so that it supports both Solr and Elasticsearch.
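
For concreteness, a minimal sketch of what those two ES queries might look like, assuming the plain HTTP API via the requests library (the index URL is a placeholder and the function names are illustrative, not GeoParser's):

    import requests

    ES_INDEX_URL = "http://localhost:9200/myindex/mytype"  # placeholder

    def es_total_docs(index_url):
        # counterpart of the Solr numFound lookup: total docs in the ES index
        return requests.get(index_url + "/_count").json()["count"]

    def es_fetch_batch(index_url, start, rows):
        # counterpart of the Solr batched fetch: docs [start, start + rows)
        resp = requests.get(index_url + "/_search",
                            params={"from": start, "size": rows})
        return [hit["_source"] for hit in resp.json()["hits"]["hits"]]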

Thanks

smadha commented 8 years ago

Hi @nithinkrishna, did it work? Let me know if you need more details.

@chrismattmann @MBoustani

aalavandhan commented 8 years ago

Sure, I'm going to look into it tomorrow. I'll let you know if I have questions. Thank you.


chrismattmann commented 8 years ago

Thanks @nithinkrishna and @smadha for working on this.

aalavandhan commented 8 years ago

@smadha @MBoustani

Correct me if I'm wrong. The function query_crawled_index in views.py extracts data from a Solr index, runs the geo-topic parser on the values to identify locations, and dumps the results into a local Solr index. The visualization then reads from the local Solr index and plots the map.

This seems a little roundabout, right? I understand that the formats of various inverted indices might differ, but we should be able to connect directly to Solr/Elasticsearch indices that already contain location data.

Method 1: Here is an example from my Elasticsearch index: http://104.236.190.155:9200/polar/application-pdf/_search?from=0&size=1

As you can see, my object has a key called geo, which contains the results of running Tika's geo-topic parser. Will I be able to configure the visualizations to hit Elasticsearch directly and fetch the locations?

Method 2: I write a separate script that queries Elasticsearch and loads the locations into the local Solr index in the format the visualizations support. This approach seems cleaner. What is the format in which data needs to be pushed into the local Solr index? I'm assuming IndexCrawledPoints is where this happens.
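
Here is a minimal sketch of what such a script could look like, assuming the geo results live under a geo key in each ES document and that the local Solr core accepts an id plus a points field (the Solr core name and field names are guesses, not GeoParser's actual schema):

    import json
    import requests

    ES_SEARCH_URL = "http://104.236.190.155:9200/polar/application-pdf/_search"
    SOLR_UPDATE_URL = "http://localhost:8983/solr/test_1/update?commit=true"  # hypothetical core

    def migrate(batch_size=100):
        start = 0
        while True:
            # page through the ES index with from/size
            hits = requests.get(ES_SEARCH_URL,
                                params={"from": start, "size": batch_size}
                                ).json()["hits"]["hits"]
            if not hits:
                break
            # reshape each ES hit into a doc for the local Solr index
            docs = [{"id": hit["_id"],
                     "points": json.dumps(hit["_source"].get("geo", []))}
                    for hit in hits]
            requests.post(SOLR_UPDATE_URL, json=docs)  # Solr JSON update handler
            start += batch_size

    if __name__ == "__main__":
        migrate()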

Once we get this workflow working, we can wrap it with an API and hook it up to the UI.

What do you recommend?

aalavandhan commented 8 years ago

Method 2 worked. It seems like the less intrusive approach. I will clean up my code and try to open a PR later this week.

What are your thoughts, though, on a longer-term strategy for the project?

smadha commented 8 years ago

Hi @nithinkrishna

Yep, you got query_crawled_index in views.py right.

Both of your methods are new functionality altogether: you are trying to bypass the geoparsing step and instead assume that the index already has geoparsed data. What would be best is to add ES queries to query_crawled_index in views.py so that it behaves the same way as with Solr. Taking a geo example from your ES index, the document below

    {
      "admin2Code": "",
      "location": {
        "lat": 51.72703,
        "lon": 28.38867
      },
      "name": "Eastern Europe",
      "countryCode": "",
      "admin1Code": ""
    }

should be flattened to

 51.72703 28.38867 Eastern Europe  

and then we should run GeoTopicParser on top of it. Since you have already run GeoTopic parsing, your methods make more sense, but you don't need to pass the geo fields. "related-publications" in your index should be a good candidate for geoparsing.
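
As a sketch, that flattening step could look like this (assuming the geo object has exactly the shape shown above):

    def flatten_geo(geo):
        # turn the nested geo object into the "lat lon name" line
        # that the GeoTopicParser step expects
        loc = geo.get("location", {})
        return "{} {} {}".format(loc.get("lat", ""), loc.get("lon", ""),
                                 geo.get("name", ""))

    geo = {
        "admin2Code": "",
        "location": {"lat": 51.72703, "lon": 28.38867},
        "name": "Eastern Europe",
        "countryCode": "",
        "admin1Code": "",
    }
    print(flatten_geo(geo))  # -> 51.72703 28.38867 Eastern Europe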

Since you said you have already connected your ES to the local Solr, I assume you must have done the two steps below.

Sample API call: http://localhost:8000/query_crawled_index/http://localhost:8983/solr/dhs/test/user/pass

Sample doc with actual geo data, in the test_1 core:
    {
      "points": [
        "[{'loc_name': 'RepublicofYemen', 'position': {'y': '47.5', 'x': '15.5'}}]"
      ],
      "id": "original_id",
      "_version_": 1530733127035519000
    }

Sample doc in the admin core for the domain test:

"docs": [
      {
        "point_len_list": [
          13
        ],
        "idx_field_list": [
          "id,title"
        ],
        "core_names": [
          "test_1"
        ],
        "idx_size_list": [
          388
        ],
        "indexes": [
          "http://localhost:8983/solr/dhs"
        ],
        "id": "test",
        "_version_": 1530733127305003000
      }
    ]
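
For reference, pushing a doc in that shape into the local test_1 core could look like the sketch below (field names copied from the sample above; _version_ is omitted because Solr assigns it):

    import requests

    SOLR_CORE = "http://localhost:8983/solr/test_1"

    doc = {
        "id": "original_id",
        "points": ["[{'loc_name': 'RepublicofYemen', "
                   "'position': {'y': '47.5', 'x': '15.5'}}]"],
    }
    # Solr's JSON update handler accepts a list of docs
    requests.post(SOLR_CORE + "/update?commit=true", json=[doc])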

If you want, you can also write your own function that returns the details of the khooshe tiles in "return_points_khooshe", but that might not be a proper integration.

Thanks

aalavandhan commented 8 years ago

Ah, got it.

Now I understand what you expect from the Elasticsearch integration. Ideally, you want GeoParser to:

  1. Take in an Elasticsearch index
  2. Iterate through each of its documents
  3. Run the metadata values through Tika GeoTopic to fetch the locations
  4. Dump these locations into the local Solr index
  5. Then run update_idx_details on that index, which generates the tiles
  6. The visualizations would then work on the generated tiles.

Is this the workflow you expect?

smadha commented 8 years ago

Perfecto!

This is the exact workflow. All of this is done already; all you need to do is step 2.

We need to add a check to identify a URL as Solr or ES, and then write code for iterating through each of the documents in ES.

You will need to modify the query_crawled_index method. I think we might need to make query_crawled_index more granular, as I expect data iteration in ES to be different than in Solr (a rough sketch of that split is below). I will be happy to work on this with you; if needed we can chat more on Hangouts: msharan@usc.edu
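
A rough sketch of that split, with the dispatch on URL type and from/size paging for ES (the function names are illustrative, not the actual views.py API):

    import requests

    def iterate_index(index_url, batch_size=100):
        # dispatch on index type, then yield documents one by one
        if "/solr" in index_url:
            yield from iterate_solr(index_url, batch_size)
        else:
            yield from iterate_es(index_url, batch_size)

    def iterate_solr(index_url, batch_size):
        start = 0
        while True:
            docs = requests.get(index_url + "/select",
                                params={"q": "*:*", "start": start,
                                        "rows": batch_size, "wt": "json"}
                                ).json()["response"]["docs"]
            if not docs:
                return
            yield from docs
            start += batch_size

    def iterate_es(index_url, batch_size):
        start = 0
        while True:
            hits = requests.get(index_url + "/_search",
                                params={"from": start, "size": batch_size}
                                ).json()["hits"]["hits"]
            if not hits:
                return
            for hit in hits:
                yield hit["_source"]
            start += batch_size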

aalavandhan commented 8 years ago

This is a potential issue.

Using '/solr' in the URL to distinguish between Solr and Elasticsearch seems like a bad idea, because ES URLs don't have any such prefix. Would it be better to make this decision based on user input, say a radio button?

What do you think?


smadha commented 8 years ago

That's a good point. I think a radio button would be the cleanest way to do it, but for now, if you want, you can put the ES code in the else branch of the "/solr" check.

What say, @chrismattmann @MBoustani?

MBoustani commented 8 years ago

@nithinkrishna and @smadha, is there any smart way to tell whether the user is using Solr or ES? For example, some query call that is specific to Solr or to ES?
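
One possibility, offered only as a suggestion rather than existing GeoParser behavior: an ES index answers _count with a JSON document count, while a Solr core exposes the /admin/ping handler, so probing both tells the engines apart:

    import requests

    def detect_engine(base_url):
        # an ES index answers <index>/_count with {"count": N, ...}
        try:
            body = requests.get(base_url.rstrip("/") + "/_count", timeout=5).json()
            if "count" in body:
                return "elasticsearch"
        except (requests.RequestException, ValueError):
            pass
        # a Solr core answers its /admin/ping handler
        try:
            if requests.get(base_url.rstrip("/") + "/admin/ping",
                            params={"wt": "json"}, timeout=5).ok:
                return "solr"
        except requests.RequestException:
            pass
        return "unknown"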

aalavandhan commented 8 years ago

@MBoustani @smadha Let's move the automated discovery of ES vs Solr to another thread.

smadha commented 8 years ago

Hi @nithinkrishna, how is it going? Can we help in any way?

aalavandhan commented 8 years ago

@smadha Ah, I've been busy with finals. I'll have a PR ready by early next week; we can discuss more then.