iwpnd / flashgeotext

Extract city and country mentions from text like GeoText, but using FlashText (an Aho-Corasick implementation) instead of regex.
MIT License

Initial impressions and questions of flashgeotext for extracting countries from affiliations #16

Closed dhimmel closed 3 years ago

dhimmel commented 4 years ago

Thanks for posting at https://github.com/elyase/geotext/issues/23#issuecomment-593490351 letting me know about this package. I'm interested in it as a way to extract countries referred to in author affiliations.

For example, here is an affiliation:

'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.

For my project, I'd like to know what countries are mentioned (either directly or inferred from a place mention inside that country).

If I run the following (with v0.2.0):

import flashgeotext.geotext
geotexter = flashgeotext.geotext.GeoText(use_demo_data=True)
affil = """\
'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, \
Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, \
Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.
"""
geo_text = geotexter.extract(affil, span_info=False)
geo_text

I get the following output:

2020-03-02 18:50:46.475 | DEBUG    | flashgeotext.lookup:add:194 - cities added to pool
2020-03-02 18:50:46.479 | DEBUG    | flashgeotext.lookup:add:194 - countries added to pool
2020-03-02 18:50:46.480 | DEBUG    | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']
{'cities': {'University': {'count': 2},
  'Saarbrücken': {'count': 1},
  'Carnegie': {'count': 1},
  'Pittsburgh': {'count': 1},
  'Berlin': {'count': 2},
  'Parys': {'count': 2}},
 'countries': {'Germany': {'count': 2},
  'United States': {'count': 1},
  'France': {'count': 2}}}

Some impressions / questions:

  1. Are the city mentions counting towards country mentions? If yes, why does "United States" not have a count of 2 for "Pittsburgh" and "USA"?

  2. Is "Parys" standing in for "Paris"? I'm not sure why this conversion is made.

  3. Counting "University" as a city will almost always be a false positive for us, although I'm guessing this is a source data issue.

Thanks for considering this feedback / helping answer any of these questions.

iwpnd commented 4 years ago

Hi, nice of you to give it a try.

Are the city mentions counting towards country mentions? If yes, why does "United States" not have a count of 2 for "Pittsburgh" and "USA".

Country mentions are not tracked the same way that GeoText tracks them.

I wanted users to add their own data to a LookupDataPool, which is then looked up in a text. I don't cross-count across LookupData; rather, every LookupData is looked up separately.
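For anyone who does want city mentions to count towards countries, here is a minimal sketch of cross-counting as a post-processing step on the result dict. It is plain Python and not a flashgeotext feature: the `cross_count` helper and the `CITY_TO_COUNTRY` mapping are hypothetical names, and the mapping itself is an assumption you would have to supply (for example from geonames).

```python
from collections import Counter

# Hypothetical city -> country mapping; NOT shipped with flashgeotext.
CITY_TO_COUNTRY = {
    "Saarbrücken": "Germany",
    "Pittsburgh": "United States",
    "Berlin": "Germany",
    # The result above reports the Paris mentions under the key "Parys".
    # Mapping it to France is an assumption that only holds for this text.
    "Parys": "France",
}


def cross_count(result, city_to_country):
    """Add each city's count to the count of the country it maps to."""
    totals = Counter(
        {name: hit["count"] for name, hit in result.get("countries", {}).items()}
    )
    for city, hit in result.get("cities", {}).items():
        country = city_to_country.get(city)
        if country is not None:  # cities without a mapping are ignored
            totals[country] += hit["count"]
    return dict(totals)


# The extraction result reported in this issue.
result = {
    "cities": {
        "University": {"count": 2},
        "Saarbrücken": {"count": 1},
        "Carnegie": {"count": 1},
        "Pittsburgh": {"count": 1},
        "Berlin": {"count": 2},
        "Parys": {"count": 2},
    },
    "countries": {
        "Germany": {"count": 2},
        "United States": {"count": 1},
        "France": {"count": 2},
    },
}

country_totals = cross_count(result, CITY_TO_COUNTRY)
print(country_totals)
```

Cities that are missing from the mapping ("University", "Carnegie") simply do not contribute, so the mapping doubles as a crude filter.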

Is "Parys" for "Paris"... not sure why this conversion is made.

from flashgeotext.lookup import load_data_from_file
from flashgeotext.settings import DEMODATA_CITIES, DEMODATA_COUNTRIES

cities = load_data_from_file(DEMODATA_CITIES)
countries = load_data_from_file(DEMODATA_COUNTRIES)

print(cities["Paris"])
print(cities["Parys"])

A single keyword in LookupData looks like this:

{
"Paris" : ['Pa-ri', 'Paarys', 'Paraeis', 'Paras', 'Pari',
 'Paries', 'Parigi', 'Pariis', 'Pariisi', 'Parij',
 'Parijs', 'Paris', 'Parisi', 'Parixe', 'Pariz',
 'Parize', 'Parizh', 'Parizo', 'Parizs', 'Pariž',
 'Parys', 'Paryzius', 'Paryžius', 'Paräis', 'París',
 'Parîs', 'Párizs', 'paris', 'pyaris', 'Paris'],

[...]

"Parys" is a synonym of "Paris" and vice versa. The demo data is just that: demo data to show the functionality. It is taken from geonames, just like GeoText's data, but flashgeotext also uses the synonyms from geonames. I keep those synonyms that have a fuzzy ratio of at least 70% with the original keyword. It might be worth cleaning the demo data a little more. But what I actually want is for people to create their own datasets of what they're looking for and use flashgeotext as the means to extract them.
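To illustrate the kind of threshold filter described above, here is a rough sketch using `difflib.SequenceMatcher` from the standard library. This is an assumption for illustration only: the fuzzy ratio actually used to build the demo data may come from a different library and give different scores. The `filter_synonyms` helper and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher


def filter_synonyms(keyword, synonyms, threshold=0.7):
    """Keep synonyms whose similarity to the keyword is at least `threshold`.

    Similarity is difflib's ratio on lowercased strings, used here as a
    stand-in for whatever fuzzy ratio the demo data was built with.
    """
    kept = []
    for synonym in synonyms:
        ratio = SequenceMatcher(None, keyword.lower(), synonym.lower()).ratio()
        if ratio >= threshold:
            kept.append(synonym)
    return kept


# Illustrative candidates; "Lutetia" is not taken from the demo data.
candidates = ["Parys", "Parigi", "Lutetia"]
kept = filter_synonyms("Paris", candidates)
print(kept)
```

A filter like this keeps near-spellings such as "Parys" and drops unrelated historical names, which is also why "Parys" can end up as a key for Paris mentions.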

Counting "University" as a city will almost always be a false positive for us, although I'm guessing this is a source data issue.

University is part of geonames city names for some reason, so yeah, another data problem.
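Until the data is cleaned, one workaround is to drop known false positives from the result after extraction. A minimal sketch, assuming the result dict has the shape shown earlier in this issue; the `drop_keywords` helper and the stoplist are hypothetical and would need to be chosen per application.

```python
def drop_keywords(result, lookup_name, stoplist):
    """Return a copy of `result` without the stoplisted keywords in one lookup."""
    # Copy one level deep so the original result is left untouched.
    cleaned = {name: dict(hits) for name, hits in result.items()}
    for keyword in stoplist:
        cleaned.get(lookup_name, {}).pop(keyword, None)
    return cleaned


result = {
    "cities": {
        "University": {"count": 2},
        "Carnegie": {"count": 1},
        "Berlin": {"count": 2},
    }
}

# Hypothetical stoplist for affiliation strings.
cleaned = drop_keywords(result, "cities", {"University", "Carnegie"})
print(cleaned)
```

Removing the offending keywords from the lookup data before extraction would be cleaner, but this version works on any result without touching the data files.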

Thanks for your feedback, I am really happy about it. If you have suggestions on how to improve it, I am all ears.

dhimmel commented 4 years ago

Thanks @iwpnd for the explanation. Looks like you're building a strong foundation for really fast extraction of places from text.

As a user, I'm looking for a package that can extract the number of country mentions (including those inferred by city cross-counting) in English texts. Ideally, there'd be a way to use flashgeotext with data that closely matches my intended application, with little-to-no extra work needed.

I see how you're looking to build a lower-level application to support data that I could plug in. However, I think you'll get a lot more users if you can also provide the data that supports their intended application. Hopefully, more of a community can develop to offload some of this work from your shoulders!

Thanks again, looking to see how this project evolves, since it seems like a sturdy foundation... but probably needs a few second-layer features to appeal to my use case including:

Cheers!

iwpnd commented 4 years ago

@dhimmel thanks for your feedback. A lot of good points. :)