gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Adds language parameter to loading/indexing. Fixes #39. #115

Closed. kerchner closed this 3 years ago

kerchner commented 3 years ago

To Test:

# !pip install elasticsearch_dsl
from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections as es_connections

client = es_connections.create_connection(hosts=['http://gwtweetsets-dev1.wrlc.org:9200'])
# modify index as needed in the next line
client.indices.get_mapping(index='tweets-cdb109')

# confirm in the output of the previous line that "language" is present

# run this again with a different 'language' value:
s = Search.from_dict({'query': {'bool': {'filter': [{'term': {'language': {'value': 'es'}}}]}},
 'aggs': {'top_users': {'terms': {'field': 'user_screen_name', 'size': 10}},
  'top_hashtags': {'terms': {'field': 'hashtags', 'size': 10}},
  'top_mentions': {'terms': {'field': 'mention_screen_names', 'size': 10}},
  'top_urls': {'terms': {'field': 'urls', 'size': 10}},
  'tweet_types': {'terms': {'field': 'tweet_type'}},
  'created_at_min': {'min': {'field': 'created_at'}},
  'created_at_max': {'max': {'field': 'created_at'}}},
 'track_total_hits': True,
 '_source': ['tweet',
  'mention_user_ids',
  'user_id',
  'mention_screen_names',
  'user_screen_name']})

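# modify index as needed in the next line
s = s.index('tweets-cdb109')
# execute() returns the first page of hits plus the aggregations
response = s.execute()
# scan() iterates over every matching document and yields the hits only
results = list(s.scan())
len(results)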

# Note that the number of results differs when language is "es" vs. "fr" vs. "en"

lwrubel commented 3 years ago

Confirmed that loading adds a language field and querying with a language filter also works. Good to squash and merge.
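
For anyone repeating the test steps above across several languages, the two checks can be sketched as small helpers. This is only an illustration: `language_query` and `has_language_field` are hypothetical names, not part of TweetSets, and the mapping check assumes the typeless mapping layout returned by Elasticsearch 7 or later.

```python
def language_query(language):
    """Build a search body that keeps only tweets whose 'language' field matches.

    Hypothetical helper; mirrors the term filter used in the test script.
    """
    return {
        'query': {'bool': {'filter': [{'term': {'language': {'value': language}}}]}},
        'track_total_hits': True,
    }


def has_language_field(mapping, index):
    """Return True if the get_mapping() output for `index` lists a 'language' field.

    Assumes the layout {index: {'mappings': {'properties': {...}}}}.
    """
    properties = mapping.get(index, {}).get('mappings', {}).get('properties', {})
    return 'language' in properties
```

With the connection from the test script in place, something like `Search.from_dict(language_query('es')).index('tweets-cdb109').count()` should give the per-language totals to compare, and `has_language_field(client.indices.get_mapping(index='tweets-cdb109'), 'tweets-cdb109')` should replace the visual check of the mapping output.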