Open saggu opened 8 years ago
This sample input contains no regions or countries. Let's create a small sample dataset that contains US localities with regions and no country, a few entries that contain locality, region, and country where the values match exactly, and some where they match almost exactly.
Working on that ^^. However, I ran the code on one input with exact matches (attached here) and got the same output.
You shouldn't use this: `country = Toolkit.get_value_json(key + ".country_uri", wholestates_dicts)`. The key is a URI, and a URI contains "." characters itself.
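To illustrate the point, here is a minimal sketch of why a dotted-path lookup breaks on URI keys. `get_by_dotted_path` and `get_field` are hypothetical helpers written for this example; they are not the actual `Toolkit.get_value_json` implementation, and the dictionary entries are illustrative.

```python
def get_by_dotted_path(path, data):
    """Hypothetical dotted-path lookup (an assumption, not the project's Toolkit code)."""
    current = data
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return ""
        current = current[part]
    return current


def get_field(uri, field, data):
    """Look up the whole URI as one key, then the field, so dots in the URI are harmless."""
    return data.get(uri, {}).get(field, "")


# Illustrative data only; not taken from the project's dictionaries.
uri = "http://www.geonames.org/5332921"
states = {uri: {"country_uri": "http://www.geonames.org/6252001"}}

# The URI is split at "www.geonames.org", so the first path segment is never found.
broken = get_by_dotted_path(uri + ".country_uri", states)

# Treating the URI as an opaque key finds the value.
fixed = get_field(uri, "country_uri", states)
```

Keeping the URI and the field name as separate arguments avoids having to escape or re-split the key.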
I updated test and Toolkit. @saggu, try it again. Thanks.
I ran it with the one input and got:
File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
for obj in iterator:
File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/rdd.py", line 1496, in func
File "/Users/amandeep/Github/dig-entity-resolution/EntityResolution/EntityResolution/EntityResolution/test.py", line 96, in
It worked on my own laptop. Let's discuss it tomorrow.
`city_dict = codecs.open(output_path + "/city_dict.json", 'w', 'utf-8')` Here is the problem: the strings are unicode, not utf-8. I deleted it. The code should work now.
UTF-8 is a character encoding capable of encoding all possible characters defined by Unicode.
Yeah, but they are different when you use them as dictionary keys. Try the new code to see whether it works.
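A quick way to check this disagreement is a round trip through `codecs.open`. The sketch below assumes Python 3 semantics (the thread does not state the Python version) and uses a made-up two-entry dictionary. It shows that text keys survive a UTF-8 write and read unchanged, while an already-encoded bytes key is a different key from its text form.

```python
import codecs
import json
import os
import tempfile

# Made-up sample dictionary: one ASCII-only URI key and one non-ASCII key.
city_dict = {
    u"http://www.geonames.org/5337542": {"name": u"Cisco"},
    u"Zürich": {"name": u"Zürich"},
}

path = os.path.join(tempfile.mkdtemp(), "city_dict.json")

# Write text through a UTF-8 encoding stream.
with codecs.open(path, "w", "utf-8") as out:
    out.write(json.dumps(city_dict, ensure_ascii=False))

# Read it back through a UTF-8 decoding stream.
with codecs.open(path, "r", "utf-8") as src:
    loaded = json.load(src)

# The keys come back as the same text strings.
same_keys = set(loaded) == set(city_dict)

# A UTF-8 encoded bytes key does not match the text key.
bytes_key_found = u"Zürich".encode("utf-8") in loaded
```

Under Python 2, ASCII-only byte strings and unicode strings compare and hash equal, so ASCII-only keys such as the geonames URIs would behave identically either way.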
Nope, no difference at all. The keys don't even have unicode characters in them. The key http://www.geonames.org/5337542 is missing from city_dict because that city is Cisco, California, with a population of 0, which you have filtered out. Yesterday I changed the code to account for that, in processDoc() in test.py:
```
snc = Toolkit.get_value_json(eid + ".snc", wholecities_dicts)
if snc != '':
    temp = Row(id=eid, value=entity.value + "," + snc, start=entity.start, end=entity.end, score=entity.score)
    jsent.append(temp)
```
But it seems to have been overwritten now. Please do a git pull before pushing to GitHub.
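For reference, the guard described above can be sketched in a self-contained form. This is an assumption-laden illustration, not the code in test.py: `Row` and `Entity` are `namedtuple` stand-ins (the real code uses a Spark `Row`), `attach_snc` is a hypothetical helper, and the `snc` value is a placeholder string.

```python
from collections import namedtuple

# Stand-ins for illustration; the real code uses pyspark's Row and its own entity objects.
Row = namedtuple("Row", ["id", "value", "start", "end", "score"])
Entity = namedtuple("Entity", ["id", "value", "start", "end", "score"])


def attach_snc(entities, cities):
    """Keep only entities whose id is present in the city dictionary."""
    rows = []
    for entity in entities:
        # A city filtered out of the dictionary yields an empty string here.
        snc = cities.get(entity.id, {}).get("snc", "")
        if snc != "":
            rows.append(Row(id=entity.id, value=entity.value + "," + snc,
                            start=entity.start, end=entity.end, score=entity.score))
    return rows


# Placeholder dictionary: the filtered-out city (population 0) is simply absent.
cities = {"http://www.geonames.org/1": {"snc": "placeholder"}}
entities = [
    Entity("http://www.geonames.org/1", "some city", 0, 9, 1.0),
    Entity("http://www.geonames.org/5337542", "cisco", 10, 15, 1.0),
]

rows = attach_snc(entities, cities)
```

Skipping entities with no dictionary entry keeps the job from failing on cities that the population filter removed.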
I am trying to run on the sample input, but I get no matches at all. @ZhengTang1120 Please take a look.
The latest code is checked in. Sample Input sample output