usc-isi-i2 / dig-entity-resolution

Apache License 2.0

No Matches for any input #5

Open saggu opened 8 years ago

saggu commented 8 years ago

I am trying to run on the sample input, but I get no matches at all. @ZhengTang1120 Please take a look.

The latest code is checked in. Attachments: sample input, sample output.

szeke commented 8 years ago

This sample input contains no regions or countries. Let's create a small sample dataset that contains US localities with a region but no country, plus a few entries that contain locality, region, and country. Some of the values should match exactly and some almost exactly.

saggu commented 8 years ago

Working on that ^^. However, I ran the code on one input with exact matches (attached here) and got the same output.

sample_1.txt sample_1_output.txt

ZhengTang1120 commented 8 years ago

You shouldn't use this: `country = Toolkit.get_value_json(key + ".country_uri", wholestates_dicts)`. The key is a URI, so it contains "." itself.
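To illustrate the point above, here is a minimal sketch of how a dotted-path lookup can go wrong when the key is a URI. The helper `get_by_dotted_path` is hypothetical and is not the actual `Toolkit.get_value_json` implementation; it only assumes that a getter splits the path on every ".". The `example.org` URIs are placeholder data.

```python
def get_by_dotted_path(path, data):
    """Hypothetical getter: split the path on every '.' and walk nested dicts."""
    current = data
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return ""  # any missing segment yields an empty result
        current = current[part]
    return current


# Placeholder data: the top-level key is a URI, which itself contains dots.
states = {
    "http://example.org/state/1": {"country_uri": "http://example.org/country/1"}
}
key = "http://example.org/state/1"

# The URI is split into "http://example", "org/state/1", "country_uri",
# so the first segment is never found.
broken = get_by_dotted_path(key + ".country_uri", states)

# Indexing with the whole URI first avoids the problem.
direct = states.get(key, {}).get("country_uri", "")
```

Under that assumption, `broken` is an empty string while `direct` holds the country URI, which would be consistent with getting no matches.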

ZhengTang1120 commented 8 years ago

I updated test and Toolkit. @saggu, try it again. Thanks.

saggu commented 8 years ago

I ran with the 1 input, getting:

  File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/Users/amandeep/softwares/spark-1.5.0-cdh5.5.1/python/lib/pyspark.zip/pyspark/rdd.py", line 1496, in func
  File "/Users/amandeep/Github/dig-entity-resolution/EntityResolution/EntityResolution/EntityResolution/test.py", line 96, in
    candidates = lines.map(lambda line : processDoc(wcd,wsd,d,json.loads(line), city_dict,state_dict))
  File "test.py", line 31, in processDoc
    snc = wholecities_dicts[eid]["snc"]
KeyError: u'http://www.geonames.org/5337542'

ZhengTang1120 commented 8 years ago

It worked on my own laptop. Let's discuss it tomorrow.

ZhengTang1120 commented 8 years ago

`city_dict = codecs.open(output_path + "/city_dict.json", 'w', 'utf-8')` Here is the problem: the strings are unicode, not utf-8. I deleted it. The code should work now.

saggu commented 8 years ago

UTF-8 is a character encoding capable of encoding all possible characters defined by Unicode.

ZhengTang1120 commented 8 years ago

Yeah, but they are different when you use them as dictionary keys. Try the new code to see if it works or not.
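For reference, a small sketch of the distinction being discussed, written for Python 3, where text (`str`) and UTF-8 encoded bytes are strictly separate types. The thread's traceback shows a `u'...'` key, which suggests Python 2; there, an ASCII-only `str` and `unicode` compare equal and hash the same, so this sketch shows the general idea rather than the behaviour of the code in this repository.

```python
text_key = "http://www.geonames.org/5337542"  # Unicode text (str)
bytes_key = text_key.encode("utf-8")          # the same characters, UTF-8 encoded (bytes)

lookup = {text_key: "entry"}

# Encoding and decoding with UTF-8 round-trips without loss.
round_trip_ok = bytes_key.decode("utf-8") == text_key

# In Python 3 the two types never compare equal, so only the text key is found.
found_by_text = text_key in lookup
found_by_bytes = bytes_key in lookup
```

Here `round_trip_ok` and `found_by_text` are true, while `found_by_bytes` is false.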

saggu commented 8 years ago

Nope, no difference at all. The keys don't even have unicode characters in them. The key http://www.geonames.org/5337542 is missing from city_dict because that city is Cisco, California, with a population of 0, which you have filtered out. Yesterday I changed the code to account for that, in `processDoc()` in test.py:

`
snc = Toolkit.get_value_json(eid + ".snc", wholecities_dicts)
if snc != '':
    temp = Row(id=eid,value=entity.value + ","+snc,start=entity.start,end=entity.end,score=entity.score)
    jsent.append(temp)

print cities_can
`

But it seems to have been overwritten now. Please do a git pull before pushing to GitHub.
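The guard described above can be sketched in isolation as follows. This is a simplified stand-in, not the repository's code: plain dicts replace Spark `Row` objects, `attach_snc` is a hypothetical name, and the IDs and `snc` value are placeholders.

```python
def attach_snc(entities, cities):
    """Append the 'snc' value to each entity, skipping IDs missing from cities."""
    kept = []
    for entity in entities:
        # .get() avoids the KeyError raised by cities[entity["id"]]["snc"]
        # when a city was filtered out of the dictionary.
        snc = cities.get(entity["id"], {}).get("snc", "")
        if snc != "":
            kept.append(dict(entity, value=entity["value"] + "," + snc))
    return kept


# Placeholder data: the second entity has no entry in the city dictionary.
cities = {"http://example.org/city/1": {"snc": "placeholder-snc"}}
entities = [
    {"id": "http://example.org/city/1", "value": "Springfield"},
    {"id": "http://example.org/city/2", "value": "Cisco"},
]

kept = attach_snc(entities, cities)
```

With this data, only the first entity is kept, and the missing ID is skipped instead of raising `KeyError`.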