openvenues / lieu

Dedupe/batch geocode addresses and venues around the world with libpostal
MIT License

Difficulties running dedupe_geojson out of the box and a possibly significant typo? #25

Open theloglizard opened 3 years ago

theloglizard commented 3 years ago

Hi. This is an interesting package that does almost exactly what I need, but I've had some difficulty getting it to run in Python 3 (3.8, specifically). I spent a day hacking various bits and pieces and seem to have certain slices running, and I'm happily running libpostal/pypostal in other contexts (great stuff, thanks!). I imagine I have some sort of installation/package dependency issue, but I also wonder if some sort of commit/update may have failed somewhere. For example:

class GeoJSONLineParser(GeoJSONParser):
    def __init__(self, filename):
        if filename.endswith(".bz2"):
            self.f = bz2.BZ2File(filename)
        else:
            self.f = open(filename)

    def next_feature(self):
        return json.loads(self.f.next().rstrip())

seems to be bombing with an error report:

dedupe_geojson --use-postal-code --use-zip5 --no-phone-numbers -o foo --output-filename z1 --name-dupe-threshold 0.0 name.json
Word index file: foo/info_gain.index
Near-dupe tempfile: foo/near_dupes
Features DB: foo/features_db
Output filename: foo/z1
-----------------------------
* Assigning IDs, creating near-dupe hashes + word index (using info_gain)
Traceback (most recent call last):
  File "/.local/bin/dedupe_geojson", line 299, in <module>
    for feature_id, feature in id_features(args.files):
  File "/.local/bin/dedupe_geojson", line 52, in id_features
    for feature in f:
TypeError: iter() returned non-iterator of type 'GeoJSONLineParser'


which is easily enough patched/remedied with:
return json.loads(next(self.f).rstrip())
This seems to be some sort of Python 2 vs. Python 3 thing?!
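
Looking at the traceback again, the TypeError is raised by the `for feature in f:` loop, which suggests the iterator protocol is involved as well: Python 3 looks for a `__next__` method where Python 2 looked for `next`. Below is a minimal sketch of a line parser that iterates under both, assuming (the base class isn't shown here) that GeoJSONParser only defines the Python 2 style `next`. The `LineParser` name and the in-memory sample are made up for illustration and are not lieu's actual code.

```python
import io
import json


class LineParser:
    """Hypothetical stand-in for GeoJSONLineParser, not lieu's actual class."""

    def __init__(self, fileobj):
        self.f = fileobj

    def next_feature(self):
        # next(self.f) works on Python 2 and 3; it raises StopIteration at EOF,
        # which also ends iteration over this parser.
        return json.loads(next(self.f).rstrip())

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol
        return self.next_feature()

    next = __next__  # Python 2 spelling of the same method


# Two made-up features, one JSON object per line
sample = io.StringIO('{"type": "Feature", "id": 1}\n{"type": "Feature", "id": 2}\n')
features = list(LineParser(sample))
```

If the assumption about the base class holds, adding `__next__` there (rather than in each subclass) would probably be the smaller patch.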


Also, I think there may be a typo ("canoncal" instead of "canonical") at line 99 in https://github.com/openvenues/lieu/blob/master/scripts/dedupe_geojson

def is_name_address_dupe(canoncal, other, dupe_pairs, dupes, word_index=None,
                         name_dupe_threshold=DedupeResponse.default_name_dupe_threshold,
                         needs_review_threshold=DedupeResponse.default_name_review_threshold,
                         with_address=True,
                         with_unit=False,
                         use_phone_number=False,
                         fuzzy_street_names=False):
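
Why this might matter: if the function body refers to `canonical` while the parameter is spelled `canoncal`, Python will not use the argument at all. The sketch below shows the two possible outcomes with a made-up `is_dupe` function; it is an assumption on my part that the real function body uses the correct spelling, as I have not traced it through.

```python
def is_dupe(canoncal, other):
    # Hypothetical function: the parameter is misspelled, but the body
    # uses the intended spelling, so the argument is never read.
    return canonical == other


def call_without_global():
    # With no module-level `canonical`, the lookup fails at call time.
    try:
        return is_dupe("a", "a")
    except NameError as exc:
        return "NameError: {}".format(exc)


result_without_global = call_without_global()

# If a module-level name with the correct spelling exists, the function
# silently compares against that instead of the value passed in.
canonical = "something else"
result_with_global = is_dupe("a", "a")
```

The second case is the worrying one, since nothing fails and the comparison simply uses whatever the outer variable happens to hold.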

Before I commit to further hacking to get the other slices running (I haven't done anything with the geo features yet, for example), I thought I'd check on some combination of the following:

  1. Whether dedupe_geojson should be up and running with Python 3.(?)
  2. Whether some commit or installation step has somehow failed or strayed
  3. Whether lieu is still something I might expect to work.

Also, I looked around in the installation and didn't see a simple sample input file, which would have saved me a certain amount of effort as well. As noted, I haven't sorted out all the formats and features, but in the spirit of sharing back, I attach the following JSON (name.json.gz), which seems to sort of work for me in the above call to dedupe_geojson.
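
For anyone else looking for a starting point: judging from the `json.loads(self.f.next().rstrip())` call quoted above, the line parser expects one GeoJSON feature per line. A rough sketch of writing and reading back such a file follows. The records, coordinates, and property names (`name`, `street`, `house_number`, `postcode`) are invented for illustration and are not taken from lieu's documentation or from my attachment.

```python
import json
import os
import tempfile

# Hypothetical records; property names are illustrative only.
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-73.99, 40.69]},
        "properties": {"name": "Example Cafe", "street": "Main St",
                       "house_number": "1", "postcode": "11201"},
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-73.99, 40.69]},
        "properties": {"name": "Example Café", "street": "Main Street",
                       "house_number": "1", "postcode": "11201"},
    },
]

# Write one JSON object per line (line-delimited GeoJSON).
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as out:
    for feature in features:
        out.write(json.dumps(feature) + "\n")

# Read it back the same way the quoted parser does: one json.loads per line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line.rstrip()) for line in f]
```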

Thanks for your attention. Good stuff, both this and libpostal. I appreciate your sharing.