jasonrig / address-net

A package to structure Australian addresses
MIT License

misconversion under some situations #1

Closed intotecho closed 5 years ago

intotecho commented 5 years ago

Hi Jason,

I used your default trainer to decompose about 2500 addresses that I am trying to match to GNAF (or, more specifically, VICMAP_ADDRESS). Thanks for posting it. It worked pretty well, though slowly: up to 7 seconds per address on my micro cloud shell. Maybe this was due to the warning message.

WARNING:tensorflow:Estimator's model_fn (<function model_fn at 0x7eff508251e0>) includes params argument, but params are not passed to Estimator.
row 4 decomposing Address:  146/2 NOONE STREET CLIFTON HILL
predicting for 146/2 NOONE STREET CLIFTON HILL

These addresses were scraped from an old council document, so they don't follow modern standards. There is no postcode.

https://www.yarracity.vic.gov.au/-/media/files/the-area/heritage/city-of-yarra-heritage-review-appendix-8.pdf?la=en&hash=5818FC0071A12F3C6C8FB489BA0582681264F0AD

Here is a subset of the results.

NormalAddress,number_last_suffix,state,postcode,number_first,street_type,number_last,locality_name,building_name,street_name,flat_number
142 NOONE STREET CLIFTON HILL,,,,142,STREET,,CLIFTON HILL,,NOONE,
144 NOONE STREET CLIFTON HILL,,,,144,STREET,,CLIFTON HILL,,NOONE,
146 NOONE STREET CLIFTON HILL,,,,146,STREET,,CLIFTON HILL,,NOONE,
146/1 NOONE STREET CLIFTON HILL,,,,,STREET,1,CLIFTON HILL,,NOONE,146
146/2 NOONE STREET CLIFTON HILL,,,,,STREET,2,CLIFTON HILL,,NOONE,146
146/7 NOONE STREET CLIFTON HILL,,,,,STREET,7,CLIFTON HILL,,NOONE,146
146/8 NOONE STREET CLIFTON HILL,,,,48,STREET,,CLIFTON HILL,,NOONE,16
146/0 NOONE STREET CLIFTON HILL,,,,40,STREET,,CLIFTON HILL,,NOONE,16
160 NOONE STREET CLIFTON HILL,,,,160,STREET,,CLIFTON HILL,,NOONE,
162 NOONE STREET CLIFTON HILL,,,,162,STREET,,CLIFTON HILL,,NOONE,

Notice that most addresses were decomposed correctly, but 146/8 and 146/0 were converted incorrectly. It's interesting that the RNN generated the new numbers 16 and 48, which are not in the input data. It's repeatable, and adding a postcode does not change the behaviour.

To be sure, this is not the standard way to write a unit address. Note that 8/146 converts fine.

8/146 NOONE STREET CLIFTON HILL,,,,146,STREET,,CLIFTON HILL,,NOONE,8
0/146 NOONE STREET CLIFTON HILL,,,,146,STREET,,CLIFTON HILL,,NOONE,0

Also, when the number has a suffix, it sometimes gets appended to number_first (as in 176I and 176J below) instead of being split out as a suffix:

176B NOONE STREET CLIFTON HILL,,,,176,STREET,,CLIFTON HILL,,NOONE,,B,,,,
176C NOONE STREET CLIFTON HILL,,,,176,STREET,,CLIFTON HILL,,NOONE,,C,,,,
176D NOONE STREET CLIFTON HILL,,,,176,STREET,,CLIFTON HILL,,NOONE,,D,,,,
176E NOONE STREET CLIFTON HILL,,,,176,STREET,,CLIFTON HILL,,NOONE,,E,,,,
176G NOONE STREET CLIFTON HILL,,,,176,STREET,,CLIFTON HILL,,NOONE,,G,,,,
176G NOONE STREET CLIFTON HILL,,,,176,STREET,,CLIFTON HILL,,NOONE,,G,,,,
176H NOONE STREET CLIFTON HILL,,,,176,STREET,,CLIFTON HILL,,NOONE,,H,,,,
176I NOONE STREET CLIFTON HILL,,,,176I,STREET,,CLIFTON HILL,,NOONE,,,,,,
176J NOONE STREET CLIFTON HILL,,,,176J,STREET,,CLIFTON HILL,,NOONE,,,,,,

Also, there are some issues when the address is 88 THE ESPLANADE CLIFTON HILL:

88 THE ESPLANADE CLIFTON HILL,,,,88,,,CLIFTON HILL,,THE ESPLANADE,,,

It does not split THE ESPLANADE into street_name and street_type.

jasonrig commented 5 years ago

Hi @intotecho,

Your feedback is very much appreciated! I've tried to explain why I think you're seeing this behaviour, but at this point I can't offer a definite resolution.

1) The model was trained using a couple of GPUs, and while CPU-only usage is fine, I think a micro cloud instance may be a little light on available CPU cores (indeed, Google's micro cloud shell is powered by a g1-small instance, which has only 0.5 vCPUs).

2) Regarding the new numbers 16 and 48, this is a side effect of the way in which the per-character classes are recomposed into a string. It would appear that the first and third characters (1 and 6) are being assigned one class and the second and fifth characters (4 and 8) another, so they are grouped together when the final output is produced (a minimal sketch of this recomposition follows this list). Remember that at the core of the model is a per-character classification, and the address entities are delineated simply by grouping the characters on the assumption that they are always contiguous within the address. The street/flat numbers in the example addresses look a little odd to me (I don't think I've ever seen a street or flat number of zero, and probably neither has the RNN model).

3) As for "The Esplanade" being incorrectly classified, I suppose that's just a shortcoming of the model training.
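To illustrate point 2, here is a minimal sketch of that recomposition step. This is not the actual addressnet code; the class names and the per-character labelling below are illustrative assumptions, chosen only to show how "146/8" can yield the spurious values 16 and 48.

```python
from collections import defaultdict

def recompose(chars, labels):
    """Concatenate characters that share a predicted class.

    Characters are grouped purely by class label, so if the per-character
    classifier splits one token across two classes, the recomposed fields
    contain interleaved digits that never appeared together in the input.
    """
    fields = defaultdict(str)
    for ch, label in zip(chars, labels):
        if label != "separator":  # separators/whitespace are dropped
            fields[label] += ch
    return dict(fields)

# Hypothetical per-character predictions for "146/8":
# '1' and '6' classified as flat_number, '4' and '8' as number_first.
chars = list("146/8")
labels = ["flat_number", "number_first", "flat_number", "separator", "number_first"]
print(recompose(chars, labels))
# {'flat_number': '16', 'number_first': '48'}
```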

If you would like to tweak the parameters / training data to improve the model, a good start would be to read through the pointers given in this comment I have left on medium.com. This is probably beyond the scope of the problem you're trying to solve though! All-in-all, unless the data is collected in a structured format to begin with, there will always be some loss of fidelity when trying to restructure it in a probabilistic way.

intotecho commented 5 years ago

Hi Jason,

Thanks for the response, and for publishing the RNN and article.

I didn't think the CPU would change the predictions - just make it slower.

I was getting a WARNING issued for each prediction:

WARNING:tensorflow:Estimator's model_fn (<function model_fn at 0x7fded56bc1e0>) includes params argument, but params are not passed to Estimator.

I suppressed this by adding params={} to the Estimator construction:

address_net_estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=model_dir, params={})

This makes it run faster.
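For completeness, here is a minimal sketch of that construction with the imports; the import path for model_fn is my assumption about the package layout, and the params={} argument is the only change that matters here.

```python
import tensorflow as tf

# Assumed module path for the package's model_fn; adjust to wherever it
# lives in your checkout of address-net.
from addressnet.model import model_fn

model_dir = "path/to/pretrained/model"  # wherever the published checkpoint lives

# Passing an empty params dict suppresses the Estimator warning above.
address_net_estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params={},
)
```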

I noticed that the kinds of mistakes people make in addresses are not just typos but semantic errors. I have a lot of addresses in my dataset that use the wrong suburb (like Burnley or Cremorne instead of Richmond), spell the street name wrong, swap the unit and building numbers, or use the wrong postcode. The RNN was not designed to fix these, but decomposing the address into components makes it easier for me to fix these issues.

Regards, Chris

jasonrig commented 5 years ago

The feedback is very much appreciated. Your observation that passing an empty dict to the estimator makes it run faster is curious; I would have thought it would have negligible to no impact on run time. Out of curiosity, if you have any timing data comparing with/without params, I'd be interested to see exactly how substantial the difference is. No problem if this isn't readily available.

If you do end up making any changes that improve the output or utility of the tool, you are most welcome to submit any contributions to the codebase.

SmileSydney commented 5 years ago

Firstly, thanks @jasonrig for sharing this project. I was looking for something just like this.

I agree with @intotecho that semantic errors are a source of address mistakes. GNAF does have address_aliases, but I'm not sure that would help. This is very much the realm of the proprietary (paid) address lookup and auto-complete (auto-correct) service providers.

I too am getting the same warning, and I also notice the slowness:

WARNING:tensorflow:Estimator's model_fn (<function model_fn at 0x7f663d394268>) includes params argument, but params are not passed to Estimator.

I will try the empty params as suggested by @intotecho and give you some metrics by end of the week. FYI I am using property addresses from real estate sales listings (ads) with a small sample size for the moment (<200).

SmileSydney commented 5 years ago

Timings with and without params, using my small sample of 231 addresses:

With params={}: 231 predictions, total time 313.96 s, average time 1.359 s per address
Without params={}: 231 predictions, total time 322.96 s, average time 1.398 s per address

jasonrig commented 5 years ago

Thanks for the details, @SmileSydney! It's surprising that omitting params causes a slow-down at all. I'll update the code at some point in the future to include the blank dictionary, since it appears to have an effect. I'll raise a separate issue to track it, and in the meantime I'd be happy to merge a PR if anyone else cares to address it sooner.

Slowness of the RNN in general is probably just the nature of the beast, so to speak. Since this project was developed more as a curiosity than anything else, I didn't check performance on CPU-only deployments; it ran well enough on my GPU. If you are using this code in a production environment where the current performance is problematic, I would suggest taking a look at predict_input_fn, specifically line 499, where I hastily hard-coded the inference batch size to 1. The batch size really should be a function parameter, configurable according to hardware/memory limitations. Increasing it may significantly speed up inference times.
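As a sketch of what I mean, exposing the batch size might look something like the following. This is not the actual predict_input_fn; the generator and encoding details are simplified assumptions, and only the batch_size parameter is the point.

```python
import tensorflow as tf

def predict_input_fn(addresses, batch_size=32):
    """Build a batched dataset of encoded address strings for inference."""
    def encode():
        for address in addresses:
            # Placeholder character encoding; the real input function uses
            # its own vocabulary rather than raw ordinals.
            encoded = [ord(c) for c in address.upper()]
            yield {"encoded_text": encoded, "lengths": len(encoded)}

    dataset = tf.data.Dataset.from_generator(
        encode,
        output_types={"encoded_text": tf.int64, "lengths": tf.int64},
        output_shapes={"encoded_text": tf.TensorShape([None]),
                       "lengths": tf.TensorShape([])},
    )
    # padded_batch pads variable-length addresses so they can share a batch;
    # batch_size is caller-configurable instead of being fixed at 1.
    return dataset.padded_batch(
        batch_size,
        padded_shapes={"encoded_text": [None], "lengths": []},
    )
```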

SmileSydney commented 5 years ago

@jasonrig Thanks for the batch size tip. I will give this a try. What range of values should I start experimenting with?

Per doco, batch_size: A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch.

jasonrig commented 5 years ago

@SmileSydney I'm not really able to give an overly informed comment on this, except to say that it should be <= the number of items in your dataset but not so big that it exhausts the memory of your device (GPU or CPU memory, depending on your system configuration). In general, during inference, increasing the batch size increases the opportunity for TensorFlow to exploit data parallelism (particularly on GPU) and hence improve overall performance. In the case of low-resourced VMs such as the g1-small instance on GCP, it's probably going to be a challenge to get much of a speedup, since you'd already be asking a lot of 0.5 vCPUs!
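If you want to probe it empirically, a rough sketch along these lines may help. It assumes the configurable predict_input_fn sketched above and the address_net_estimator and addresses names used earlier in this thread; start small and stop increasing the batch size once throughput plateaus or memory becomes the limit.

```python
import time

for batch_size in (1, 8, 32, 128):
    start = time.perf_counter()
    # Estimator.predict returns a generator; list() forces all predictions.
    results = list(address_net_estimator.predict(
        lambda: predict_input_fn(addresses, batch_size=batch_size)))
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.1f}s for {len(results)} predictions")
```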