jasonrig / address-net

A package to structure Australian addresses
MIT License

Re-train model #7

Open sinhlhvn opened 5 years ago

sinhlhvn commented 5 years ago

Currently, I want to try to retrain a new model, but it's hard for me. As you said, "you are free to train this model using the model_fn provided" (https://github.com/jasonrig/address-net#pretrained-model). So I have a question: is the model_fn function in model.py for training a new model? If not, how do I train a new model? Could you explain it to me?

jasonrig commented 5 years ago

Hi @SinhUIT thanks for your question. I haven't included the training code I used in the repo just because the training setup can depend a lot on your system configuration (e.g. GPU, CPU, multi-GPU, distributed, cloud TPU, etc.). But my code makes use of the tensorflow Estimator API to make things easy.

If you take a look at the tensorflow docs on the Estimator API, you'll see that you can set up an Estimator object using just the model_fn along with optional parameters, which I do here in the code.
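For illustration, a minimal sketch of that setup (model_dir is a hypothetical checkpoint path):

```python
import tensorflow as tf
from addressnet.model import model_fn

# Wrap the repo's model_fn in a TF1 Estimator; checkpoints and
# summaries are written to model_dir.
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="./addressnet-checkpoints",
)
```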

Once you have an Estimator object, you can use it for inference or training as you need. For this to work, you need an input_fn, which feeds data into the Estimator using the Dataset API. For this, I have provided two input_fns in my code. The first one is for training, and it applies all the random transformations to the original GNAF data (this is where you can get creative). The second one is for inference, and feeds in user-supplied free-text data directly for prediction.
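That inference path is what the predict_one helper shown in the README wraps, for example:

```python
from addressnet.predict import predict_one

# Parse a single free-text address into its components using the
# pretrained model; the inference input_fn is used under the hood.
print(predict_one("10A/24-26 HIGH STREET ROAD MOUNT WAVERLEY VIC 3149"))
```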

If you use the training input_fn provided, you will need a tfrecord file. This is a native tensorflow binary-format file containing the original GNAF data. The code to produce the tfrecord file from the GNAF data is here. I have chosen not to redistribute the tfrecord file because I'm unsure whether there are any licensing issues. To get the input CSV file for this script, you'll need to set up an SQL server (e.g. Postgres) and import the SQL files as per the GNAF instructions so that you can export the "address_view" SQL view to a CSV file. All of these files are freely available, and more info can be found on the data.gov.au page.
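As a hedged sketch of that export step, assuming the GNAF data has been imported into a local Postgres database (the connection string and output path are placeholders):

```python
import psycopg2

# Hypothetical connection details; adjust to your Postgres setup.
with psycopg2.connect("dbname=gnaf user=postgres") as conn:
    with conn.cursor() as cur, open("address_view.csv", "w") as f:
        # Dump the GNAF "address_view" view to CSV with a header row.
        cur.copy_expert(
            "COPY (SELECT * FROM address_view) TO STDOUT WITH CSV HEADER", f
        )
```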

When you have set everything up, training should be as simple as running the train method of the Estimator object. If you want to get fancy with multi-GPU or cloud-based training, you should take a look at these docs.
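Putting it together, a minimal sketch, assuming the training input_fn factory is the dataset() function in addressnet/dataset.py and that you have already generated a tfrecord file at a (hypothetical) path:

```python
from addressnet.dataset import dataset  # training input_fn factory (assumed name)

# Train on the generated tfrecord file for a fixed number of steps;
# "gnaf.tfrecord" is a placeholder path.
estimator.train(input_fn=dataset(["gnaf.tfrecord"]), steps=100000)
```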

jasonrig commented 5 years ago

@poorlymac you had a question earlier in issue #5 that I didn't get around to replying to properly (sorry!). Perhaps you could take a look at my reply above in this thread to see if it helps.

@SinhUIT, @poorlymac, if you do manage to make any improvements, PRs are absolutely welcome.

narasimhankrishna commented 4 years ago

Hi, your explanation of training a custom model is good, but if you can explain it step by step, that would be great. I was able to run generate_tf_records with a customisation for my own CSV file. Next, how do I generate the model files, equivalent to your pre-trained directory? And do I have to modify the code in model.py and predict.py to match my CSV file's labels and data? Please, Jason, if you can elaborate on the steps, it will be a ton of help for beginners like me. Thanks and best regards, NK

dylanhogg commented 4 years ago

Hi @narasimhankrishna, I have some rough code to do basic CPU training in a fork that might be of use to you: https://github.com/dylanhogg/address-net/blob/master/train.py

You'll need to change the paths etc., but it should get you going. If I get a chance I'll tidy it up and submit a PR back to Jason's repo here.

narasimhankrishna commented 4 years ago

@dylanhogg I generated tf records with just these labels: string_fields = ('addrnumber', 'streetname', 'locality', 'area', 'district', 'state', 'pincode'). Then I tried to train with your train.py. It complains: 'tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Datasetmap_123}} Feature: building_name (data type: string) is required but could not be found.' Does this mean my dataset must have exactly the labels of the original model? How can I include my own labels for the training? Please share your thoughts. Regards, NK

narasimhankrishna commented 4 years ago

@dylanhogg and thanks for sharing the training code. It is a giant step for me. Regards

dylanhogg commented 4 years ago

@narasimhankrishna Yes, I think you need to include all the columns from the GNAF dataset in your training data. I added Postgres table-creation, data-loading, and export-to-training-CSV scripts here: https://github.com/dylanhogg/address-net/commit/c945cb129d2b8719cff90636f431fdf6f0fcb817, which should help you.

Also thanks @jasonrig for your awesome work here, much appreciated.

hblandford commented 4 years ago

I have some addresses that produce errors when using the pretrained model. What is the best way to correct this and retrain the model? Alternatively, am I thinking about this incorrectly? Should I be taking the model output and then searching the GNAF dataset for the "official" address?

jasonrig commented 4 years ago

@hblandford there are two things to consider here:

  1. There will always be some inputs that are not correctly parsed simply because this is a probabilistic model and doesn't follow any pre-defined rules. You could add incorrectly parsed examples to the training data and the model may improve to a degree. Indeed, the more true examples given in the training data (as opposed to the data from GNAF and the artificially corrupted examples used to help the model generalise), the better. But this data is difficult to find (and surely expensive to generate).

  2. The model produces per-character classes (each letter is assigned one of the address components), and each letter, through a post-processing step, is grouped by class to give the final output (see the sketch after this list). Because the original spelling is preserved, an input error may be correctly split by address component but still might not yield a valid address.
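A toy illustration of that grouping step (the class names and the "sep" label are assumptions for illustration, not the model's exact label set):

```python
# Concatenate characters by their predicted class to form each component.
text = "3 MAIN ST"
classes = ["number", "sep", "street_name", "street_name", "street_name",
           "street_name", "sep", "street_type", "street_type"]

parts = {}
for ch, cls in zip(text, classes):
    if cls != "sep":
        parts[cls] = parts.get(cls, "") + ch

print(parts)  # {'number': '3', 'street_name': 'MAIN', 'street_type': 'ST'}
```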

My suggestion would be to consider how to perform some fuzzy matching against the GNAF dataset based on the result of the pre-trained model. Exactly how to do this, I don't know. There are many different approaches to string similarity matching, and you are likely to encounter errors that exist beyond spelling (e.g. incorrect street types, addresses that are on the boundary of post code or suburbs, etc.)
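For illustration only, a minimal stdlib sketch of string-similarity matching (the candidate list is hypothetical; a real system would need something more scalable than pairwise comparison over all of GNAF):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

# Rank hypothetical GNAF candidates against a misspelled parsed street name.
candidates = ["ELIZABETH", "ELISABETH", "ELIZA"]
parsed = "ELIZABTH"
best = max(candidates, key=lambda c: similarity(parsed, c))
print(best)  # ELIZABETH
```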

You might handle the above differently depending on your application. Real-time data entry (web forms, etc.) might show a list of possible entries, and offline batch processing might flag some entries for manual checking if too ambiguous. If the purpose is to geocode, then you might ignore some address components at the expense of precision.

senthangamani commented 4 years ago

Hi @dylanhogg, I have tried your training file; however, my trained weights are giving me different results for the same input. Am I doing anything wrong with the seed?

diogorjs commented 3 years ago

Hello @jasonrig, good job on this address parser. I need to do the same, but my addresses are in Portuguese, since they are from Brazil. Although the format is pretty much the same, do you think your model will work, or do I need to change, for example, lookups.py so that it uses the Portuguese names?

Could you please guide me on how to use the code with Portuguese based address?

Also, as for what goes in the training file, could you please give some examples?

Thanks and regards, Diogo

jasonrig commented 3 years ago

@diogorjs from a theoretical perspective, there is no reason why this model would not work (of course, it will need retraining). The purpose of this model is simply to assign a class (part of the street address) to each character in the string, so the main limiting factor is the availability of a suitable dataset. I used the freely available GNAF dataset for Australia.

About the lookups.py file: these lookups serve two purposes:

  1. Encoding the categorical data in the training data file
  2. Introducing diversity where there are several ways to represent the same thing (e.g. "ROAD" can sometimes just be "RD"); see https://github.com/jasonrig/address-net/blob/master/addressnet/dataset.py

The content of lookups.py should be updated according to your training dataset and the addressing system in Brazil.
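For instance, a Portuguese analogue might look something like this (a hypothetical structure in the spirit of lookups.py, not its exact schema):

```python
# Canonical street types mapped to alternative spellings/abbreviations,
# so the dataset generator can introduce realistic variation.
street_types_pt = {
    "RUA": ["R", "R."],
    "AVENIDA": ["AV", "AV.", "AVDA"],
    "TRAVESSA": ["TV", "TRAV"],
    "ESTRADA": ["EST", "ESTR"],
}
```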

It's a little bit difficult for me to give you thorough guidance for adapting the code to Brazil mainly because I'm not familiar with addressing in Brazil, nor any official address datasets for Brazil. But I can summarise the general approach.

  1. A dataset in tf record format is created from a source CSV file (because this native binary format should make reading the data more efficient during training)
  2. A "Estimator" object is created (see https://github.com/jasonrig/address-net/blob/master/addressnet/predict.py#L111) using the model_fn (see https://github.com/jasonrig/address-net/blob/master/addressnet/model.py#L8). This effectively instantiates the model and give you prediction/training methods to call.
  3. A training "input function" is created
  4. Use the estimator's train method with the input function to come up with the trained model

If I were you, I would try to re-implement this model using Tensorflow v2 rather than using this code as-is. The main reason is that the API for v2 is significantly different and the documentation for v1 is harder to view now. The core part of this model is very simple (< 100 lines of code, https://github.com/jasonrig/address-net/blob/master/addressnet/model.py). The bulk of the code is not about the model itself, but about creating human-like permutations/corruptions (switching around parts of the address, misspellings, etc.) to autogenerate a dataset good enough for training the model.
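As a starting point, here is a minimal TF2/Keras sketch of that core idea; the layer sizes and class count are illustrative assumptions, not the pretrained model's hyperparameters:

```python
import tensorflow as tf

NUM_CHARS = 128    # size of the character vocabulary (assumption)
NUM_CLASSES = 22   # one class per address component (assumption)

# Per-character classifier: embed each character, run a bidirectional
# RNN over the sequence, and predict a class for every position.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_CHARS, 8, mask_zero=True),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, return_sequences=True)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```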

Aj-232425 commented 2 years ago

> Currently, I want to try to retrain a new model, but it's hard for me. As you said, "you are free to train this model using the model_fn provided" (https://github.com/jasonrig/address-net#pretrained-model). So I have a question: is the model_fn function in model.py for training a new model? If not, how do I train a new model? Could you explain it to me?

Have you been able to retrain it on a new dataset? This is a very interesting task. Well, I was looking for something that expands abbreviated addresses, not just parses them; for example, it should show "St" as "Street", "Ave" as "Avenue", and "Cir" as "Circle". I am mainly looking at US addresses. Any help would be appreciated.
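For what it's worth, expansion could be a simple post-processing table applied to the parsed street-type field; a hypothetical sketch, not part of address-net:

```python
# Hypothetical expansion table for common US street-type abbreviations.
EXPANSIONS = {"ST": "STREET", "AVE": "AVENUE", "CIR": "CIRCLE"}

def expand_street_type(abbrev: str) -> str:
    # Normalise case and trailing periods before lookup.
    return EXPANSIONS.get(abbrev.upper().rstrip("."), abbrev)

print(expand_street_type("St"))  # STREET
```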

Also, amazing work by the author of the repo. Thanks!

poorlymac commented 2 years ago

@Aj-232425 sorry I haven't got around to doing it myself.

Aj-232425 commented 2 years ago

Hi @dylanhogg @jasonrig, I hope you are doing well. Following the steps you explained, I was able to generate TF records. I was even able to create a model, but it was not effective, giving very bad and unusual results. I found that while training I am getting an error: "0-th value returned by pyfunc_0 is int32, but expects int64". I am not sure what exactly I am doing wrong. Is there any way you could help me out with that?

Regards

dylanhogg commented 2 years ago

Hey @Aj-232425,

Do you have a branch with the training code you're running? Log output would be useful too, if possible.

It's been quite a while since I looked at this project. I do recall being able to train a new model that compared roughly to the pretrained model Jason supplied; the code I used is in my fork here: https://github.com/dylanhogg/address-net/commit/05e3a849663688e19e95ae507220496418629be5, which also pins tensorflow==1.15.

I'll try to loop back to this in the next few weeks. I agree with what Jason mentioned in this thread: rewriting in TF2 would be a good idea. Perhaps that is something to try?

Aj-232425 commented 2 years ago

Hey @dylanhogg, thanks for replying. Yes, I have also used tensorflow==1.15. What I did was generate TF records using the generate_tf_record.py file, and I used a dataset in the same format, with the same number of columns, as in the original repo. Later I tried training using the train.py file containing your training code. During training I am facing this error; it points somewhere in TF's session.py file. All I found in the log file (located at address-net-master) is:

```
2022-08-26 13:04:34,686 [INFO] main: tfrecord_input_file=G:/Dataset/GNAF.record
2022-08-26 13:04:34,686 [INFO] main: model_output_file=G:/Dataset/mod/
2022-08-26 13:04:34,686 [INFO] main: Get estimator...
2022-08-26 13:04:34,702 [INFO] main: Load dataset...
2022-08-26 13:04:34,702 [INFO] main: Training model...
```
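For context, this error typically means a tf.py_func in the input pipeline declared int64 outputs while the wrapped Python function returned int32 numpy arrays; a hedged sketch of the usual fix (the function body is hypothetical):

```python
import numpy as np

# Inside the function wrapped by tf.py_func, cast every integer array
# to int64 so it matches the dtype declared in the Tout argument.
def decode_record(raw):  # hypothetical py_func body
    labels = np.array([1, 2, 3], dtype=np.int32)  # e.g. what NumPy produced
    return labels.astype(np.int64)                # cast to the declared dtype
```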

dylanhogg commented 2 years ago

Hey @Aj-232425,

I've put together a (very rough) upgrade to TF2 here: https://github.com/dylanhogg/address-net/tree/tf2

I pulled fresh Aug22 GNAF data and regenerated the address_view csv and tfrecord datasets. Then locally I can train a new model with TF v2.9 (this code installs tensorflow-macos for Apple M1, but it should work for any distribution) and run predictions with it. I haven't trained on much data, so the model I got was poor quality; however, it proved the end-to-end process.

I hope that helps.

Aj-232425 commented 2 years ago

Thank you so much @dylanhogg for this. Well, I was able to retrain the model on the existing TF version 1.15, but I would definitely love to work further with your TF2 approach; that would be really helpful. I really appreciate it. Regards.

Aj-232425 commented 2 years ago

Hi @dylanhogg, I hope you are doing well. I have a few queries. First, can you please confirm whether training requires data in the billions, or whether half a million records can also give a good result? I provided almost one million address records, but the result is still not up to the mark. One more question: is there any way I can get a confidence score for the result, or anything that shows how accurate the results are when compared with the actual values? Thanks & regards