mrjleo / boilernet

Boilerplate Removal using Deep Learning
MIT License

Chrome-extension issue #4

Closed · ttang20913 closed this 4 years ago

ttang20913 commented 4 years ago

Hi, why is the unpacked Chrome extension .crx file different from the chrome-extension folder provided on GitHub?

mrjleo commented 4 years ago

Hi, could you elaborate on what specifically is different? The .crx contains the built extension and a pre-trained model.

ttang20913 commented 4 years ago

[screenshot: directory structure of the unpacked .crx]

This is the unzipped version of the chrome-extension folder. The structure is different from the chrome-extension folder on GitHub.

The background.js files in the two folders are also different.

mrjleo commented 4 years ago

Yeah, this is the built extension (as opposed to the source you see in the repository). You should get the same file structure if you build it yourself.

ttang20913 commented 4 years ago

OK, thanks for your help.

I have another question: how can I bypass the Chrome extension to get the predictions? Suppose I have an HTML file downloaded. What pre-processing steps are needed before calling TensorFlow's model.predict() function?

mrjleo commented 4 years ago

This use case is not implemented, but it should be easy enough to do.

You should not use the Chrome extension for this. Instead, load a pre-trained Keras model and vocabulary and pre-process your HTML files. You should find the necessary pre-processing functions in net/preprocess.py. Then you can simply call the model to get your predictions.
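In outline, that would look something like this (an untested sketch; the model file name and the pickled vocabulary files are placeholders, and the helper functions come from net/preprocess.py):

```python
import pickle

import tensorflow as tf

from preprocess import parse

# Load a pre-trained model (the file name here is a placeholder).
model = tf.keras.models.load_model("model.049.h5")

# Load the vocabularies used during training (assumed here to have
# been pickled beforehand; the file names are placeholders).
with open("words_map.pkl", "rb") as f:
    words_map = pickle.load(f)
with open("tags_map.pkl", "rb") as f:
    tags_map = pickle.load(f)

# Parse a local HTML file into its per-leaf representation.
doc_representation, tags, words = parse(["test.html"])
```

From there, the remaining step is to turn each leaf node into a feature vector and feed those to the model, which is what the rest of this thread works out.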

ttang20913 commented 4 years ago

Thanks for your suggestion. This is how I do the preprocessing:

```python
from preprocess import *
import tensorflow as tf

model = tf.keras.models.load_model('model.049.h5')
input_files = ["test.html"]
doc_representation, tags, words = parse(input_files)

for doc_feature_list, doc_label_list in get_doc_inputs(doc_representation['test.html'], words, tags):
    ...  # not sure how to proceed from here
```


In background.js of the Chrome extension, the function getInputs() returns a tensor. However, in preprocess.py, the function get_doc_inputs(), assuming this is the Python version of getInputs(), returns two lists.

So I am confused about the input format for prediction. Can you clarify the input format that is passed to model.predict(), e.g. the data type and vector dimensions? Thanks.

mrjleo commented 4 years ago

You need to call get_feature_vector for each leaf in the document. Check line 119 in that file. That should give you the correct model inputs.
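For example (an untested sketch; it assumes the vocabularies words_map and tags_map have been loaded as in the snippet further down, and that the parsed document is a list of (words_dict, tags_dict, label) tuples):

```python
# Build one feature vector per leaf node of the parsed document.
feature_vectors = []
for words_dict, tags_dict, label in doc_representation["test.html"]:
    feature_vectors.append(
        get_feature_vector(words_dict, tags_dict, words_map, tags_map)
    )
```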

ttang20913 commented 4 years ago

I followed your suggestion and now I can get the feature vector. However, its format still does not match what the model expects.

The error message is:

```
ValueError: Error when checking input: expected dense_input to have 3 dimensions, but got array with shape (1052, 1)
```

My implementation:

```python
from preprocess import *
import pickle
import tensorflow as tf

model = tf.keras.models.load_model('net/model.049.h5')
input_files = ["net/test.html"]

############# get tags and words from the training data
filenames = []
filenames.extend(util.get_filenames("datasets/googletrends/prepared_html/"))
data, tags_map, words_map = parse(filenames)

tags_map = get_vocabulary(tags_map, 50)
words_map = get_vocabulary(words_map, 1000)

with open("tags_map.pkl", "wb") as f1:
    pickle.dump(tags_map, f1)
with open("words_map.pkl", "wb") as f2:
    pickle.dump(words_map, f2)
#############

############# load tags and words
with open("tags_map.pkl", "rb") as f1:
    tags_map = pickle.load(f1)
with open("words_map.pkl", "rb") as f2:
    words_map = pickle.load(f2)
#############

############# get the feature vector for test.html
result, dummy1, dummy2 = parse(input_files)

for words_dict, tags_dict, label in result["test.html"]:
    feature_vector = get_feature_vector(words_dict, tags_dict, words_map, tags_map)
    print(feature_vector)
    print("prediction:", model.predict(feature_vector))
```
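For what it's worth, the ValueError above indicates that the model's input layer expects a 3-dimensional array of shape (batch, leaves, features), while this loop passes a single feature vector per predict() call. A possible fix, sketched and untested under the assumption that the model was trained on whole documents as sequences of leaf feature vectors, is to stack all leaf vectors of the document and add a batch dimension:

```python
import numpy as np

# Collect one feature vector per leaf node of the document.
feature_vectors = [
    get_feature_vector(words_dict, tags_dict, words_map, tags_map)
    for words_dict, tags_dict, label in result["test.html"]
]

# Stack into shape (1, num_leaves, num_features): the whole document
# becomes a single batch entry.
doc_input = np.expand_dims(np.stack(feature_vectors), axis=0)

# One prediction per leaf node.
predictions = model.predict(doc_input)
```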

mrjleo commented 3 years ago

Hi, please create a new issue for this.