Open domeniconappo opened 4 years ago
Huh, that's frustrating. I really didn't change that much beyond bumping the versions, so I'm not sure where the slowdown is coming from. Do you have a document that produces the ValueError that you can share?
I just started with mordecai last week, but I got the same problem described by @domeniconappo. After a lot of tests (changing versions, trying CUDA, etc.), nothing changed. Then I tried running it in a Jupyter notebook. I don't know why, but the analysis became a lot faster. The only library version that differs between @domeniconappo's setup and my own old script is tensorflow (1.14.0, installed by conda).
Hi @ahalterman, I am getting the same issue while using the package. In my case, it occurs because some irrelevant terms are identified as geo terms. After looking through the code, I found that at line 731 of geoparse.py, where we call:
prediction = self.country_model.predict(i['matrix']).transpose()[0]
the matrix generated for the word is empty, with shape (1, 0).
Let me know if we can skip entries with an empty matrix at the point where the matrix is built (line 722 of geoparse.py):
feat = self.make_country_matrix(loc)
Example of the geo-terms identified which are causing the issue:
{'labels': [], 'matrix': matrix([], shape=(1, 0), dtype=float64), 'word': 'organomercury'}
{'labels': [], 'matrix': matrix([], shape=(1, 0), dtype=float64), 'word': 'orangeiron'}
{'labels': [], 'matrix': matrix([], shape=(1, 0), dtype=float64), 'word': 'redoxygen'}
{'labels': [], 'matrix': matrix([], shape=(1, 0), dtype=float64), 'word': 'FeC10(HgCl)10'}
[{'text': 'organomercury', 'label': '', 'word': 'organomercury', 'spans': [{'start': 900, 'end': 913}], 'features': {'maj_vote': '', 'word_vec': '', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': '0', 'class_mention': '', 'code_mention': ''}}, {'text': 'Pbca', 'label': '', 'word': 'Pbca', 'spans': [{'start': 4644, 'end': 4648}], 'features': {'maj_vote': '', 'word_vec': '', 'first_back': 'POL', 'most_alt': 'CHN', 'most_pop': 'MEX', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': '0', 'class_mention': '', 'code_mention': ''}}, {'text': 'orangeiron', 'label': '', 'word': 'orangeiron', 'spans': [{'start': 6157, 'end': 6167}], 'features': {'maj_vote': '', 'word_vec': '', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': '0', 'class_mention': '', 'code_mention': ''}}, {'text': 'redoxygen', 'label': '', 'word': 'redoxygen', 'spans': [{'start': 6184, 'end': 6193}], 'features': {'maj_vote': '', 'word_vec': '', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': '0', 'class_mention': '', 'code_mention': ''}}, {'text': 'metallocene moiety', 'label': '', 'word': 'metallocene moiety', 'spans': [{'start': 6935, 'end': 6953}], 'features': {'maj_vote': '', 'word_vec': 'GNQ', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': 4.130288124084473, 'class_mention': '', 'code_mention': ''}}, {'text': '3.447(1)Å (Figure1C', 'label': '', 'word': '3.447(1)Å (Figure1C', 'spans': [{'start': 7585, 'end': 7604}], 'features': {'maj_vote': '', 'word_vec': 'TUR', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': 1.3494553565979004, 'class_mention': 
'', 'code_mention': ''}}, {'text': 'FeC10(HgCl)10', 'label': '', 'word': 'FeC10(HgCl)10', 'spans': [{'start': 12695, 'end': 12708}], 'features': {'maj_vote': '', 'word_vec': '', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': '0', 'class_mention': '', 'code_mention': ''}}, {'text': 'Deutsche Forschungsgemeinschaft', 'label': '', 'word': 'Deutsche Forschungsgemeinschaft', 'spans': [{'start': 13577, 'end': 13608}], 'features': {'maj_vote': '', 'word_vec': 'DEU', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': 10.370280265808105, 'class_mention': '', 'code_mention': ''}}, {'text': 'ZEDAT/FU Berlin', 'label': '', 'word': 'ZEDAT/FU Berlin', 'spans': [{'start': 13713, 'end': 13728}], 'features': {'maj_vote': '', 'word_vec': 'DEU', 'first_back': '', 'most_alt': '', 'most_pop': '', 'ct_mention': '', 'ctm_count1': 0, 'ct_mention2': '', 'ctm_count2': 0, 'wv_confid': 11.895607948303223, 'class_mention': '', 'code_mention': ''}}]
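The filtering idea described above could look roughly like the sketch below. It is not the actual change to geoparse.py: `has_usable_matrix` and `filter_features` are hypothetical helper names, the expected width of 12 is taken from the ValueError text quoted in this thread, and `SimpleNamespace` objects stand in for the NumPy matrices so the example runs without extra dependencies (only the `.shape` attribute is used).

```python
from types import SimpleNamespace

# Assumption: the country model expects 12 feature columns, as suggested by
# the "expected axis -1 of input shape to have value 12" error message.
EXPECTED_WIDTH = 12


def has_usable_matrix(feat, expected_width=EXPECTED_WIDTH):
    """Return True if feat['matrix'] has at least one row and the expected width."""
    shape = getattr(feat.get("matrix"), "shape", None)
    if shape is None or len(shape) != 2:
        return False
    rows, cols = shape
    return rows > 0 and cols == expected_width


def filter_features(features):
    """Drop entries whose matrix is empty or has the wrong width (hypothetical helper)."""
    return [feat for feat in features if has_usable_matrix(feat)]


# Stand-ins for the real matrices; a NumPy matrix exposes .shape the same way.
features = [
    {"labels": [], "matrix": SimpleNamespace(shape=(1, 0)), "word": "organomercury"},
    {"labels": ["DEU"], "matrix": SimpleNamespace(shape=(3, 12)), "word": "Berlin"},
]

kept = filter_features(features)
print([feat["word"] for feat in kept])  # prints ['Berlin']
```

With a check like this in place before the `predict` call, terms such as 'organomercury' would be skipped instead of raising the shape error for every document.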
Hi @ahalterman, I made the changes in geoparse.py and the issue no longer occurs. Let me know if the code changes below can be committed and pushed. geoparse.txt
@vupadhyaya19: can you open a pull request with your changes?
I'm hoping to make v3 public in July and that should resolve the issue because it switches from TF to pytorch, but I'd like to leave this version in a usable form for people who might stick with it.
Hi, @ahalterman! First of all, thank you for your work!
Looks like I have the same issue described above, so:
Hi, we updated mordecai to 2.1.0 and its dependencies: tensorflow to 2.3.0, spacy to 2.3.2, keras to 2.4.3.
Our geocoding processing is now much slower, and we've started to observe lots of errors printed to the console like the following:
ValueError: Input 0 of layer sequential is incompatible with the layer: expected axis -1 of input shape to have value 12 but received input with shape [None, 0]
It's not clear how this is influencing geocoding but for sure it's much slower as our queues are constantly building up and accumulating documents to be geoparsed.
Can you help? Is it a problem with deps versions?
Thank you in advance and for your great work!