ing-bank / EntityMatchingModel

Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
https://entitymatchingmodel.readthedocs.io/en/latest/
MIT License

AttributeError: 'TfidfTransformer' object has no attribute '_idf_diag' #16

Open githubmorgan opened 2 months ago

githubmorgan commented 2 months ago

Hi,

I am trying the example, and when I get to

    # instantiate a matching model
    nm = PandasEntityMatching({
        'name_only': True,
        'preprocessor': 'preprocess_merge_abbr',
        'indexers': [{
            'type': 'cosine_similarity',
            'tokenizer': 'words',
            'ngram': 1,
            'num_candidates': 5,
            'cos_sim_lower_bound': 0.2,
        }],
        'supervised_on': True,
        'supervised_model_filename': 'sem_nm.pkl',
        'supervised_model_dir': '.',
    })
    nm.fit(gt)

I'm getting the following error:

    AttributeError                            Traceback (most recent call last)
    Cell In[10], line 19
          2 nm = PandasEntityMatching({
          3     'name_only': True,
          4     'preprocessor': 'preprocess_merge_abbr',
       (...)
         14     'supervised_model_dir': '.',
         15 })
         17 # matching of names is done against the ground-truth dataset (gt).
         18 # for this we need to fit our indexers to the ground-truth.
    ---> 19 nm.fit(gt)

    File ~/.local/lib/python3.10/site-packages/emm/pipeline/pandas_entity_matching.py:251, in PandasEntityMatching.fit(self, ground_truth_df, copy_ground_truth)
        249 if copy_ground_truth:
        250     self.ground_truth_df = ground_truth_df.copy()
    --> 251 self.model = self.pipeline.fit(ground_truth_df)
        252 self.n_ground_truth = len(ground_truth_df)
        254 timer.log_param("n", self.n_ground_truth)

    File ~/.local/lib/python3.10/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
       1466     estimator._validate_params()
       1468 with config_context(
       1469     skip_parameter_validation=(
       1470         prefer_skip_nested_validation or global_skip_validation
    ...
    ---> 79 idf_diag = self._tfidf._idf_diag
         80 idf_diag = idf_diag - scipy.sparse.diags(np.ones(idf_diag.shape[0]), shape=idf_diag.shape, dtype=self.dtype)
         81 self._tfidf._idf_diag = idf_diag

    AttributeError: 'TfidfTransformer' object has no attribute '_idf_diag'

mbaak commented 2 months ago

Hello, thanks for posting. Which version of sklearn are you using? Then I can try to reproduce the issue.

githubmorgan commented 2 months ago

Hi,

I'm using scikit-learn 1.5.0.

Cheers,
Morgan

mbaak commented 1 month ago

Apologies for the delay. I can reproduce it - indeed it's due to changes in sklearn v1.5.0. I'll look into how to fix this.

githubmorgan commented 1 month ago

Many thanks.

vikas-tripathi commented 1 month ago

I am having the same issue. Many thanks for looking into it. Can you also let me know which version of sklearn it worked with before?

mbaak commented 1 month ago

We're looking at it. For now, using scikit-learn < 1.5.0 should solve it. (I may just make a quick patch release with this constraint for now.)
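
For anyone who wants to check whether their environment falls under this constraint, here is a minimal sketch using only the standard library. The helper names `is_affected` and `installed_sklearn_version` are made up for illustration and are not part of emm or scikit-learn. It also assumes that every release from 1.5.0 onward is affected, while the thread only confirms the error for 1.5.0 itself.

```python
import re
from importlib import metadata


def is_affected(version: str) -> bool:
    """Return True if `version` is scikit-learn 1.5 or newer.

    Assumption: all releases from 1.5.0 onward lack the private
    `_idf_diag` attribute; this thread only confirms it for 1.5.0.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 5)


def installed_sklearn_version():
    """Return the installed scikit-learn version, or None if it is absent."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_sklearn_version()
    if version is None:
        print("scikit-learn is not installed")
    elif is_affected(version):
        print(f"scikit-learn {version}: nm.fit() may raise the AttributeError above")
    else:
        print(f"scikit-learn {version}: below 1.5.0, the reported error should not occur")
```

If the check reports an affected version, pinning the dependency as suggested above (for example `pip install "scikit-learn<1.5.0"`) is the workaround until a fixed release is available.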