ing-bank / EntityMatchingModel

Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
https://entitymatchingmodel.readthedocs.io/en/latest/
MIT License

nm.save on spark model has multiple sources found for parquet #11

Closed Mpicca closed 6 months ago

Mpicca commented 6 months ago

Even adding .format at the end to specify a source does not work. The class might need to be updated.


mbaak commented 6 months ago

Hello, thanks for reporting this issue. I think I see the problem. The SparkEntityMatching class does not inherit from Spark's DataFrameWriter but from MLWriter, which does not support the format option (because normally it doesn't need it). In other words, .format("...") is not picked up here. However, we do store the ground-truth dataset, so the format option can be relevant. By default we set this to parquet, but at the moment there is no option to override it. I will add a format function; it should be straightforward to add.

The way it will work: nm.write().format("yourformat").save(path)
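To illustrate the chaining pattern behind that call, here is a minimal pure-Python sketch. The classes `SketchModel` and `SketchWriter` and the `metadata.json` file are hypothetical stand-ins invented for this example. They are not the actual emm or Spark classes, and the real fix in the package may look different.

```python
import json
import os
import tempfile


class SketchWriter:
    """Hypothetical stand-in for a model writer (not emm's or Spark's class)."""

    def __init__(self, model):
        self._model = model
        # Default storage format, mirroring the parquet default described above.
        self._format = "parquet"

    def format(self, source):
        """Remember the requested format and return self so calls can be chained."""
        self._format = source
        return self

    def save(self, path):
        """Write a small metadata file that records the chosen format."""
        os.makedirs(path, exist_ok=True)
        metadata = {
            "ground_truth_format": self._format,
            "params": self._model.params,
        }
        with open(os.path.join(path, "metadata.json"), "w") as fh:
            json.dump(metadata, fh)
        return metadata


class SketchModel:
    """Hypothetical model that hands out a writer, like nm.write() above."""

    def __init__(self, params):
        self.params = params

    def write(self):
        return SketchWriter(self)


if __name__ == "__main__":
    nm = SketchModel({"threshold": 0.5})
    out_dir = os.path.join(tempfile.mkdtemp(), "model")
    # Same call shape as in the comment above: write().format(...).save(path)
    print(nm.write().format("csv").save(out_dir))
```

The key point is that `format` only records the choice and returns the writer itself, so `save` can use the stored value later. Without a call to `format`, the sketch falls back to parquet.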

Btw, looking around for this error, I'm not fully certain, but it seems there's a Spark package dependency conflict in your setup. See for example here: https://stackoverflow.com/questions/63579754/error-while-read-or-write-parquet-format-data You may want to look into this and resolve it as well. (The format function is only a workaround.)
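As a rough aid for checking such a conflict, the sketch below scans a list of jar file names and reports artifacts that appear with more than one version. It assumes the conflict comes from duplicate jars on the classpath, which this thread does not confirm. The function `find_duplicate_artifacts` and the example file names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches names like "spark-sql_2.12-3.0.0.jar": artifact, dash, version, ".jar".
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_artifacts(jar_names):
    """Return {artifact: sorted versions} for artifacts present in several versions."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not look like versioned jars
        versions[match.group("artifact")].add(match.group("version"))
    return {
        artifact: sorted(found)
        for artifact, found in versions.items()
        if len(found) > 1
    }


if __name__ == "__main__":
    # Hypothetical listing; in practice, pass the file names from your jars folder.
    example = [
        "spark-sql_2.12-3.0.0.jar",
        "spark-sql_2.12-3.1.2.jar",
        "parquet-column-1.10.1.jar",
    ]
    print(find_duplicate_artifacts(example))
```

In practice the list could come from `os.listdir` on the directory that holds your Spark jars. Any artifact reported with two versions is a candidate to clean up.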

mbaak commented 6 months ago

I will pick this up in the next patch release, probably in the next couple of days.

mbaak commented 6 months ago

Resolved with v2.1.0.