Reading .xlsx files - Githubissues

genisplaja commented 1 year ago

Hello! As you might have seen, PR #560 is blocked because of an optional dependency that is missing:

E               ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.

That is because I am trying to load a dataset metadata .xlsx file using pandas.read_excel(). I tried to use different loading strategies but didn't work, that seems to be the standard way to load .xlsx files. Would openpyxl be a problematic dependency to have in mirdata, taking into account that we may have a dataset in the future that includes .xlsx files as well (I know is not common but who knows...). Otherwise we could include openpyxl as an optional dependency for the dataloader in #560. What do you think?

magdalenafuentes commented 1 year ago

Hey @genisplaja, we were discussing the issue of optional dependencies with @harshpalan the other day. We thought that a potential solution could be to add a check in loaders that need those optional dependencies, so when you initialize the dataset it will warn you that the dependency is missing and you should install it for that dataset to work. Would that fix your issue?

genisplaja commented 1 year ago

Yes! Now we were having a conversation with @nkundiushuti that maybe we could also try to change from .xslx to .csv, since we have access to the Zenodo entry, however, IMHO creating a newer version just for that is a little bit overkill.

What you actually propose I think would be nice. We could even use pipdeptree to check if the user has the particular optional dependencies installed, and if not, throw the warning.

mir-dataset-loaders / mirdata

Reading .xlsx files #562