Open gpierard opened 9 months ago
The GenericDataSet is something that has to be maintained on a 'best effort' basis. Pandas is such an old and complicated library that sometimes the assumptions we need to make don't hold in all cases. A good example of this is that the first kwarg of pd.read_csv is "filepath_or_buffer", whereas for pd.read_excel it is "io"...
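To make the naming difference concrete, here is a small sketch that inspects the pandas readers directly. The helper first_parameter is a hypothetical name introduced only for this illustration, and the snippet assumes pandas is installed.

```python
import inspect

import pandas as pd


def first_parameter(func):
    """Return the name of the first parameter in func's signature."""
    # first_parameter is a hypothetical helper, not part of pandas or kedro.
    return next(iter(inspect.signature(func).parameters))


# Print the first argument name each reader expects.
for reader in (pd.read_csv, pd.read_excel):
    print(reader.__name__, "->", first_parameter(reader))
```

Because the readers disagree on this name, passing the file as the first positional argument is probably the only call style that works across all of them.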
At a high level I think pd.read_fwf should work since it requires no additional dependencies and meets the requirement of pointing to a file path (which pd.read_clipboard, for example, doesn't).
There is a chance that our assumptions don't work, so it would be super helpful if you could let us know why and we can make the change. Additionally, would you find it helpful or annoying to have a log message for every read saying which loader has been retrieved?
So there are a few ways. I think the easiest way to prove this is to copy the implementation to a locally accessible file in your repo and change the classpath in your catalog to:
raw_data_soap:
  type: PartitionedDataSet
  path: mypath/
  dataset:
    type: my_project.path.to.pandas.GenericDataSet
    ...
You can then use a debugger or add logging calls to check that the right loader is being used.
Personally I'd find it quite helpful to have the loader specified, especially since it is not self-evident which load method is being used (or perhaps this can be added in the doc). Thanks!
Description
I am using a GenericDataSet to read txt.gz files, but I can't seem to print the actual method selected by kedro / pandas to read it. The GenericDataSet doc specifies that
pandas.GenericDataSet loads/saves data from/to a data file using an underlying filesystem (e.g.: local, S3, GCS). It uses pandas to dynamically select the appropriate type of read/write target on a best effort basis.
and that
Parameters: file_format (str) – String which is used to match the appropriate load/save method on a best effort basis. For example if ‘csv’ is passed in the pandas.read_csv and pandas.DataFrame.to_csv will be identified. An error will be raised unless at least one matching read_{file_format} or to_{file_format} method is identified.
But how can I see at runtime what method is selected? In my case that is what is defined in the Catalog
From the arguments I suppose that read_fwf from pandas is used (pandas.read_fwf.html). How can I make sure? Running inspect.getsource(data_loader) at runtime only shows the generic kedro methods, not the one actually selected by pandas to read my dataset. data_loader() returns the already-read data.
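One way to reason about which reader would be picked is to reproduce the documented read_{file_format} matching rule by hand. The sketch below assumes the lookup is a plain attribute lookup on the pandas module, which is an assumption based on the doc text quoted above, not a copy of kedro's code. The function resolve_loader is a hypothetical name.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def resolve_loader(file_format: str):
    """Return the pandas function named read_<file_format>, or raise if absent."""
    # Hypothetical helper: mirrors the documented naming pattern only.
    loader = getattr(pd, f"read_{file_format}", None)
    if loader is None:
        raise ValueError(f"pandas has no read_{file_format} function")
    # Log the fully qualified name so the chosen reader is visible at runtime.
    logger.info("Resolved loader: %s.%s", loader.__module__, loader.__name__)
    return loader


logging.basicConfig(level=logging.INFO)
loader = resolve_loader("fwf")
print(loader.__name__)
```

If the file_format in the catalog is fwf, this suggests pd.read_fwf would be the match. Confirming it inside a running pipeline would still need a debugger or a logging call in a local copy of the dataset class, as suggested earlier in the thread.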