harvard-nrg / lochness

Download your data to a data lake.

`tree.get` always returns `raw` directory path #12

Open kcho opened 3 years ago

kcho commented 3 years ago

Hi,

In an example file structure that I received from Habib, the data in the source (e.g. in Dropbox) are arranged under either raw or processed directories.

The lochness.tree.get function returns the path of the directory where the data should be saved locally. It also creates the folder if it does not exist locally.

However, lochness.tree.get always returns the raw_folder variable, so all files, including those in the processed folder, are downloaded into the raw directory.

I think that for files saved under the processed folder in the source, tree.get should return the processed_folder variable rather than raw_folder.

https://github.com/harvard-nrg/lochness/blob/26fae812d761b45348c3c5992b08a36d3ad2d127/lochness/tree/__init__.py#L65-L83
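To make the reported behavior concrete, here is a minimal sketch. It is a simplified stand-in, not the actual lochness code: the `TEMPLATES` contents, the `actigraphy` entry, the path layout, and the function name `get_current` are all assumptions for illustration, and directory creation is left out so the snippet has no side effects.

```python
from string import Template

# Hypothetical, simplified templates: one entry per data type,
# each holding both a raw and a processed path pattern.
TEMPLATES = {
    'actigraphy': {
        'raw': Template('${base}/actigraphy/raw'),
        'processed': Template('${base}/actigraphy/processed'),
    },
}


def get_current(data_type, base):
    """Mimic the reported behavior: both paths are built, only raw is returned."""
    entry = TEMPLATES[data_type]
    raw_folder = entry['raw'].substitute(base=base)
    # Built but never returned, so callers cannot reach it.
    processed_folder = entry['processed'].substitute(base=base)
    return raw_folder


# A file that lives under "processed" in the source still gets the raw path.
print(get_current('actigraphy', '/data/subject01'))
# prints: /data/subject01/actigraphy/raw
```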

tashrifbillah commented 3 years ago

Hi, can you give example tree.get() calls for raw and processed?

tashrifbillah commented 3 years ago

Looking at the code, isn't the fix as simple as:

if 'raw' in Templates[type]:
  return raw_folder
elif 'processed' in Templates[type]:
  return processed_folder

instead of https://github.com/harvard-nrg/lochness/blob/26fae812d761b45348c3c5992b08a36d3ad2d127/lochness/tree/__init__.py#L83

kcho commented 3 years ago

> Looking at the code, isn't the fix as simple as:
>
> if 'raw' in Templates[type]:
>   return raw_folder
> elif 'processed' in Templates[type]:
>   return processed_folder
>
> instead of https://github.com/harvard-nrg/lochness/blob/26fae812d761b45348c3c5992b08a36d3ad2d127/lochness/tree/__init__.py#L83

No, currently Templates is a dictionary defined at the top of tree/__init__.py. For each data type, this dictionary stores both the raw and the processed local paths where the source data should be saved, so checking whether 'raw' is in Templates[type] does not tell us which of the two a given file belongs to.

I've added a new function that takes into account whether the data is processed or raw. https://github.com/PREDICT-DPACC/lochness/pull/1/commits/4a5ccb38de24f57a13ce940494022e88640b2b00#diff-9f1909201502c68a670bf0d0022cb9f1f74052dc8f980593cb18ffcaaaff2f40

Alternatively, the original tree.get function could be updated, and then the new function above could be removed!
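As a rough sketch of that idea (not the function from the linked commit), the caller could say which folder it wants through a flag. The `TEMPLATES` contents, the `mri` entry, the function name `get_folder`, and the `processed` parameter are hypothetical, and the sketch assumes each entry carries both keys.

```python
from string import Template

# Hypothetical templates; each data type is assumed to define both keys.
TEMPLATES = {
    'mri': {
        'raw': Template('${base}/mri/raw'),
        'processed': Template('${base}/mri/processed'),
    },
}


def get_folder(data_type, base, processed=False):
    """Return the raw or processed folder, chosen by the caller."""
    entry = TEMPLATES[data_type]
    key = 'processed' if processed else 'raw'
    return entry[key].substitute(base=base)


# Both keys exist for the type, so a membership test alone cannot
# distinguish raw data from processed data.
print('raw' in TEMPLATES['mri'], 'processed' in TEMPLATES['mri'])
# prints: True True

print(get_folder('mri', '/data/s1'))
# prints: /data/s1/mri/raw
print(get_folder('mri', '/data/s1', processed=True))
# prints: /data/s1/mri/processed
```

Defaulting `processed` to False keeps existing callers on the raw path, which is one way to avoid touching every call site.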

tokeefe commented 3 years ago

I propose just having it return a tuple of

return raw_folder, processed_folder

That way you call the function and it outputs both folders. One could be None, but that would be ok.

Of course, this would change the API so we'd have to hunt down everywhere tree.get is called and change it. But the change would be minimal e.g.,

raw, processed = tree.get('type', ...)
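A minimal sketch of the tuple-returning variant, under the same caveats: `TEMPLATES`, the `mri` and `surveys` entries, and the name `get_both` are hypothetical and only illustrate the shape of the change, including the case where one folder is None.

```python
from string import Template

# Hypothetical templates; 'surveys' deliberately has no processed entry.
TEMPLATES = {
    'mri': {
        'raw': Template('${base}/mri/raw'),
        'processed': Template('${base}/mri/processed'),
    },
    'surveys': {
        'raw': Template('${base}/surveys/raw'),
    },
}


def get_both(data_type, base):
    """Return (raw_folder, processed_folder); either may be None."""
    entry = TEMPLATES[data_type]
    raw = entry['raw'].substitute(base=base) if 'raw' in entry else None
    processed = (
        entry['processed'].substitute(base=base)
        if 'processed' in entry else None
    )
    return raw, processed


raw, processed = get_both('mri', '/data/s1')
print(raw, processed)
# prints: /data/s1/mri/raw /data/s1/mri/processed

raw, processed = get_both('surveys', '/data/s1')
print(raw, processed)
# prints: /data/s1/surveys/raw None
```

Callers then pick the element they need, at the cost of updating every existing call that expects a single path.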