Open pmeier opened 2 years ago
I agree that we should align the two and the default being the file name is likely better than URL.
I wonder if we should add an optional argument which can be either:
I think adding 2 to _get_response_from_http
, HttpReader
, and OnlineReader
makes sense, but it may be over-engineering so I'm open to other suggestions. We probably do not want to change GDriveReader
since returning a URL for GDrive
is unlikely to be what users want.
cc: @ejguan
I agree that we should align the two and the default being the file name is likely better than URL.
I am not sure about it. If any url contains directory path, idk this is the idea behavior. For example
dp = IterableWrapper(
[
"https://abc.com/folder1/file1.txt",
"https://abc.com/folder2/file1.txt",
]
)
Returning filename by default becomes a problem for users as there will be duplicate filenames.
So I would prefer option 2 but not enabled it by default for HttpReader
. And, in order to align the behavior within OnlineReader
, we might have to provide this default functions. This makes the behavior more deverged.
I do have a question about the use case of OnlineReader
. Do we actually have such Dataset held in different remote sources? @pmeier
One potential use case might be multi-model. But, for multi-model with different data sources, they also need different pipeline to run pre-processing. Then, IMO, it makes more sense to have a separate Reader
for each data source and run a few operations after each Reader
, then combine these pipelines together.
Another thing is the other related DataPipe
s are returning URL not filename such as FSSpec
and IoPath
.
I understand Google Drive is special case because the URLs don't contain any file name or file path. So, in order to have an aligned result from OnlineReader
, it seems we have to return filename and provide a function to retrieve file path/name for urls not for google drive.
I do have a question about the use case of OnlineReader. Do we actually have such Dataset held in different remote sources?
I don't think so, no. But that shouldn't be the issue TBH. OnlineReader
is about convenience. The contract the user enters is "Here is a datapipe of URLs. Please give me back a datapipe of streams of the URL data as well as some information identifying each stream (URL / path / ...)". As a user I'm not interested what is happening in the backend, i.e. if I need a different functionality for downloading from GDrive rather than a plain HTTP object.
This is the same argument I'm making in https://github.com/pytorch/vision/pull/6060#issuecomment-1134297103 for loading of archives. In both cases as user I'm willing to trade specific control for convenience.
Note that I'm not saying that we shouldn't have the individual classes. If one for example only wants to perform plain HTTP requests, they can use the HttpReader
which has no special handling for GDrive.
Understood about the convenience for users. I am more concern about how to maintain this OnlineReader
.
So, to achieve a common ground, we might need to:
filepath_fn
for urls returned by all Reader
s (disabled by default due to BC)OnlineReader
but provide a default method to return filename. Since we can extend OnlineReader
to handle more types of URLs such as S3
, do we want to let users to provide a dictionary of functions to extract files based on URL types?Reader
s
HTTPReader
returns the URL for the "path"https://github.com/pytorch/data/blob/13b574c80e8732744fee6ab9cb7e35b5afc34a3c/torchdata/datapipes/iter/load/online.py#L46
while
GDriveReader
returns the file namehttps://github.com/pytorch/data/blob/13b574c80e8732744fee6ab9cb7e35b5afc34a3c/torchdata/datapipes/iter/load/online.py#L129
Since
OnlineReader
determines at runtime whether to call the HTTP or GDrive downloadhttps://github.com/pytorch/data/blob/13b574c80e8732744fee6ab9cb7e35b5afc34a3c/torchdata/datapipes/iter/load/online.py#L198-L203
the "path" of the yielded tuples is impossible to predict:
We should align the two. My vote is out to align based on the file name. Still, returning the URL could also be useful if redirect logic as discussed in https://github.com/pytorch/vision/pull/6060#pullrequestreview-980646299 is added to the
HTTPReader
.