pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License
1.13k stars 152 forks source link

Returned "path" of `HTTPReader` and `GDriveReader` diverges #451

Open pmeier opened 2 years ago

pmeier commented 2 years ago

HTTPReader returns the URL for the "path"

https://github.com/pytorch/data/blob/13b574c80e8732744fee6ab9cb7e35b5afc34a3c/torchdata/datapipes/iter/load/online.py#L46

while GDriveReader returns the file name

https://github.com/pytorch/data/blob/13b574c80e8732744fee6ab9cb7e35b5afc34a3c/torchdata/datapipes/iter/load/online.py#L129

Since OnlineReader determines at runtime whether to call the HTTP or GDrive download

https://github.com/pytorch/data/blob/13b574c80e8732744fee6ab9cb7e35b5afc34a3c/torchdata/datapipes/iter/load/online.py#L198-L203

the "path" of the yielded tuples is impossible to predict:

from torchdata.datapipes.iter import IterableWrapper, OnlineReader

dp = IterableWrapper(
    [
        "https://raw.githubusercontent.com/pytorch/data/main/LICENSE",
        "https://drive.google.com/uc?export=download&id=1GO-BHUYRuvzr1Gtp2_fqXRsr9TIeYbhV",
    ]
)
dp = OnlineReader(dp)

for path, _ in dp:
    print(path)
https://raw.githubusercontent.com/pytorch/data/main/LICENSE
torchvision.txt

We should align the two. My vote is out to align based on the file name. Still, returning the URL could also be useful if redirect logic as discussed in https://github.com/pytorch/vision/pull/6060#pullrequestreview-980646299 is added to the HTTPReader.

NivekT commented 2 years ago

I agree that we should align the two and the default being the file name is likely better than URL.

I wonder if we should add an optional argument which can be either:

  1. a boolean that determines whether it will return the full URL or the file name
  2. a callable that maps URL to whatever the users want, with default being a function that extract the file name (last part of the URL)

I think adding 2 to _get_response_from_http, HttpReader, and OnlineReader makes sense, but it may be over-engineering so I'm open to other suggestions. We probably do not want to change GDriveReader since returning a URL for GDrive is unlikely to be what users want.

cc: @ejguan

ejguan commented 2 years ago

I agree that we should align the two and the default being the file name is likely better than URL.

I am not sure about it. If any url contains directory path, idk this is the idea behavior. For example

dp = IterableWrapper(
    [
        "https://abc.com/folder1/file1.txt",
        "https://abc.com/folder2/file1.txt",
    ]
)

Returning filename by default becomes a problem for users as there will be duplicate filenames. So I would prefer option 2 but not enabled it by default for HttpReader. And, in order to align the behavior within OnlineReader, we might have to provide this default functions. This makes the behavior more deverged.

I do have a question about the use case of OnlineReader. Do we actually have such Dataset held in different remote sources? @pmeier

One potential use case might be multi-model. But, for multi-model with different data sources, they also need different pipeline to run pre-processing. Then, IMO, it makes more sense to have a separate Reader for each data source and run a few operations after each Reader, then combine these pipelines together.

ejguan commented 2 years ago

Another thing is the other related DataPipes are returning URL not filename such as FSSpec and IoPath.

I understand Google Drive is special case because the URLs don't contain any file name or file path. So, in order to have an aligned result from OnlineReader, it seems we have to return filename and provide a function to retrieve file path/name for urls not for google drive.

pmeier commented 2 years ago

I do have a question about the use case of OnlineReader. Do we actually have such Dataset held in different remote sources?

I don't think so, no. But that shouldn't be the issue TBH. OnlineReader is about convenience. The contract the user enters is "Here is a datapipe of URLs. Please give me back a datapipe of streams of the URL data as well as some information identifying each stream (URL / path / ...)". As a user I'm not interested what is happening in the backend, i.e. if I need a different functionality for downloading from GDrive rather than a plain HTTP object.

This is the same argument I'm making in https://github.com/pytorch/vision/pull/6060#issuecomment-1134297103 for loading of archives. In both cases as user I'm willing to trade specific control for convenience.

Note that I'm not saying that we shouldn't have the individual classes. If one for example only wants to perform plain HTTP requests, they can use the HttpReader which has no special handling for GDrive.

ejguan commented 2 years ago

Understood about the convenience for users. I am more concern about how to maintain this OnlineReader.

So, to achieve a common ground, we might need to: