traja-team / traja

Python tools for spatial trajectory and time-series data analysis
https://traja.readthedocs.io
MIT License
Trajectory Datasets #28

Open JustinShenk opened 3 years ago

JustinShenk commented 3 years ago

Enable loading trajectory datasets via the Traja API:

An early attempt, designed for pedestrian datasets (hence ped_id): https://github.com/traja-team/traja/blob/master/traja/datasets/dataset.py and data/loader.py.

Returns a TrajaDataFrame, i.e. a pandas DataFrame converted via trj = traja.TrajaDataFrame(df) (see https://traja.readthedocs.io/en/latest/reading.html for more on this).

An API similar to GeoPandas would be nice (https://stackoverflow.com/a/51625390/6256888), e.g., traja.datasets.available. Look here for more inspiration: https://github.com/geopandas/geopandas/tree/master/geopandas/datasets.
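To make the GeoPandas-style idea concrete, here is a minimal sketch of what a name-to-file registry with an `available` list could look like. Everything in it is an assumption for illustration: the `_DATA_DIR` location, the dataset names, and the `get_path` helper are hypothetical and are not part of traja today.

```python
from pathlib import Path

# Hypothetical location for bundled CSV files (not a real traja path).
_DATA_DIR = Path("traja_datasets")

# Hypothetical registry: placeholder dataset names mapped to CSV files.
_REGISTRY = {
    "pedestrian": _DATA_DIR / "pedestrian.csv",
    "mallard": _DATA_DIR / "mallard.csv",
}

# Public list of dataset names, in the spirit of `traja.datasets.available`.
available = sorted(_REGISTRY)


def get_path(name: str) -> Path:
    """Return the file path registered for a dataset name."""
    try:
        return _REGISTRY[name]
    except KeyError:
        # Report the valid names so a typo is easy to spot.
        raise ValueError(f"{name!r} is not one of {available}") from None
```

A loader could then call `pd.read_csv(get_path(name))` and wrap the result in `traja.TrajaDataFrame`, so the registry stays independent of pandas.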

JustinShenk commented 3 years ago

@Saran-nns was the current dataset.py written by you? Do you mind if it is hacked up to output a dataframe instead of a Torch tensor?

Saran-nns commented 3 years ago

Apart from what is mentioned above, dataset.py in PR #26 contains additional functions to prepare the data loaders. It borrows several utility functions from datasets.utils to extract and preprocess the data. So I think it would be convenient to set up a new helper function in datasets.utils that creates a traja dataframe from a CSV file or one of the available datasets; it could then be called inside datasets.utils.generate_dataset(df, n_past, n_future).

At the moment, generate_dataset(df, n_past, n_future) in datasets.utils receives a pd.DataFrame as input and returns tensors of train and test time-series datasets, along with the corresponding categories (IDs), which are then fed into the dataloaders.

So we expect a separate utility function for the available datasets, such as:

import pandas as pd
import traja

def load_data(dataset: str):
    # Precheck: make sure the requested dataset is known
    try:
        dataset = traja.datasets.utils.load_data(dataset)  # locate the csv file for this dataset
    except Exception:
        raise ValueError(f"{dataset} is not in {list(traja.datasets.utils.available())}")
    # Load the data
    df = pd.read_csv(dataset)
    return traja.TrajaDataFrame(df)

Once this is done, we can easily make the traja dataframe the default data format by replacing the isinstance(df, pd.DataFrame) check with isinstance(df, traja.TrajaDataFrame) inside traja.datasets.utils.generate_dataset().
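For readers unfamiliar with the n_past / n_future arguments, here is a rough sketch of the kind of windowing they suggest. It assumes a stride-1 sliding window over a single series; the actual generate_dataset in PR #26 may differ, and it additionally returns tensors, per-ID categories, and a train/test split. The `make_windows` name is hypothetical.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], n_past: int, n_future: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split one series into (past, future) window pairs with a stride of 1."""
    past, future = [], []
    # Last index at which a full past + future window still fits.
    last_start = len(series) - n_past - n_future
    for start in range(last_start + 1):
        split = start + n_past
        past.append(list(series[start:split]))
        future.append(list(series[split:split + n_future]))
    return past, future


past, future = make_windows([0, 1, 2, 3, 4, 5], n_past=3, n_future=2)
# past   -> [[0, 1, 2], [1, 2, 3]]
# future -> [[3, 4], [4, 5]]
```

A series shorter than n_past + n_future simply yields no windows, which is one reasonable way to handle short trajectories.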

WolfByttner commented 3 years ago

@justinshenk the current handling is intended to be a middle ground between Torch and Pandas. The neural networks require time series and just about nothing else does, so time series are handled as tensors. However, I agree that the networks should output dataframes when they are 'done' so things can interoperate with the rest of Traja. I am just a bit unclear on the finer details of this interface.

Saran-nns commented 3 years ago

We haven't added the functions for post-training predictions/inference yet. I will update Trainer to return the network's predictions on the test dataset as a traja dataframe.

Saran-nns commented 3 years ago

@WolfByttner I am preparing a UML diagram for traja commit #26. That should help guide collaborators.

WolfByttner commented 3 years ago

This (rather huge) Mallard dataset has temperature as a possible regression parameter: https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study3109235

You also have geese here (with temps, though slightly less volatile): https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study83912796

https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study577905925 - This dataset has genders and temporal classes. Very interesting.

https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study933711994