Understanding how Sequence Works

jamespinkerton commented 1 month ago

Hi. I looked at #4672 and found that the advice for a dataset that is too large is to use Sequence. However, I can't find good documentation on Sequence, and I'm having trouble understanding how it works.

My use case: I have multiple files of floating-point numbers on Google Cloud Storage. Each file contains all of the features, but a different range of the samples. Because the values are 4-byte floats, the raw data doesn't fit in memory on my machine. However, it does fit once it has been converted to a Dataset.

I was hoping I could write a custom Sequence class that downloads these files when they are requested, but when I do this I get lots of random-access requests, and I can't download the data that many times.

I was hoping for some advice on how the Sequence API works. Do I need to provide a list of sequences? Does the batch size refer to the number of samples returned at each index, or does it refer to the requested total number of samples at a time? Is there a way to download the data in one stream, or does the dataset have to see the data multiple times to be constructed? Does Sequence not actually solve my problem?

Thanks so much.

jameslamb commented 1 month ago

Thanks for using LightGBM.

A minimal, reproducible example of what you tried would be helpful (here are some docs on how to do that). For example, you didn't tell us what version of lightgbm you're using, what operating system, etc. Please do that in future reports.

The lightgbm.Sequence class is an abstract class, which shows the API that you need to implement. Here are some resources you might find helpful:

the source code: https://github.com/microsoft/LightGBM/blob/59a3432b9d26290fcf25ba12b82feedd05384832/python-package/lightgbm/basic.py#L889
an example in this repo's examples directory (using multiple HDF5 files): https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/dataset_from_multi_hdf5.py
in-memory numpy implementation used in unit tests: https://github.com/microsoft/LightGBM/blob/59a3432b9d26290fcf25ba12b82feedd05384832/tests/python_package_test/test_basic.py#L98
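
For orientation, here is a minimal sketch of the interface those resources describe, based on my reading of the linked source rather than on official documentation. A subclass implements `__len__` and `__getitem__`, where an integer index returns one row (1-D array) and a slice returns a block of rows (2-D array). `batch_size` appears to be the number of rows requested per slice when the data is read sequentially, not the total number of rows held at once. The class name `ArraySequence` and the `calls` log are made up for illustration, and the `try`/`except` only exists so the sketch runs without LightGBM installed.

```python
import numpy as np

try:
    import lightgbm as lgb
    _Base = lgb.Sequence
except ImportError:  # lets the sketch run without LightGBM installed
    _Base = object


class ArraySequence(_Base):
    """Hypothetical in-memory sequence that logs how it is indexed."""

    def __init__(self, array, batch_size=4096):
        self.array = array
        # Rows per slice during a sequential read (assumption from the source).
        self.batch_size = batch_size
        self.calls = []

    def __getitem__(self, idx):
        # Record whether we were asked for a single row or a slice of rows.
        self.calls.append(type(idx).__name__)
        return self.array[idx]

    def __len__(self):
        return len(self.array)


X = np.arange(40, dtype=np.float32).reshape(10, 4)
seq = ArraySequence(X, batch_size=4)
row = seq[2]      # integer index: one row, shape (4,)
batch = seq[0:4]  # slice: a block of rows, shape (4, 4)

# With LightGBM installed, a Dataset accepts one Sequence or a list of them:
# ds = lgb.Dataset([seq_a, seq_b], label=y)
```

Inspecting `seq.calls` after building a real Dataset is one way to see which access pattern (single rows versus slices) your own class receives.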

I think Sequence is a good way to accomplish what you're trying to accomplish.

But it's been a while since I personally worked with this API, so I can't provide a reproducible example right now. When I find time, if no one else has answered your questions by then, I'll try to create one with the publicly-available data on S3 from https://github.com/ContinuumIO/anaconda-package-data to demonstrate how to do what you're trying to do.

jamespinkerton commented 1 month ago

I think you found some documentation I couldn't find. This is very helpful, thank you. Given that I have to randomly sample the data, it looks like Sequence presents some challenges. My data comes in chunks of 1 million samples per chunk, and I have about 200 chunks. Is there an easy way to do the random sampling given this constraint?
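
One possible approach, sketched under assumptions rather than tested against real cloud storage: map each global row index to a (chunk, offset) pair and keep only the most recently used chunk in memory. If the sampled row indices arrive in increasing order (an assumption worth verifying by logging the calls), each chunk is downloaded at most once during sampling and once more during the sequential read. `ChunkedSequence`, `load_chunk`, and `chunk_sizes` are hypothetical names; `load_chunk(k)` stands in for whatever downloads file `k` and returns a 2-D array.

```python
import numpy as np

try:
    import lightgbm as lgb
    _Base = lgb.Sequence
except ImportError:  # lets the sketch run without LightGBM installed
    _Base = object


class ChunkedSequence(_Base):
    """Hypothetical sequence over row chunks with a one-chunk cache."""

    def __init__(self, load_chunk, chunk_sizes, batch_size=4096):
        self.load_chunk = load_chunk
        # offsets[k] is the global index of the first row of chunk k.
        self.offsets = np.concatenate(([0], np.cumsum(chunk_sizes)))
        self.batch_size = batch_size
        self._cached_id = None
        self._cached = None
        self.loads = 0  # number of chunk loads, for inspection

    def _chunk_id(self, i):
        return int(np.searchsorted(self.offsets, i, side="right")) - 1

    def _chunk(self, chunk_id):
        # Only fetch when the requested chunk differs from the cached one.
        if chunk_id != self._cached_id:
            self._cached = self.load_chunk(chunk_id)
            self._cached_id = chunk_id
            self.loads += 1
        return self._cached

    def _row(self, i):
        k = self._chunk_id(i)
        return self._chunk(k)[i - self.offsets[k]]

    def __getitem__(self, idx):
        if isinstance(idx, (int, np.integer)):
            return self._row(int(idx))
        if isinstance(idx, slice):
            # Assumes a step of 1 and a non-empty range.
            start, stop, _ = idx.indices(len(self))
            parts = []
            while start < stop:
                k = self._chunk_id(start)
                lo, hi = self.offsets[k], self.offsets[k + 1]
                end = min(stop, hi)
                parts.append(self._chunk(k)[start - lo:end - lo])
                start = end
            return np.concatenate(parts)
        # Otherwise treat idx as a list of row indices.
        return np.stack([self._row(int(i)) for i in idx])

    def __len__(self):
        return int(self.offsets[-1])


# Demo with fake in-memory "files"; a real load_chunk would download a file.
chunks = [np.full((5, 3), k, dtype=np.float32) for k in range(4)]
seq = ChunkedSequence(lambda k: chunks[k], [5, 5, 5, 5], batch_size=8)
for start in range(0, len(seq), seq.batch_size):
    block = seq[start:start + seq.batch_size]
```

In the demo, reading all 20 rows in slices of 8 touches each of the 4 chunks once, so `seq.loads` ends at 4. If the sampling pass turns out to request rows in an unsorted order, a larger cache or a local on-disk copy of the chunks would be needed instead.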

Understanding how Sequence Works #6656