neptune-ai / neptune-client

πŸ“˜ The experiment tracker for foundation model training
https://neptune.ai
Apache License 2.0

Feature Request: Stop truncating text in project datasets #1657

Open Ulipenitz opened 8 months ago

Ulipenitz commented 8 months ago

Is your feature request related to a problem? Please describe.

Related to this issue: #653

Describe the solution you'd like

Since my dataset does not fit on my local disk, I am uploading it to the project in a loop like this:

from neptune.utils import stringify_unsupported

project[DATA_PATH].append(
    stringify_unsupported(
        {
            "tokens": ["text", ..., "text"],
            "ner_tags": ["tag", ..., "tag"],
        }
    )
)

Truncation to 1000 characters destroys my dataset. To my knowledge, there is no other way to upload a dataset directly from memory (without saving it to a local file first), so this feature would be great!
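For context, a minimal way to reproduce the truncation I am running into would look roughly like this (the field name is just a placeholder):

import neptune

project = neptune.init_project()

# Appending a string longer than 1000 characters to a string series:
# only the first 1000 characters end up stored on Neptune.
project["debug/long_text"].append("a" * 5000)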

Describe alternatives you've considered

I am thinking about saving these dicts {"tokens": ["text", ..., "text"], "ner_tags": ["tag", ..., "tag"]} to a file in each iteration and uploading each one as a file (e.g. data/train/0.pkl, data/train/1.pkl, ..., data/train/70000.pkl), roughly as sketched below. My dataset has 70,000 rows, so this is not a nice solution: I would have to create a file, upload it to Neptune, and delete it locally 70,000 times. Downloading the data would get messy as well.
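For reference, the file-based workaround I'm considering would look roughly like this (the chunks iterable, paths, and field layout are only illustrative):

import os
import pickle as pkl

import neptune

project = neptune.init_project()

for i, chunk in enumerate(chunks):  # chunks: the in-memory dicts, e.g. 70,000 of them
    local_path = f"{i}.pkl"
    with open(local_path, "wb") as f:
        pkl.dump(chunk, f)                         # write the dict to a local file
    project[f"data/train/{i}"].upload(local_path)  # upload it as a file field
    project.wait()                                 # make sure the async upload finished
    os.remove(local_path)                          # delete the local copy before the next iteration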

SiddhantSadangi commented 8 months ago

Hey @Ulipenitz πŸ‘‹

I've passed on this feature request to the product team for consideration and will keep the thread updated.

Meanwhile, as a workaround, can you upload the dataset to Neptune as a serialized object? Given the size of the dataset, I am assuming you wouldn't need it to be in a human-readable format on Neptune (but please correct me if I am wrong).

You can upload the dataset as a pickle directly from memory using neptune.types.File.as_pickle(). It would look like this:

import neptune
from neptune.types import File 

DATA_PATH = "data/train"

data = {
    "tokens": ["text",..., "text"],
    "ner_tags": ["tag",...,"tag"]
}

project = neptune.init_project()

for i in range(10):
    # Field paths are strings, so cast the index; each iteration uploads one pickled chunk
    project[DATA_PATH][str(i)].upload(File.as_pickle(data))

To download and use the dataset, you can download it from the project and load it using pickle:

import pickle as pkl

# Download the pickle uploaded in iteration i
project[DATA_PATH][str(i)].download()

# DOWNLOADED_FILE_PATH is the local path the file was downloaded to
with open(DOWNLOADED_FILE_PATH, "rb") as f:
    downloaded_dataset = pkl.load(f)

Please let me know if this would work for you πŸ™

Ulipenitz commented 8 months ago

Thank you for the quick reply!

I already tried this, but unfortunately I get an error like this:

FileNotFoundError: [Errno 2] No such file or directory: 'ABSOLUTEPATH\\.neptune\\async\\project__9701b6a4-d310-4f5f-a6e0-7827a05c1e78\\exec-1708349077.259059-2024-02-19_14.24.37.259059-5884\\upload_path\\data_dummy_data-1708349077.32419-2024-02-19_14.24.37.324190.pkl'

I used this code:

import neptune
from neptune.types import File

project = neptune.init_project()
data = {"a": 0, "b": 1}
project["data/dummy_data"].upload(File.as_pickle(data))

The project folder exists, but exec-1708349077 does not.

SiddhantSadangi commented 8 months ago

This was a bug in neptune<0.19. Could you update neptune to the latest version using pip install -U neptune and try again?

Ulipenitz commented 8 months ago

Sorry, I did not realize that I was not running the newest version. It works now, and your proposed solution works too! Thanks for the help! :-)

SiddhantSadangi commented 8 months ago

Perfect πŸŽ‰

I'll keep the thread open in case the product team needs further details πŸš€

Ulipenitz commented 8 months ago

Quick update: Initially I tested with a subset of the data, but with the big dataset I get this error:

----NeptuneFieldCountLimitExceedException---------------------------------------------------------------------------------------

There are too many fields (more than 9000) in the [PROJECTNAME] project.
We have stopped the synchronization to the Neptune server and stored the data locally.

I will try to chunk the data so that I don't exceed this limit (roughly as sketched below), but this workaround adds more complexity to our project. It would be great to have bigger limits for bigger datasets.
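For reference, the chunking I have in mind would look roughly like this (the chunk size, the rows iterable, and the field layout are just placeholders):

import neptune
from neptune.types import File

project = neptune.init_project()

CHUNK_SIZE = 10  # rows per uploaded pickle; 70,000 rows -> 7,000 fields, under the 9,000-field limit

chunk = []
chunk_idx = 0
for row in rows:  # rows: the in-memory dataset iterator
    chunk.append(row)
    if len(chunk) == CHUNK_SIZE:
        project[f"data/train/{chunk_idx}"].upload(File.as_pickle(chunk))
        chunk = []
        chunk_idx += 1

if chunk:  # upload any remaining rows
    project[f"data/train/{chunk_idx}"].upload(File.as_pickle(chunk))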