mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.09k stars 136 forks source link

Improve local temp directory error when only `remote` is specified #683

Closed snarayan21 closed 4 months ago

snarayan21 commented 4 months ago

Description of changes:

Improves the local temp directory error when only a remote path is specified. We want to make sure that if local is unspecified, we still create an entirely new temporary directory to hold the shard files, but if a user wants to reuse their locally cached dataset, they should explicitly specify the local argument.

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

karan6181 commented 4 months ago

Can we please add a unit test for the same? :D

snarayan21 commented 4 months ago

added unit test @karan6181

snarayan21 commented 4 months ago

@karan6181 @XiaohanZhangCMU I was trying to fix the test that I wrote, but I realized there's also a test_existing_local_raises_exception in test_stream.py that tests this already. Are we good with merging the PR without me adding an additional test then?

karan6181 commented 4 months ago

@karan6181 @XiaohanZhangCMU I was trying to fix the test that I wrote, but I realized there's also a test_existing_local_raises_exception in test_stream.py that tests this already. Are we good with merging the PR without me adding an additional test then?

Yes, it's okay to merge since the unit test is already in place. To make CI pass, you might want to change the exception type from ValueError to FileExistsError in test_existing_local_raises_exception().