togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.53k stars 346 forks

[Errno 2] No such file or directory: 'cutoff.csv' #28

Closed Anery closed 1 year ago

Anery commented 1 year ago

Hi, I'm trying to run this test case: python3 -m cc_net --config config/test_segment.json but encountered the following error:

Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('test_data2'), mined_dir='mined_by_segment', execution='debug', num_shards=4, min_shard=-1, num_segments_per_shard=1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['de', 'it', 'fr'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=0, target_size='32M', cleanup_after_regroup=False, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment'], experiments=[], cache_dir=PosixPath('test_data/wet_cache'))
['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment']
2023-04-27 11:41 INFO 39932:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x11b5b77c0>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x11b5b7a30>, <cc_net.perplexity.MultiSentencePiece object at 0x11b5b78e0>, <cc_net.perplexity.DocLM object at 0x11b5b7970>, <cc_net.perplexity.PerplexityBucket object at 0x11b5b7a60>, <cc_net.minify.Minifier object at 0x11b5b7be0>]
/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try `pip install cc_net[getpy]
  warnings.warn(
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded hashes from test_data2/hashes/2019-09/0000.bin (0.700GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded 3_361_543 hashes from 1 files. (0.7GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:Classifier - Loading bin/lid.bin
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/de.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/de.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/it.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/it.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/fr.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/fr.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/de.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/de.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/it.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/it.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/fr.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/fr.arpa.bin (took 0.0min)
Traceback (most recent call last):
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 142, in debug_executor
    message = function(*x)
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
    jsonql.run_pipes(
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 432, in run_pipes
    transform = stack.enter_context(compose(transformers))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/contextlib.py", line 429, in enter_context
    result = _cm_type.__enter__(cm)
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
    self._prepare()
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 352, in _prepare
    t.__enter__()
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
    self._prepare()
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/perplexity.py", line 267, in _prepare
    cutoffs = pd.read_csv(self.cutoff_csv, index_col=0)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/common.py", line 859, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/Users/work/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'

Are there any possible reasons? Python 3.9.6 on macOS.

liugs0213 commented 1 year ago

here -> https://github.com/facebookresearch/cc_net/blob/main/cc_net/data/cutoff.csv
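The linked file can be fetched directly into the directory the config expects. A minimal sketch, assuming the standard raw.githubusercontent.com URL pattern for the link above, run from the `cc_net` checkout (adjust the target path to match the `cutoff` entry in your config):

```shell
# Download cutoff.csv from the upstream facebookresearch/cc_net repo
# into the data directory where the pipeline looks for it.
# The raw URL below is inferred from the GitHub link above (assumption).
mkdir -p cc_net/data
curl -L -o cc_net/data/cutoff.csv \
  https://raw.githubusercontent.com/facebookresearch/cc_net/main/cc_net/data/cutoff.csv
```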

Anery commented 1 year ago

Thanks! That fixed the missing file, but I hit another problem: when I set mine_num_processes greater than 1, it seems a lambda function cannot be pickled:

Traceback (most recent call last):
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 142, in debug_executor
    message = function(*x)
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
    jsonql.run_pipes(
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 439, in run_pipes
    multiprocessing.Pool(
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_mine_shard.<locals>.<lambda>'

Any suggestions would be helpful.
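The `AttributeError: Can't pickle local object` arises because `multiprocessing.Pool` must serialize the worker callable to child processes, and `_mine_shard` builds it with a lambda, which pickle cannot serialize by reference. On macOS the default start method has been `spawn` since Python 3.8, which always requires pickling. A minimal sketch of the failure and the usual workaround (the names here are illustrative, not from the cc_net code):

```python
import pickle
from functools import partial
from operator import mul

def can_pickle(obj):
    """Return True if obj survives pickling (what multiprocessing needs)."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# A lambda (like the one created inside _mine_shard) cannot be pickled,
# so multiprocessing's spawn start method fails exactly as in the
# traceback above.
double_lambda = lambda x: x * 2
assert not can_pickle(double_lambda)

# A functools.partial over an importable function is picklable and is
# the usual replacement for such lambdas.
double = partial(mul, 2)
assert can_pickle(double)
assert double(21) == 42
```

Replacing the lambda with a named top-level function or a `functools.partial`, or using the `fork` start method, typically avoids this class of error; in this thread the issue was ultimately sidestepped by running on Linux.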

mauriceweber commented 1 year ago

Hi @Anery, this might be due to a failed installation. Did the following steps run successfully for you (run from the cc directory)?

# Installation
cd cc_net
mkdir data

sudo apt-get update
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
make install
make lang=en dl_lm
Anery commented 1 year ago

Thanks for your reply. I'm running on macOS, so some of the packages are not installed. I'll try on Linux later.

Anery commented 1 year ago

It works well on Linux, I’ll close this issue. Thanks.