seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied to chemical SMILES data for drug design, chemical modelling, etc.
MIT License
389 stars 60 forks

Using rawtext #54

Closed RoryGarlandGlam closed 2 years ago

RoryGarlandGlam commented 2 years ago

Loving the work here!

I've been trying to use the classification RoBERTa with pubchem_1k_smiles.txt via

if __name__ == "__main__":
    # Run from inside chemberta/utils/raw_text_dataset.py (see the traceback
    # below), so RawTextDataset is already in scope.
    smiles_data = "pubchem_1k_smiles.txt"
    smiles_token = prebuilt_smiles_tokenizer("vocab.txt")

    example_dataset = RawTextDataset(smiles_token, smiles_data, block_size=512)

and I get the following error:

Traceback (most recent call last):
  File "/Users/rorygarland/Work/bert-loves-chemistry/chemberta/utils/raw_text_dataset.py", line 234, in <module>
    example_dataset = RawTextDataset(smiles_token, smiles_data, block_size=512)
  File "/Users/rorygarland/Work/bert-loves-chemistry/chemberta/utils/raw_text_dataset.py", line 37, in __init__
    self.dataset = load_dataset("text", data_files=data_files)["train"]
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 865, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False):
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/mmdeacon-l_Axq8W4-py3.8/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/datasets/text/c3b177069f0fad4da737a020bb39bbdb7aa16992e1f401e4347568618c906e28/text.py", line 95, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 1217, in pyarrow._csv.read_csv
  File "pyarrow/_csv.pyx", line 1221, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ParseOptions: delimiter cannot be \r or \n

My versions:

nlp: 0.4.0
pyarrow: 8.0.0

Is this a version issue or am I doing something very silly?
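For context, from the traceback the failure looks like it comes from the old nlp text loader rather than the SMILES file itself. Here is a minimal sketch of what I assume it ends up passing to pyarrow (the newline delimiter is my guess from the error message, not something I have confirmed in the nlp source):

import io
import pyarrow.csv as pac

# Sketch: read a couple of SMILES lines the way the nlp 0.4.0 "text" builder
# appears to, i.e. with "\n" as the CSV delimiter (assumption).
data = io.BytesIO(b"CCO\nc1ccccc1\n")
try:
    opts = pac.ParseOptions(delimiter="\n")
    pac.read_csv(data, parse_options=opts)
except Exception as err:
    print(err)  # on pyarrow 8.0.0: ParseOptions: delimiter cannot be \r or \n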

RoryGarlandGlam commented 2 years ago

Update

I moved from using

nlp.load_dataset

to

datasets.load_dataset()

and this appears to have worked.
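For anyone hitting the same error, this is roughly what the working call looks like with the datasets library (a sketch; the file name matches my snippet above, and the printed row is only illustrative):

from datasets import load_dataset  # instead of nlp.load_dataset

smiles_data = "pubchem_1k_smiles.txt"
dataset = load_dataset("text", data_files=smiles_data)["train"]
print(dataset[0])  # e.g. {'text': '<first SMILES line of the file>'}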