tokenize_and_cache cooks up wacky paths

eritain commented 3 years ago

Describe the bug Supplying a relative path to the data downloader lays a trap for tokenize_and_cache.py.

To Reproduce Call jiant/scripts/download_data/runscript.py to download some task data. Use a relative --output_path such as experiment/tasks.

Download a model, including its tokenizer.

Call jiant/proj/main/tokenize_and_cache.py to preprocess the task data for the model. Use a relative --task_config_path such as experiment/tasks/configs/taskname_config.json. It will die:

FileNotFoundError: [Errno 2] No such file or directory: 'experiment/tasks/configs/experiment/tasks/data/taskname/train.jsonl'

Expected behavior A clear and concise description of what you expected to happen.

tokenize_and_cache formulates the correct path experiment/tasks/data/taskname/train.jsonl.

Additional context Giving an absolute path to the downloader allows tokenize_and_cache to formulate the correct path and produce correct outputs. Hand-patching absolute paths into experiment/tasks/configs/taskname_config.json after the downloader creates it, but before tokenize_and_cache uses it, appears to work too.

At a minimum, or while working on a better solution, stick a warning on all examples of using the downloader, including README.md and guides/tutorials/quick_start_main.md. For extra credit, stick it in the source of both download_data/runscript.py and tokenize_and_cache.py as a comment. But the ideal thing would be to patch tokenize_and_cache to handle relative paths correctly. Forcing the downloader to build absolute paths before writing the task config would be OK too.

zphang commented 3 years ago

Hi @eritain (again), thanks so much for the detailed feedback. I took a look at the code, and I believe you are correct. The issue actually lies with jiant.tasks.retrieval.create_task_from_config, which seems to be handling relative paths improperly. Because relative paths do not currently fit within the typical workflow for jiant, and because we are planning to more broadly update the task/data handling portions of the code to rely more on HF datasets soon, I opted instead to go with your latter suggestion of forcing absolute paths (#1282, https://github.com/nyu-mll/jiant/blob/741ab09112e58039ec777c341d8267a8596cc82f/jiant/scripts/download_data/runscript.py#L47).

eritain commented 3 years ago

the typical workflow for jiant

And there's the crux of ... about half of the things I'm queueing up (and trying to desnark a little) to file as issues. There's a lot that's tacit about how Jiant is used, that a lab member at NYU would get by observation and hardly even notice that it could be otherwise. We strangers on the Internet have to either intuit that stuff, or mine the answers out of a large, sparesely commented codebase, or work it out by "drive by Braille."

(The crux of the other half of the prospective issues is proofreading.)

zphang commented 3 years ago

We have a particular workflow in mind when designing jiant, and while we try to convey this via our examples and notebooks, we can definitely be doing a better job at it. We are planning to overhaul and simplify parts of the pipeline as we shift to relying more heavily on HF libraries. We are also happy to give pointers and guidance if you're running into any issues on GitHub issues.

nyu-mll / jiant

tokenize_and_cache cooks up wacky paths #1281