securefederatedai / openfl

An open framework for Federated Learning.
https://openfl.readthedocs.io/en/latest/index.html
Apache License 2.0
716 stars 197 forks

Setting-up the federation -- fx plan init -- does not work #72

Closed rstoki closed 2 months ago

rstoki commented 3 years ago

Describe the bug

I am trying to set up a federation based on the '' following the documentation written here

The problem is that the command fx plan initialize (mentioned in point 7) fails because it checks for data folders that do not exist. In the default setup, it looks for a path (which seems to be a leftover from your development environment), and even after I specify local paths, it tries to look for them somewhere else.

To Reproduce

Steps to reproduce the behavior:

  1. Fresh install (Windows or Linux machine), fresh conda environment (named 'open-fl'), openfl package installed with pip
    • on Windows, the pip package was installed from source (branch develop, commit: 0412c82a56264e415615ef02466c9934c6428fda)
    • the fx command runs
  2. chosen template tf_2dunet
  3. Setting some custom configuration: export WORKSPACE_TEMPLATE=tf_2dunet export WORKSPACE_PATH=${HOME}/projects/my-work/openfl-federations/federation_0.2
  4. changing directory to cd ${WORKSPACE_PATH}
  5. Running: fx workspace create --prefix ${WORKSPACE_PATH} --template ${WORKSPACE_TEMPLATE}
    • the command finishes successfully: the workspace is created and the requirements from requirements.txt are installed via pip.
  6. running pip install -r requirements.txt manually, as mentioned in point 6 of the tutorial, is not necessary => I would suggest that the fx command not update pip requirements.
  7. running the command fx plan initialize ends with the error: EXCEPTION : [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"
    • please see screenshot 1 below - screenshot from Linux machine
    • error log:

EXCEPTION : [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"
Traceback (most recent call last):
  File "/home/rstoklas/miniconda3/envs/open-fl/bin/fx", line 8, in <module>
    sys.exit(entry())
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 194, in entry
    error_handler(e)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 155, in error_handler
    raise error
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 192, in entry
    cli()
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/plan.py", line 78, in initialize
    task_runner = plan.get_task_runner(collaborator_cname)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 298, in get_task_runner
    defaults[SETTINGS]['data_loader'] = self.get_data_loader(
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 286, in get_dataloader
    self.loader = Plan.Build(defaults)
  File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 173, in Build
    instance = getattr(module, class_name)(**settings)
  File "/home/rstoklas/projects/my-work/openfl-federations/federation_0.2/code/tfbrats_inmemory.py", line 29, in __init__
    X_train, y_train, X_valid, y_valid = load_from_NIfTI(parent_dir=data_path,
  File "/home/rstoklas/projects/my-work/openfl-federations/federation_0.2/code/brats_utils.py", line 93, in load_from_NIfTI
    subdirs = os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"

  8. Even when I change the paths in plan/data.yaml to point to existing directories, it fails:
    • Exception:
    • Please see screenshot #2
    • Error message:

{'01-win': 'data/client-01', '02-pegas': 'data/client-02', '03-pegas': 'data/client-03'}
INFO Building 🡆 Object TensorFlowBratsInMemory from code.tfbrats_inmemory Module. plan.py:168
INFO Settings 🡆 {'batch_size': 64, 'percent_train': 0.8, 'collaborator_count': 2, 'data_group_name': 'brats', 'data_path': 'data/client-01'} plan.py:171
INFO Override 🡆 {'defaults': 'plan/defaults/data_loader.yaml'} plan.py:173
EXCEPTION : need at least one array to concatenate
Traceback (most recent call last):
  File "c:\anaconda3\envs\open-fl\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\anaconda3\envs\open-fl\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Anaconda3\envs\open-fl\Scripts\fx.exe\__main__.py", line 7, in <module>
  File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 194, in entry
    error_handler(e)
  File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 155, in error_handler
    raise error
  File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 192, in entry
    cli()
  File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\anaconda3\envs\open-fl\lib\site-packages\click\decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "C:\Anaconda3\envs\open-fl\Lib\site-packages\openfl\interface\plan.py", line 77, in initialize
    data_loader = plan.get_data_loader(collaborator_cname)
  File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\federated\plan\plan.py", line 293, in get_dataloader
    self.loader = Plan.Build(defaults)
  File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\federated\plan\plan.py", line 179, in Build
    instance = getattr(module, class_name)(**settings)
  File "C:\Users\rstoklas\cernbox\work\my-projects\FL-phase-3_network\federation-0.1\code\tfbrats_inmemory.py", line 29, in __init__
    X_train, y_train, X_valid, y_valid = load_from_NIfTI(parent_dir=data_path,
  File "C:\Users\rstoklas\cernbox\work\my-projects\FL-phase-3_network\federation-0.1\code\brats_utils.py", line 125, in load_from_NIfTI
    imgs_train = np.concatenate(imgs_all_train, axis=0)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
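The settings in the log above pass relative paths such as 'data/client-01', so where they resolve depends on the directory fx is launched from. A quick pre-flight check can show where each path actually points and whether it has any subdirectories. This is only a minimal sketch: the helper check_data_paths and its workspace argument are hypothetical and not part of OpenFL, and the mapping is copied by hand from the log rather than parsed from plan/data.yaml.

```python
from pathlib import Path


def check_data_paths(data_paths, workspace="."):
    """Resolve each collaborator's data path and report what is found there.

    `data_paths` is a plain dict of collaborator name -> path. This helper is
    hypothetical (not an OpenFL API); it only inspects the file system.
    """
    root = Path(workspace).resolve()
    report = {}
    for name, raw in data_paths.items():
        path = Path(raw)
        if not path.is_absolute():
            # Relative paths are resolved against the workspace directory.
            path = root / path
        exists = path.is_dir()
        subdirs = [p for p in path.iterdir() if p.is_dir()] if exists else []
        report[name] = {
            "resolved": str(path),
            "exists": exists,
            "subdirs": len(subdirs),
        }
    return report


if __name__ == "__main__":
    # Mapping copied from the log output above.
    paths = {
        "01-win": "data/client-01",
        "02-pegas": "data/client-02",
        "03-pegas": "data/client-03",
    }
    for name, info in check_data_paths(paths).items():
        print(name, info)
```

Run from the workspace root, a row with exists False or subdirs 0 points at a path or layout problem before fx plan initialize is started.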

Expected behavior

  1. I would expect all steps in the tutorial to succeed.
  2. I would expect to end up with a working federated environment at the end of the tutorial.
  3. I would expect the setup tools not to require access to the data (since the setup is performed on the aggregator, and the data are on the nodes, which the aggregator cannot access).

Screenshots

Error with defaults paths: 2021-05-07 - OpenFL plan init failed

Error with modified and correct paths: 2021-05-07 - OpenFL plan init failed-2

alexey-gruzdev commented 3 years ago

@rstoki thanks for pointing this out! It looks like in your case you don't have the BraTS dataset, and that's why fx plan init fails. The BraTS dataset cannot be downloaded automatically by our scripts (as is done for the other samples) because of legal issues: you need to register to obtain it. @itrushkin added a special note to the tf_2dunet example to produce more meaningful output.

rstoki commented 3 years ago

Hello @alexey-gruzdev, thank you for your reply, but I am afraid that the proposed solution does not fully cover the problem. The fact is that I have the BraTS dataset downloaded (2018, 2019 and 2020). I have also copied the data folders to the proper place (in my opinion).

The problems are:

  1. the default data paths in data.yaml (as shipped in the template) point to /raid/datasets/BraTS17/by_institution_NIfTY/*
  2. the expected "data format" is not obvious, so a proper description/explanation should probably be added to the documentation
  3. why are the data directories checked on the aggregator? What if no data is available on the aggregator at all? How can one set up a federation then?

So what needs to be resolved:

ad 1. How will the user learn that he needs to change the paths in the data.yaml file? How will the user know that he should expect the first fx plan initialize launch to fail, then change the content of data.yaml, and then run fx plan initialize again? Or why not use a local <workspace>/data directory path in the template's data.yaml? The instructions should also explicitly state where (and how) the BraTS data should be placed.

ad 2. I have created a local structure in the '/data/*' directory; you can review the structure in the attached screenshot. Evidently, this is not recognized as a valid data structure, and the newly added message "{parent_dir} does not contain subdirectories." does not help the user. So what is the exact expected data format? Could it be described in the documentation and/or in the messages shown to users?

ad 3. Have you tested the tutorial setup for the use case in which the aggregator does not possess any data? Could a description of how to set up the federation in that use case be added to the documentation? How can one work around the total failure (crash) of fx plan initialize in that case? Maybe you could add a new flag to fx plan initialize that disables the check for data existence on the aggregator?

I would suggest re-opening the ticket, as it is (IMHO) not fully solved yet.

Screenshot of my data-folder structure, which I believe should be sufficient. Each BraTS20_Training_* folder is a 1:1 copy of the respective folder from the BraTS 2020 dataset (i.e., containing 5 *.nii.gz files).
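The layout described here (one subfolder per case, each holding 5 *.nii.gz files) can be verified with a short script before running fx plan initialize. This is a minimal sketch under stated assumptions: the helper summarize_client_dir and its parameters are hypothetical and not part of OpenFL, the default of 5 files per case comes only from the description above, and the thread does not confirm that the tf_2dunet loader accepts this layout.

```python
from pathlib import Path


def summarize_client_dir(client_dir, expected_files=5, pattern="*.nii.gz"):
    """Count files matching `pattern` in every case subfolder of `client_dir`.

    Returns (ok, problems): lists of (case_name, file_count) for case folders
    that do / do not contain exactly `expected_files` matching files.
    Hypothetical helper, not an OpenFL API.
    """
    client = Path(client_dir)
    if not client.is_dir():
        raise FileNotFoundError(f"{client} is not a directory")

    ok, problems = [], []
    for case in sorted(p for p in client.iterdir() if p.is_dir()):
        count = len(list(case.glob(pattern)))
        target = ok if count == expected_files else problems
        target.append((case.name, count))
    return ok, problems


if __name__ == "__main__":
    # 'data/client-01' is the path used in the settings log earlier in this issue.
    try:
        ok, problems = summarize_client_dir("data/client-01")
    except FileNotFoundError as err:
        print(err)
    else:
        print(f"{len(ok)} case folders look complete")
        for name, count in problems:
            print(f"{name}: found {count} matching files")
```

If both lists come back empty, the client directory has no subdirectories at all, which matches the situation the "{parent_dir} does not contain subdirectories." message reports.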

suleimank commented 3 years ago

Having the exact same problem. Can someone please clarify what the 'right' structure for placing the BraTS data is?

dskhanirfan commented 3 years ago

Hi, I also get the same error with plan/data.yaml: FileNotFoundError: [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'". I changed the path but it still does not work. Please elaborate on the correct directory structure and data folder paths.

itrushkin commented 3 years ago

@rstoki @suleimank @dskhanirfan Thank you for being interested in our project! Please see the instructions on how to run BraTS training in #99. You still have to apply changes from the PR to your local code for training to work. I have tested it locally with BraTS19. Feel free to report issues if you still have any.

itrushkin commented 3 years ago

In this case, collaborator data_paths must be .../federation-0.1/data/client-XX.

MasterSkepticista commented 2 months ago

Closing due to inactivity.