modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.74k stars 651 forks source link

Improve experience for new contributors #4573

Open devin-petersohn opened 2 years ago

devin-petersohn commented 2 years ago

From #4571:

I will note, as a new contributor the number of hoops I have to jump through, as compared to contributing to say pandas, is annoying. Autogenerated release notes, and looser requirements on how to format commit messages and issue numbers etc would probably make me more likely to contribute in the future. Just my two cents, I'm sure there are reasons for the structure!

We should use this thread to discuss improvements to the developer experience.

cc @NickCrews

jbrockmendel commented 2 years ago

just forked: not obvious how to invoke the tests. pytest modin produces a bunch of errors

mvashishtha commented 2 years ago

just forked: not obvious how to invoke the tests. pytest modin produces a bunch of errors

@jbrockmendel thank you for your feedback. pytest modin also gives me many errors:

Show errors ``` ======================================== test session starts ======================================== platform darwin -- Python 3.9.12, pytest-7.1.2, pluggy-1.0.0 benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000) rootdir: /Users/maheshvashishtha/software_sources/modin-sql, configfile: setup.cfg plugins: benchmark-3.4.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 51733 items / 11 errors ============================================== ERRORS =============================================== ________ ERROR collecting modin/core/execution/dispatching/factories/test/test_dispatcher.py ________ modin/core/execution/dispatching/factories/dispatcher.py:129: in _update_factory cls.__factory = getattr(factories, factory_name) E AttributeError: module 'modin.core.execution.dispatching.factories.factories' has no attribute 'PandasOnFactory' During handling of the above exception, another exception occurred: modin/core/execution/dispatching/factories/test/test_dispatcher.py:19: in from modin.core.execution.dispatching.factories.dispatcher import ( modin/core/execution/dispatching/factories/dispatcher.py:312: in Engine.subscribe(FactoryDispatcher._update_factory) modin/config/pubsub.py:213: in subscribe callback(cls) modin/core/execution/dispatching/factories/dispatcher.py:150: in _update_factory raise FactoryNotFoundError( E modin.core.execution.dispatching.factories.dispatcher.FactoryNotFoundError: Cannot find a factory for partition 'Pandas' and execution engine ''. Potential reason might be incorrect environment variable value for MODIN_STORAGE_FORMAT or MODIN_ENGINE _ ERROR collecting modin/experimental/core/execution/native/implementations/omnisci_on_native/test/test_dataframe.py _ ImportError while importing test module '/Users/maheshvashishtha/software_sources/modin-sql/modin/experimental/core/execution/native/implementations/omnisci_on_native/test/test_dataframe.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: modin/experimental/core/execution/native/implementations/omnisci_on_native/utils.py:25: in from omniscidbe import PyDbEngine # noqa E ModuleNotFoundError: No module named 'omniscidbe' During handling of the above exception, another exception occurred: ../../opt/anaconda3/envs/modin-sql-dev/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) modin/experimental/core/execution/native/implementations/omnisci_on_native/test/test_dataframe.py:23: in from .utils import eval_io, ForceOmnisciImport, set_execution_mode, run_and_compare modin/experimental/core/execution/native/implementations/omnisci_on_native/test/utils.py:30: in from modin.experimental.core.execution.native.implementations.omnisci_on_native.omnisci_worker import ( modin/experimental/core/execution/native/implementations/omnisci_on_native/omnisci_worker.py:22: in from .utils import PyDbEngine modin/experimental/core/execution/native/implementations/omnisci_on_native/utils.py:27: in from dbe import PyDbEngine # noqa E ModuleNotFoundError: No module named 'dbe' _____________________ ERROR collecting modin/experimental/sql/test/test_sql.py ______________________ ImportError while importing test module '/Users/maheshvashishtha/software_sources/modin-sql/modin/experimental/sql/test/test_sql.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: ../../opt/anaconda3/envs/modin-sql-dev/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) modin/experimental/sql/__init__.py:17: in from dfsql import sql_query as query E ModuleNotFoundError: No module named 'dfsql' ________________________ ERROR collecting modin/pandas/test/test_groupby.py _________________________ ImportError while importing test module '/Users/maheshvashishtha/software_sources/modin-sql/modin/pandas/test/test_groupby.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: ../../opt/anaconda3/envs/modin-sql-dev/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) modin/pandas/test/test_groupby.py:266: in ["col1", pd.Series([1, 5, 7, 8])], # 15 modin/pandas/series.py:88: in __init__ Engine.subscribe(_update_engine) modin/config/pubsub.py:213: in subscribe callback(cls) modin/pandas/__init__.py:180: in _update_engine raise ImportError("Unrecognized execution engine: {}.".format(publisher.get())) E ImportError: Unrecognized execution engine: . _____________________ ERROR collecting modin/pandas/test/dataframe/test_iter.py _____________________ ImportError while importing test module '/Users/maheshvashishtha/software_sources/modin-sql/modin/pandas/test/dataframe/test_iter.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: ../../opt/anaconda3/envs/modin-sql-dev/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) modin/pandas/test/dataframe/test_iter.py:289: in pd.Series([1, 2, 3], dtype="int32"), modin/pandas/series.py:88: in __init__ Engine.subscribe(_update_engine) modin/config/pubsub.py:213: in subscribe callback(cls) modin/pandas/__init__.py:180: in _update_engine raise ImportError("Unrecognized execution engine: {}.".format(publisher.get())) E ImportError: Unrecognized execution engine: . ____________________ ERROR collecting modin/pandas/test/dataframe/test_pickle.py ____________________ ImportError while importing test module '/Users/maheshvashishtha/software_sources/modin-sql/modin/pandas/test/dataframe/test_pickle.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: ../../opt/anaconda3/envs/modin-sql-dev/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) modin/pandas/test/dataframe/test_pickle.py:43: in "modin_df", [pytest.param(modin_df), pytest.param(pd.DataFrame(), id="empty_df")] modin/pandas/dataframe.py:113: in __init__ Engine.subscribe(_update_engine) modin/config/pubsub.py:213: in subscribe callback(cls) modin/pandas/__init__.py:180: in _update_engine raise ImportError("Unrecognized execution engine: {}.".format(publisher.get())) E ImportError: Unrecognized execution engine: . _________________________ ERROR collecting modin/test/test_partition_api.py _________________________ modin/test/test_partition_api.py:26: in FactoryDispatcher.get_factory().io_cls.frame_cls._partition_mgr_cls._partition_class E AttributeError: 'function' object has no attribute '_partition_mgr_cls' _________ ERROR collecting modin/test/exchange/dataframe_protocol/omnisci/test_protocol.py __________ ImportError while importing test module '/Users/maheshvashishtha/software_sources/modin-sql/modin/test/exchange/dataframe_protocol/omnisci/test_protocol.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: modin/experimental/core/execution/native/implementations/omnisci_on_native/utils.py:25: in from omniscidbe import PyDbEngine # noqa E ModuleNotFoundError: No module named 'omniscidbe' During handling of the above exception, another exception occurred: ../../opt/anaconda3/envs/modin-sql-dev/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) modin/test/exchange/dataframe_protocol/omnisci/test_protocol.py:30: in from .utils import get_data_of_all_types, split_df_into_chunks, export_frame modin/test/exchange/dataframe_protocol/omnisci/utils.py:24: in from modin.experimental.core.execution.native.implementations.omnisci_on_native.test.utils import ( modin/experimental/core/execution/native/implementations/omnisci_on_native/test/utils.py:30: in from modin.experimental.core.execution.native.implementations.omnisci_on_native.omnisci_worker import ( modin/experimental/core/execution/native/implementations/omnisci_on_native/omnisci_worker.py:22: in from .utils import PyDbEngine modin/experimental/core/execution/native/implementations/omnisci_on_native/utils.py:27: in from dbe import PyDbEngine # noqa E ModuleNotFoundError: No module named 'dbe' ________________ ERROR collecting modin/test/storage_formats/cudf/test_internals.py _________________ import file mismatch: imported module 'test_internals' has this __file__ attribute: /Users/maheshvashishtha/software_sources/modin-sql/modin/test/storage_formats/base/test_internals.py which is not the same as the test file we want to collect: /Users/maheshvashishtha/software_sources/modin-sql/modin/test/storage_formats/cudf/test_internals.py HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules _______________ ERROR collecting modin/test/storage_formats/omnisci/test_internals.py _______________ import file mismatch: imported module 'test_internals' has this __file__ attribute: /Users/maheshvashishtha/software_sources/modin-sql/modin/test/storage_formats/base/test_internals.py which is not the same as the test file we want to collect: /Users/maheshvashishtha/software_sources/modin-sql/modin/test/storage_formats/omnisci/test_internals.py HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules _______________ ERROR collecting modin/test/storage_formats/pandas/test_internals.py ________________ import file mismatch: imported module 'test_internals' has this __file__ attribute: /Users/maheshvashishtha/software_sources/modin-sql/modin/test/storage_formats/base/test_internals.py which is not the same as the test file we want to collect: /Users/maheshvashishtha/software_sources/modin-sql/modin/test/storage_formats/pandas/test_internals.py HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules ---------- coverage: platform darwin, python 3.9.12-final-0 ---------- Coverage XML written to file coverage.xml ====================================== short test summary info ====================================== ERROR modin/core/execution/dispatching/factories/test/test_dispatcher.py - modin.core.execution.di... ERROR modin/experimental/core/execution/native/implementations/omnisci_on_native/test/test_dataframe.py ERROR modin/experimental/sql/test/test_sql.py ERROR modin/pandas/test/test_groupby.py ERROR modin/pandas/test/dataframe/test_iter.py ERROR modin/pandas/test/dataframe/test_pickle.py ERROR modin/test/test_partition_api.py - AttributeError: 'function' object has no attribute '_part... ERROR modin/test/exchange/dataframe_protocol/omnisci/test_protocol.py ERROR modin/test/storage_formats/cudf/test_internals.py ERROR modin/test/storage_formats/omnisci/test_internals.py ERROR modin/test/storage_formats/pandas/test_internals.py !!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 11 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ```

We do have instructions saying to use pytest modin/pandas/test here in the contributing guide. Do you think that instruction could go somewhere more obvious?

By the way, the entire test suite takes way too long on my laptop (and doesn't even cover all the operating systems and engines that the CI). When I make a change, I run the test files or even just the cases that I think are relevant, then open a PR and let Modin's Github CI Actions run all the test. Should the notes in this paragraph also be documented?

jbrockmendel commented 2 years ago

Do you think that instruction could go somewhere more obvious?

That's a good location, I just forgot to check it. Thanks.

I get 6 failures locally (on Mac). Will take a look and see if they already have issues.

FAILED modin/pandas/test/test_io.py::TestParquet::test_read_parquet_s3[object] - ModuleNotFoundError: No module named 's3fs'
FAILED modin/pandas/test/dataframe/test_udf.py::test_agg_apply[agg-str-1] - AssertionError: Series are different
FAILED modin/pandas/test/dataframe/test_udf.py::test_agg_apply[apply-str-1] - AssertionError: Series are different
FAILED modin/pandas/test/dataframe/test_udf.py::test_agg_apply_axis_names[agg-str-columns] - AssertionError: Series are different
FAILED modin/pandas/test/dataframe/test_udf.py::test_agg_apply_axis_names[apply-str-columns] - AssertionError: Series are diff...
FAILED modin/pandas/test/internals/test_benchmark_mode.py::test_syncronous_mode - AssertionError: assert False
YarShev commented 2 years ago

@jbrockmendel,

  1. to pass modin/pandas/test/internals/test_benchmark_mode.py::test_syncronous_mode you should set MODIN_BENCHMARK_MODE=True as we do here. https://github.com/modin-project/modin/blob/05933a5f27fb96f5a7ff6025ae2573d033a31b11/.github/workflows/ci.yml#L417

We should probably set the config value in the code to avoid such a confusion. What do you think?

  1. I saw the failures of modin/pandas/test/dataframe/test_udf.py::test_agg_apply as well some time ago when I ran pytest modin/pandas/test. However, when I ran pytest modin/pandas/test/dataframe/test_udf.py, the tests passed. We should investigate this behavior.

  2. Not sure why modin/pandas/test/test_io.py::TestParquet::test_read_parquet_s3 is failing on your side. It would be great if you take a closer look at it.

mvashishtha commented 2 years ago

Not sure why modin/pandas/test/test_io.py::TestParquet::test_read_parquet_s3 is failing on your side.

This looks like it's because @jbrockmendel doesn't have s3fs, which is in the dev requirements. @jbrockmendel , did you run pip install -r requirements-dev.txt?

vnlitvinov commented 2 years ago

We should probably set the config value in the code to avoid such a confusion. What do you think?

This should be useful IMHO.

Issue to generate release notes: https://github.com/modin-project/modin/issues/4747