labdmitriy commented 2 years ago

Code of Conduct

Description

Add group time series cross-validator implementation. Add tests with 100% coverage using pytest. I decided to create pull request before creating documentation and change log modification, to discuss current implementation and further steps to implement.

Related issues or pull requests

Fixes #910

Pull Request Checklist

[x] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
[X] Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
[x] Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
[X] Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
[X] Checked for style issues by running flake8 ./mlxtend

labdmitriy commented 2 years ago

Sorry, mistakenly added 2 files (settings.json and conda_requirements.txt), removed it using additional commits.

labdmitriy commented 2 years ago

Hi Sebastian,

During the implementaion of requested changes, I have some questions:

When I changed something in my feature branch, and upstream/master is changed while the feature development, should I merge master branch to my feature branch even if I don't have any conflict changes? Sorry if it is basic question, but I tried to search best practices and didn't find common solution for it. I tried to rebase but as I understand it is not good idea when I already push my feature branch to public, and also I had diverged branches after that. So I decided to just change my code and commit/push it without any merging and rebasing now.
sklearn has the following statement about __init__() method in custom estimators: "There should be no logic, not even input validation, and the parameters should not be changed." In the current implementation of GroupTimeSeriesSplit, in init() method I checked only parameters that do not require any calculations, and the rest of the logic I implemented in split() method with invocation of the self._calculate_split_params() after these checks. I think that it looks pretty weird. What do you think, what will the best option here:
- To move all checks including self._calculate_splitparams() to __init_\() method
- To move all checks to split() method
- To keep everything "as is" Update: I remember that groups are used only in split() method, not __init__(), that's why I implemented several checks in split() method.
Where is the best place to discuss this feature - here in pull request or in related issue?
When you have time, will it be possible to see the questions from related issue? Probably I had too many questions, and I am ready to group the questions somehow if it is required.

Thank you.

pep8speaks commented 2 years ago

Hello @labdmitriy! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2022-05-25 06:50:32 UTC

labdmitriy commented 2 years ago

I've implemented additional changes:

Lessen the restrictions for the group sequences. Now it is not required to be sorted in increasing order, only to be consecutive. The reason is that now it can accept not only group numbers but also group names, where sorting can be confusing.
Add several tests for group names.
I saw that you reformatted the code with black8, and I decided to do this for PR code. It seems that now 79 chars is required max line length (corresponding to pep8speaks bot messages), so I changed default black8 line length value from default (88) to 79 and reformatted it again.

labdmitriy commented 2 years ago

There are several updates of __init__.py because imports are automatically sorted by VS Code (using built-in isort tool) whereas imports in the file are not sorted, so for not changing your imports order I did it separately in plain text editor. Should I resolve conflicts myself in this file?

rasbt commented 2 years ago

Thanks a lot for the PR! Sorry, I recently got really swamped with tasks due to some paper reviews and following up on other PRs! I will hopefully have more time again soon!

When I changed something in my feature branch, and upstream/master is changed while the feature development, should I merge master branch to my feature branch even if I don't have any conflict changes? Sorry if it is basic question, but I tried to search best practices and didn't find common solution for it. I tried to rebase but as I understand it is not good idea when I already push my feature branch to public, and also I had diverged branches after that. So I decided to just change my code and commit/push it without any merging and rebasing now.

This is a good point. Unless it affects your code, there is nothing to worry about. GitHub will automatically flag conflicts if they appear and we can try to deal with them then in the webinterface.

Btw I resolved one of the issues, and you may have to pull on your end.

sklearn has the following statement about init() method in custom estimators: "There should be no logic, not even input validation, and the parameters should not be changed." In the current implementation of GroupTimeSeriesSplit, in init() method I checked only parameters that do not require any calculations, and the rest of the logic I implemented in split() method with invocation of the self._calculate_split_params() after these checks. I think that it looks pretty weird. What do you think, what will the best option here

That's a tricky one. Personally, I feel like it's fine to check e.g., input arguments etc. It's not strictly scikit-learn consistent, but I don't see any harm in it to be honest. So, I am just checking the BootstrapOutOfBag here and I also have an init check here. I don't recall it ever causing any issues (https://github.com/rasbt/mlxtend/blob/master/mlxtend/evaluate/bootstrap_outofbag.py#L44).

Where is the best place to discuss this feature - here in pull request or in related issue?

I'd say it's best to discuss it here so that we don't have to jump back and forth too much.

When you have time, will it be possible to see the questions from related issue? Probably I had too many questions, and I am ready to group the questions somehow if it is required.

No worries, it's all good! :). I try to keep on top of things, but I am a little bit time constrained this week.

rasbt commented 2 years ago

Good point regarding the pep8/black discrepancy. I was thinking about this a bit, but maybe it's just time to adjust to more modern times and use the 88 character limit rather than the 79.

Then, if users are strict about 79 chars, it should still be okay when they submit it. On the other hand, if they are a the 88 lines, they don't get a complain either. I think this will make the PRs also a bit more frictionless while still maintaining recommended styles. I will adjust the pep8 checker via #920

labdmitriy commented 2 years ago

Hi Sebastian,

Thanks a lot for the PR! Sorry, I recently got really swamped with tasks due to some paper reviews and following up on other PRs! I will hopefully have more time again soon!

No problem, I will also have more free time 2 next weeks, so I can answer more quickly too.

Btw I resolved one of the issues, and you may have to pull on your end.

I guess you mean black reformatting and resolving conflict in __init__.py - I will pull it in my local feature branch.

That's a tricky one. Personally, I feel like it's fine to check e.g., input arguments etc. It's not strictly scikit-learn consistent, but I don't see any harm in it to be honest. So, I am just checking the BootstrapOutOfBag here and I also have an init check here. I don't recall it ever causing any issues (https://github.com/rasbt/mlxtend/blob/master/mlxtend/evaluate/bootstrap_outofbag.py#L44).

Great! Then I will keep it 'as is'.

I'd say it's best to discuss it here so that we don't have to jump back and forth too much.

Great, ok!

No worries, it's all good! :). I try to keep on top of things, but I am a little bit time constrained this week.

Ok, thank you!

I also noticed that lint checking was not successful after your last merge to origin/group-times-series, I think because of 2 reasons:
- After linting with black8 with default parameters, my reformatting with line length = 79 was changed again to 88.
- You deleted trailing comma in __init__.py, but black8 intentionally adds a trailing comment in this case (and I added it also in one of the previous commits): https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#trailing-commas
So because you changed pep8 configuration to 88, I reconfigured black/flake8 in my configuration and reformatted files again (including trailing comma) - and now it was checked successfully.
Are you considering the option to automatically sort the imports using isort?
For my projects I use pre-commit hooks for autoformatting/linting, it seems very convenient and probably reduces number of review iterations. Maybe such predefined configuration will be useful for contributors?
Do I need to prepare something for the next implementation steps until you have more free time?
I noticed that there are some typos in Quick Contributor Checklist after updating to black:
1. [x] Check the autimated tests passed.
2. [x] The atuomatic PEP8/black integrations may prompt you to modify the code stylistically. It would be nice if you could apply the suggested changes.

Thank you.

review-notebook-app[bot] commented 2 years ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

labdmitriy commented 2 years ago

I decided to experiment with documentation generation and have the following results and notes:

I prepared Jupyter notebook with successful and failed cases which are consistent with test cases. I think it can be used as a documentation draft. Also I found that if we install black[jupyter] then black will also format Jupyter Notebooks, so I used black formatting also in notebook (now repository CI skips it).
To run make_api.py correctly I should:
- Install libraries: dlib / cmake / imageio / scikit-image
- Download http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2 (62 MB size)
To run ipynb2markdown.py correctly I should:
- If I should run it for specific notebook, I should change argument from --ipynb_path specified in contribution guide to --ipynb, because there is no --ipynb_path argument for this script now
- Run command not only for the new notebook but also for all notebooks, to have .md files for all notebooks (otherwise there will be errors while executing mkdocs serve as the next step): python ipynb2markdown.py -a . It will generate .md files and also _files directories for images, and I found that _files are in version control, whereas .md do not. As I understand _files directories do not make sense without *.md files, so I haven't added a folder yet.
I added required lines to mkdocs.yml and USER_GUIDE_INDEX.md
After all these steps, I executed mkdocs serve and can find the section in the correct location in the menu.

Probably I didn't make all required steps, but I hope it will be useful as the base for the next steps and we can discuss which content is required for this notebook.

rasbt commented 2 years ago

There are several updates of init.py because imports are automatically sorted by VS Code (using built-in isort tool) whereas imports in the file are not sorted, so for not changing your imports order I did it separately in plain text edito

Oh, sorry that you thought you had to preserve it and had to go through that extra step. I think the sorted imports are actually nice. PS: Had no idea this plugin existed for VSCode and I just installed it myself ;)

rasbt commented 2 years ago

I also noticed that lint checking was not successful after your last merge to origin/group-times-series, I think because of 2 reasons: After linting with black8 with default parameters, my reformatting with line length = 79 was changed again to 88.

I just added the black formatting a few days ago and haven't noticed that yet, but you are absolutely right, that's an annoying issue. I reconfigured the pep8 checker to 88 chars now so that the black and pep8 linter don't conflict with each other.

Perhaps the pep8 linter can even be removed. Or the black linter. I just wanted to use both for a while to see which one is the better option. Do you have any ideas and feedbacl on this?

You deleted trailing comma in init.py, but black8 intentionally adds a trailing comment in this case (and I added it also in one of the previous commits):

Oh I am sorry! My bad!

For my projects I use pre-commit hooks for autoformatting/linting, it seems very convenient and probably reduces number of review iterations. Maybe such predefined configuration will be useful for contributors?

Actually that's a good call. Right now, I have my VSCode setup to do that on "save" but I think that pre-commit hooks are probably a better option. I think that would be good to add to the contributor list if we keep black. Otherwise it can get annoying I think (you are the first tester of black here :P, sorry for the trouble)

rasbt commented 2 years ago

I noticed that there are some typos in Quick Contributor Checklist after updating to black: Check the autimated tests passed. The atuomatic PEP8/black integrations ?> may prompt you to modify the code stylistically. It would be nice if you could apply the suggested changes.

Ahh thanks, I will fix them shortly. The problem is my markdown editor has no spell checking at the moment 😅

rasbt commented 2 years ago

Do I need to prepare something for the next implementation steps until you have more free time?

I would say: no worries for now! I just read your excellent Medium guide, and next I still need to take a look at the code files in the next few days :)

Also I found that if we install black[jupyter] then black will also format Jupyter Notebooks, so I used black formatting also in notebook (now repository CI skips it).

I forgot how I set it up, but yeah I have a Jupyter Lab extension now that formats notebooks as well, which is very nice! Just curious, it sounds like the mlxtend CI complained about notebook formatting before? Maybe I should add a note about that in the contributor guide then. Or did you mean that it generally skips notebooks for now?

To run make_api.py correctly I should: ...

I think this is unfortunately necessary. I think it would be nice to have a documentation-requirements.txt. But you don't have to build the make_api.py to be honest. It's really just for the last cell in the notebook, and I can take care of that. Once we have the notebook ready. No worries. Regarding dlib though, I think it was usually painful to install, but I think I installed that relatively painlessly with conda-forge recently, and I didn't need to download anything manually.

If I should run it for specific notebook, I should change argument from --ipynb_path specified in contribution guide to --ipynb, because there is no --ipynb_path argument for this script now

Wow, another good catch. Will update this! Yeah, it should just be just

python ipynb2markdown.py --ipynb path/to/the/notebook.ipynb

Probably I didn't make all required steps, but I hope it will be useful as the base for the next steps and we can discuss which content is required for this notebook.

No worries in case things are missing, I can always help with that.

It will generate .md files and also _files directories for images, and I found that _files are in version control, whereas .md do not. As I understand _files directories do not make sense without .md files, so I haven't added a folder yet.

Hmmm, it's been a long time, but I think the _files were required for the website, but I now can't see why they would. They could potentially be excluded from version control, which also would help preventing bloating the repository size. I will look into this separately some time and see if I can remove that from .git without affecting the documentation.

rasbt commented 2 years ago

Let me know if I forgot to address anything :)

labdmitriy commented 2 years ago

Hi Sebastian,

Oh, sorry that you thought you had to preserve it and had to go through that extra step. I think the sorted imports are actually nice. PS: Had no idea this plugin existed for VSCode and I just installed it myself ;)

Yes, I also found it after installing myself many times :) But I still install it explicitly to have environment which not depends on IDE, and I can use it for another purposes (for example, pre-commit hook).

I just added the black formatting a few days ago and haven't noticed that yet, but you are absolutely right, that's an annoying issue. I reconfigured the pep8 checker to 88 chars now so that the black and pep8 linter don't conflict with each other.

I also just started using black in your library :) I used yapf before. I've read the documentation for black, and found, that for flake8 it is better to also ignore E203 warning (to avoid conflicts with black rules): https://black.readthedocs.io/en/stable/guides/using_black_with_other_tools.html

Perhaps the pep8 linter can even be removed. Or the black linter. I just wanted to use both for a while to see which one is the better option. Do you have any ideas and feedbacl on this?

I think we have the same experience level with black :) But after using it in your library and reading documentation, I think I will move from yapf to black, because it seems to be more strict and popular, and also I like its formatting now (however before I didn't like it :)). yapf is great but it has several bugs which will create inconsistent formatting sometimes (e.g., nested brackets formatting). Probably you mean the choice between pep8 linter (pep8speaks?) and flake8 linter? Because black is PEP 8 compliant auto-formatter. If it is, as I understand from blogs and guides of experienced Python developers, the most popular commonly used tools for CI and pre-commit hooks are black/isort/flake8+plugins (and mypy for typing annotations).

Oh I am sorry! My bad!

I think there are no reasons for apologies, just my thoughts and notes that probably can be useful (or not :)). I am learning a lot from this contribution, so thank you for this opportunity :)

Actually that's a good call. Right now, I have my VSCode setup to do that on "save" but I think that pre-commit hooks are probably a better option. I think that would be good to add to the contributor list if we keep black. Otherwise it can get annoying I think (you are the first tester of black here :P, sorry for the trouble)

I also have "format on save" in VS Code, but pre-commit can help because even after saving there can be some problems that can't be resolved with formatting (and we forget to resolve some warnings). And it is what pre-commit hook can catch (because it can format/lint by many tools). But of course it can be optional step for contributors who can save the time :) So I think that "format on save"/linting in IDE, pre-commit hooks and CI can complement each other.

I would say: no worries for now! I just read your excellent Medium guide, and next I still need to take a look at the code files in the next few days :)

Great news, thank you :) I think I will update the article after the feature will be merged, and I am going to consider to migrate the article to GitHub pages which I've never used but more and more blogs are based on it. I have idea to continue writing the articles there.

I forgot how I set it up, but yeah I have a Jupyter Lab extension now that formats notebooks as well, which is very nice! Just curious, it sounds like the mlxtend CI complained about notebook formatting before? Maybe I should add a note about that in the contributor guide then. Or did you mean that it generally skips notebooks for now?

Probably you mean 3rd party Jupyter Lab extension like nb_black, but in this case I used black feature to format notebooks: https://black.readthedocs.io/en/stable/change_log.html?#id31 I saw this hint in the logs of your GitHub actions workflow (example: https://github.com/rasbt/mlxtend/runs/6267367514?check_suite_focus=true#step:3:33), that's why I tried it. I will try to use this feature for a while to understand will it be useful in practice :)

I think this is unfortunately necessary. I think it would be nice to have a documentation-requirements.txt. But you don't have to build the make_api.py to be honest. It's really just for the last cell in the notebook, and I can take care of that. Once we have the notebook ready. No worries. Regarding dlib though, I think it was usually painful to install, but I think I installed that relatively painlessly with conda-forge recently, and I didn't need to download anything manually.

It was not painful to install, just the notes :) So maybe you will decide it to include these libraries for contributors.

Thank you.

labdmitriy commented 2 years ago

Run command not only for the new notebook but also for all notebooks, to have .md files for all notebooks (otherwise there will be errors while executing mkdocs serve as the next step): python ipynb2markdown.py -a . It will generate .md files and also _files directories for images, and I found that _files are in version control, whereas .md do not. As I understand _files directories do not make sense without *.md files, so I haven't added a folder yet.

In this statement I mean that to check documentation after adding new sections, new contributor should run this command to convert all notebooks to markdown files (which do not exist initially), not just convert new notebook. Otherwise it will not be rendered correctly. Just the note from new contributor :)

rasbt commented 2 years ago

Probably you mean the choice between pep8 linter (pep8speaks?)

Ah yes, that's what I meant :). I think pep8speaks can potentially be removed if black works well as a github workflow. Btw. I added the notes about the pre-commit hooks.

Probably you mean 3rd party Jupyter Lab extension like nb_black, but in this case I used black feature to format notebooks: https://black.readthedocs.io/en/stable/change_log.html?#id31

Oh yes, I need to look into this then. This sounds useful as an additional Gh workflow.

In this statement I mean that to check documentation after adding new sections, new contributor should run this command to convert all notebooks to markdown files (which do not exist initially), not just convert new notebook. Otherwise it will not be rendered correctly. Just the note from new contributor :)

I see what you mean. I think in most cases, just modifying the notebook is sufficient. But if someone wants to view it via mkdocs, that would require converting all files to .md. Another thing to make more clear in the contributor guide.

Actually, this whole discussion is super helpful. I just noticed how out of date the contributor guide is. I don't use it myself as I usually know how to make the changes here since I have done that many times. And I think other contributors never really bothered reading it, lol.

labdmitriy commented 2 years ago

Let me know if I forgot to address anything :)

Probably the questions in related issue which I initially had, but there are many questions :)

labdmitriy commented 2 years ago

Hi Sebastian,

Could you please tell if there are plans to work with this pull request in the next few weeks? Just to plan the time for the other tasks.

By the way, during our discussion I thought that it will be interesting to research full pipeline of creating Python package, from IDE configuration and project structure to testing and CI, and create several articles about that. I had a similar experience in setting up projects at my job before, but I got some ideas on how to improve it. I want to explore this pipeline from scratch to deep dive (as far as I can) in each step. Are you interested in any kind of collaboration about it? Few days ago I've configured blog based on GithHub Pages and wrote the first short post for specific task and I'm planning to use it for the next articles.

Thank you.

rasbt commented 2 years ago

Sorry, I really need & want to get back to this! Somehow I have too many projects going on 😅. Let me review the PR today and give you some feedback.

I want to explore this pipeline from scratch to deep dive (as far as I can) in each step. Are you interested in any kind of collaboration about it?

This sounds super cool to be honest. Haha, but given that I currently already have too many things on which I am far behind, I should probably say "no." It's not that I am not interested, but I really need to finish things before I start new things!

labdmitriy commented 2 years ago

Sorry, I really need & want to get back to this! Somehow I have too many projects going on 😅. Let me review the PR today and give you some feedback.

I want to explore this pipeline from scratch to deep dive (as far as I can) in each step. Are you interested in any kind of collaboration about it?

This sounds super cool to be honest. Haha, but given that I currently already have too many things on which I am far behind, I should probably say "no." It's not that I am not interested, but I really need to finish things before I start new things!

Ok, no problem.

What I had in mind was something like

Is it something different from the test_cross_val_score function which I wrote before (the last function in the test file)? Could you please clarify which changes I should do?

Also I remember that I also usually use known_first_party parameter for isort to split developed package from the 3rd party packages, to be PEP8-compliant: https://peps.python.org/pep-0008/#imports

I understand that you have too much projects so it is not a problem at all, if you want we can continue with this pull request later and have a break now with it or even stop it at all, the goal was to collaborate with you and gain an experience in contributing, and of course this is not the thing with the highest priority :) This question was only to clarify your plans about this PR, by no means forcing you :)

rasbt commented 2 years ago

1) Ohhh, I was maybe looking at the old file. You are totally right! The new test function you added with cross_val_score is totally sufficient!

To be honest, the code looks really good to me now. The next thing on my list is to check out the Jupyter Nb then.

2)

Yeah, I personally also always (often) separated standard lib imports and 3rd party imports (when I don't forget). So, I like the idea of adding the known_first_party parameter. I can add that.

3)

I understand that you have too much projects so it is not a problem at all, if you want we can continue with this pull request later and have a break now with it or even stop it at all, the goal was to collaborate with you and gain an experience in contributing, and of course this is not the thing with the highest priority :)

Oh, I definitely don't want to drop this. It's a really nice and useful PR. I also learned a lot regarding black & isort. Very useful!

I will try to review the Nb either later tonight or tomorrow early morning to give some more feedback.

labdmitriy commented 2 years ago

To be honest, the code looks really good to me now. The next thing on my list is to check out the Jupyter Nb then.

Great! As I mentioned before, Jupyter notebook is just a draft and your feedback will be very useful because I don't know what is expected finally. Thank you!

Yeah, I personally also always (often) separated standard lib imports and 3rd party imports (when I don't forget). So, I like the idea of adding the known_first_party parameter. I can add that.

I made the comment in the relevant pull request about updated configuration, probably mlxtend was implied not biopandas.

Oh, I definitely don't want to drop this. It's a really nice and useful PR. I also learned a lot regarding black & isort. Very useful!

I will try to review the Nb either later tonight or tomorrow early morning to give some more feedback.

Excellent!

This sounds super cool to be honest. Haha, but given that I currently already have too many things on which I am far behind, I should probably say "no." It's not that I am not interested, but I really need to finish things before I start new things!

I've been thinking about it and would like to suggest returning to this task when you have time, if there is still interest for you. Perhaps by that time I will already have several articles on this topic, and we could update the process in your library. I think it might be useful for both of us.

rasbt commented 2 years ago

The documentation is a great start. It looks very comprehensive, and I love the plots. What's nice about them is that they are automatically generated, so that this allows us and the users to create the plots for all kinds of scenarios.

However, regarding the documentation of use cases, I don't think it needs to be exhaustive and show all possible ways you can use it. This would be very overwhelming.

My suggestions are to focus on a few, but give the users the tools to explore the other ones if they are interested. (I.e., if we have a few well explained examples, users can copy and modify them).

So, concretely, here are a few suggestions:

[ ] Move the helper function into a utility function into the library so that it can be imported without taking too much attention/space in the documentation. This will make the documentation more readable. You can move it into mlxtend.evalaute.time_series. These should be private methods though that shouldn't come up in the API documentation. I think that happens automatically as long as you don't add them in __init__.py.
[ ] It would be nice if each headline could tell the reader what the demonstration or use case is that they are reading about. I think there only need to be a few sections in my opinion. For example,

Your first example could be:

A time series cross-validation iterator with multiple training groups

Then, the second one could be

Defining the gap size between training and test folds
Expanding the window size

and that's it.

[ ] for each example, I would have the visualization followed by the a "Usage in CV" example. This way, readers don't have to read the whole docs in order to figure that out. Each example will be self-contained in this way.
[ ] I would get rid of the failure cases in the docs. They are more like unit tests I would say? The errors are descriptive enough to figure these issues out if users encounter them.
[ ] Once the general structure is there, it would be nice to add some more descriptive text

labdmitriy commented 2 years ago

Hi Sebastian,

Thanks a lot for your feedback, I will prepare the notebook based on your requirements and will push all the changes.

labdmitriy commented 2 years ago

Hi Sebastian,

I've made changes in Jupyter notebook and the code based on your comments, and have a few questions/notes:

Notes

Based on the fact that there are 2 combinations of required parameters (test_size + train_size and test_size + n_splits), I decided to include 4 examples in the notebook.
I decided to include plot, split and cv usage information for each example
I added the last cell to the notebook which includes API description (similar to another notebooks).
I have one trouble with highlighting when I print the split/cv info, and the word with the colon at the end is highlighted with red color.
I found similar issue on GitHub, but don't understand how to solve it in this case, could you please help with it? You can see it after you render the documentation.

Questions

Should I still add *_files directory for GroupTimeSeriesSplit?
I noticed that there are some differencies in isort/black configuration for pre-commit hook and CI:

isort Pre-commit hook: args: ["--profile", "black"] CI: isort --check --diff --line-length 88 --multi-line 3 --py 39 --profile black mlxtend/*

black Pre-commit hook: language_version: python3.9 CI: black --check --diff mlxtend/*

I checked default isort configuration for black profile:
```
multi_line_output: 3
include_trailing_comma: True
force_grid_wrap: 0
use_parentheses: True
ensure_newline_before_comments: True
line_length: 88
```
Probably line length and multiline settings are not required to specify it manually.
But perhaps Python versions can be inconsistent because isort and black have the following default Python version parameters (considering that the Python version 3.8 is specified in CI):

isort
```
--py {all,2,27,3,310,35,36,37,38,39,auto}, --python-version {all,2,27,3,310,35,36,37,38,39,auto}
                      Tells isort to set the known standard library based on the specified Python version. Default is to assume any
                      Python 3 version could be the target, and use a union of all stdlib modules across versions. If auto is specified,
                      the version of the interpreter used to run isort (currently: 38) will be used.
```
black
```
-t, --target-version [py33|py34|py35|py36|py37|py38|py39|py310]
                                Python versions that should be supported by
                                Black's output. [default: per-file auto-
                                detection]
```
Also to start using isort/black/flake8 rules for development, contributor should specify all the configurations from hook or CI manually (and also should install these libraries), to format on save in IDE.

Maybe configuration of the files like .isort.cfg, .black will be the solution have consistent rules everywhere? There are more modern approaches in configuration like pyproject.toml, but unfortunately it is still not supported by flake8. Here is a good description about configuration using different files.

Also I saw that for flake8 you have configuration file .flake8 that can be used for IDE, default configuration is used for pre-commit hook (as I understand from here, we need to specify arguments for hooks explicitly), and for CI the are another custom settings.

And in precommit configuration file you have the order black-flake8-isort, is it correct order (linting will be executed before all formatting steps are completed)? Maybe flake8 should be the last step?

Again I am asking too many questions 😄, but I think (and hope 🤞) it can be useful.

Thank you.

labdmitriy commented 2 years ago

Hi @rasbt,

Could you please tell do I need to improve something for this PR?

Thank you.

rasbt commented 2 years ago

I really like this restructured version! A few points that I think can be improved

1

Screen Shot 2022-05-20 at 3 26 20 PM

It's nice to start off with a general problem introduction (just as you did). However, considering that there is https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html, people might also be curious about the relationship to TimeSeriesSplit and the features GroupTimeSeriesSplit adds.

2

A few words about the example data would be helpful. Screen Shot 2022-05-20 at 3 29 05 PM

First, I would put the features and targets first, and then start with something like "For the following examples, we are creating an example dataset consisting of 16 training data points ...". And then you can explain that we create 6 different groups so that the first training example belongs to group 0, the next 4 to group 1, and so forth.

Btw. side questions about the implementation: do the groups do have to be consecutive order? Or could it be

groups = np.array([0, 5, 5, 5, 5, 2, 2, 2, 3, 3, 4, 4, 1, 1, 1, 1])

3

For each example here, it would also be nice to start with a few words describing what we are looking at here: Screen Shot 2022-05-20 at 3 35 02 PM

Otherwise it is pretty good and much more accessible than before! Thanks for the update!

rasbt commented 2 years ago

I have one trouble with highlighting when I print the split/cv info, and the word with the colon at the end is highlighted with red color. I found similar https://github.com/mkdocs/mkdocs/issues/902#issuecomment-211520759 on GitHub, but don't understand how to solve it in this case, could you please help with it? You can see it after you render the documentation.

Sure, let's revisit this when we have the final version. I can test this in my local mkdocs version then.

Should I still add *_files directory for GroupTimeSeriesSplit?

That would not be necessary. I will also remove the other folders in the future to save space on GitHub

Good points regarding the CI/Workflow setups. Regarding line-length I just wanted to be explicit about that as a visual aide (so that it is clear that this is 88 without knowing the defaults. I think some people are still new to black and expect it to be 79 perhaps.

I think I had some issues with multiline which is why I added it but I don't remember to be honest. Do you know if you can run it --py 39 in older Python versions without issue btw?

The inconsistency you mentioned refers to the missing --py 39 in .pre-commit-config.yaml?

labdmitriy commented 2 years ago

Hi @rasbt,

It's nice to start off with a general problem introduction (just as you did). However, considering that there is https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html, people might also be curious about the relationship to TimeSeriesSplit and the features GroupTimeSeriesSplit adds.

I described the advantages of this implementation over scikit-learn's TimeSeriesSplit

First, I would put the features and targets first, and then start with something like "For the following examples, we are creating an example dataset consisting of 16 training data points ...". And then you can explain that we create 6 different groups so that the first training example belongs to group 0, the next 4 to group 1, and so forth.

I create dataset with features and specify months as index to have more clear description for usage examples, therefore I decided to define groups and months before features and target. Now I reordered these sections due to your recommendations, and just set the index at the end of this code block.

Btw. side questions about the implementation: do the groups do have to be consecutive order?

They could be as in your example and I have corresponding tests and description in the first section of the notebook, and I supposed that this order (not in ascending order but with continuous group values) is "consecutive". But probably this description was not clear enough so I added additional correct/incorrect examples of the groups order. How do you think maybe there is another word better than 'consecutive' to describe it?

For each example here, it would also be nice to start with a few words describing what we are looking at here:

Done

Sure, let's revisit this when we have the final version. I can test this in my local mkdocs version then.

Thank you!

That would not be necessary. I will also remove the other folders in the future to save space on GitHub

Great!

I think I had some issues with multiline which is why I added it but I don't remember to be honest. Do you know if you can run it --py 39 in older Python versions without issue btw?

Unfortunately I don't have such experience but probably when I will progress in articles about development, I can tell more exact.

The inconsistency you mentioned refers to the missing --py 39 in .pre-commit-config.yaml?

Here I just tried to share my thoughts that CI and pre-commit hooks has probably little different configurations for isort and black, and the decision what to add or delete depends on the desired target configuration.

To be more succinct, my another notes were about the following:

The specified order in pre-commit hook is black-flake8-isort, but probably black-isort-flake8 will be more correct?
flake8 has 3 somewhat different configurations (.flake8, pre-commit hook and CI).

Thank you.

labdmitriy commented 2 years ago

@rasbt I also fixed list rendering for Overview section. Now it seems that I've made all the changes you mentioned.

labdmitriy commented 2 years ago

Hi @rasbt,

Could you please tell if there are no plans to finish this pull request over the next few weeks? If so then I will switch to another tasks and will come back to this PR in summer.

Thank you.

rasbt commented 2 years ago

Thanks for updating the docs, and thanks for your patience. I currently have too many ongoing projects, so I can't check in every day. So, if you want to revisit this in summer, I can totally understand. On the other hand, I think this PR is very close. It's just a bit of polishing the docs. I took a crack at it and did some minor rewording and adding a few sentences here and there. Maybe have a look if that looks good to you. And if you don't have any further feedback, I'd say it's good to merge after adding the Changelog entry.

Btw when going over the docs, there was mainly only one thing that I found a bit confusing. I.e., in example 4, I wasn't sure what exactly the expanding window size was?

E.g., here an image from Example 3:

Screen Shot 2022-05-24 at 9 41 47 AM

And an image from Example 4: Screen Shot 2022-05-24 at 9 42 07 AM

What's exactly expanded here? Do you mean that there are now 4 instead of 3 training groups (but this is specified) or is there something else I am missing?

My best guess was that you mean that the splits depend on the training and test sizes, so I moved this illustration up to example 1 where we talk about the train and test group sizes. I think this way it is a more natural reading order for the user. What do you think?

labdmitriy commented 2 years ago

Hi @rasbt,

Sorry if you feel that I am forcing you, it was not my intention. I thought that several last times I disturbed you, and decided to switch to another tasks until you have more free time. I will definitely answer to all the questions tomorrow and will stop with my endless questions. Sorry again.

Thank you.

rasbt commented 2 years ago

No worries! It's all good :). I am still very excited about this PR and sometimes just wish the day has more hours 😅

labdmitriy commented 2 years ago

Hi @rasbt,

Thank you for your update, your description is much clearer and cleaner than mine, of course it is good for me. I just fixed one my typo.

Btw when going over the docs, there was mainly only one thing that I found a bit confusing. I.e., in example 4, I wasn't sure what exactly the expanding window size was? What's exactly expanded here? Do you mean that there are now 4 instead of 3 training groups (but this is specified) or is there something else I am missing? My best guess was that you mean that the splits depend on the training and test sizes, so I moved this illustration up to example 1 where we talk about the train and test group sizes. I think this way it is a more natural reading order for the user. What do you think?

Probably I didn't understand your recommendation from here correctly:

Expanding the window size

I thought that it is about expanding the size of training or test dataset, but maybe you mean another type of window (expanding). Anyway Example 1 seems to be useful especially after you regroup the order. Now it is even better than I expected.

And if you don't have any further feedback, I'd say it's good to merge after adding the Changelog entry.

I added new entry to Changelog for GroupTimeSeriesSplit.
Just noticed that the section New Features and Enhancements is appears twice for 0.20.0 version so I added the description to the end of the list.

Thank you.

rasbt commented 2 years ago

Thanks, I think this PR is ready to merge now. Thanks so much for the hard and dedicated work on this! And sorry for me not being as responsive, it's a really busy time for me right now. However, I am super glad about this PR. I will maybe try to make even a new version release tonight so that it can be used.

Regarding the

Expanding the window size

comment. We can always fine-tune the documentation later. It's not coupled to the main code. But what I meant was that I found that example 4 was basically a natural extension of example 1, so I merged them. From a reader's perspective, I was thinking that this is more intuitive this way.

labdmitriy commented 2 years ago

Thanks, I think this PR is ready to merge now. Thanks so much for the hard and dedicated work on this! And sorry for me not being as responsive, it's a really busy time for me right now. However, I am super glad about this PR. I will maybe try to make even a new version release tonight so that it can be used.

Thank you for your patience while answering to all my questions, it was very useful and interesting experience!

But what I meant was that I found that example 4 was basically a natural extension of example 1, so I merged them. From a reader's perspective, I was thinking that this is more intuitive this way.

You are totally correct, it is much better now.

I you don’t mind, I will write the article about this experience, because I know a lot of people (like me until recently) thinking that contributing is too difficult to even try it. I will always be happy to discuss with you any questions and collaboration if you ever have free time and interest.

rasbt commented 2 years ago

Thank you for your patience while answering to all my questions, it was very useful and interesting experience!

Actually, this was really great. During this process, we added lots of useful things like pre-commit hooks, black, isort, etc. :)!

I you don’t mind, I will write the article about this experience, because I know a lot of people (like me until recently) thinking that contributing is too difficult to even try it.

Sure, I think this is worthwhile and will be interesting for many people!

I will always be happy to discuss with you any questions and collaboration if you ever have free time and interest.

Cool! I will keep that in mind!

rasbt / mlxtend

Group time series split #915

Code of Conduct

Description

Related issues or pull requests

Pull Request Checklist

Comment last updated at 2022-05-25 06:50:32 UTC

1

2

3