Add group time series validation

labdmitriy commented 2 years ago

Describe the workflow you want to enable

Add support for cross-validation for sequences with continuous groups, especially for time series data.

Describe your proposed solution

Initially I wrote it for my personal tasks, but after that I decided to publish it on GitHub (https://github.com/labdmitriy/ml-lab), wrote an article about that (https://medium.com/@labdmitriy/advanced-group-time-series-validation-bb00d4a74bcc), and provided Jupyter Notebook with examples (https://github.com/labdmitriy/ml-lab/blob/master/notebooks/GroupTimeSeriesSplit.ipynb). Also @alexandrnikitin provided additional feature to support group names in addition to group numbers.

Describe alternatives you've considered, if relevant

Additional context

This is the feature which we discussed here

labdmitriy commented 2 years ago

Hi Sebastian,

I'm exploring contributing process based on your guide and trying to prepare environment for feature development. I found that you have requirements.txt for pip and environment.yml for conda with different list and versions of the packages (for example, pandas>=1.3.4 in environment.yml and pandas>=0.24.2 in requirements.txt). Also I check GitHub Actions, and check that you use python version 3.8 and conda environment based on "base" environment, environment.yml file and following conda/pip installations during the test (and some packages are the same as in environment.yml but with different version). Could you please provide additional information about the following:

Can I use any minor version of python 3.8 (e.g., 3.8.5) for feature development?
How can I reproduce development environment to be compatible with required dependencies? I suppose that requirements.txt is out-of-date, I have experience with conda/pip/poetry, but while working on Linux I consider using pip or poetry for development.
Do I need to configure my autoformatters to flake8 with arguments which you provide in GitHub Actions, or probably I can get already configured file for autoformatter somewhere?

Thank you.

rasbt commented 2 years ago

Good call, thanks for the note. I think this discrepancy is from the fact that in the original CI setup, I was testing the minimal versions and the latest versions. However, when Travis-CI stopped being free for open source, I switched to Appveyor and then to Gh Workflows, and this is I think where the discrepancy between the requirements.txt and environment.yml come from.

This needs to be updated some time #912. I think one difficulty is that the unit tests depend on newer sklearn versions due to some numerical rounding differences, but someone could also totally use an older sklearn version with mlxtend, that's not a problem at all, so I don't want to make the newest sklearn version a hard requirement.

For now, I would recommend using the versions in environment.yml (which you could install via pip). I will make a note to look into somehow restructure this (via #912).

Can I use any minor version of python 3.8 (e.g., 3.8.5) for feature development?

The Python 3.8 came from the fact that I remember having issues downloading either the miniforge installer or some packages for newer versions. I think using any minor version should be fine. In fact, I believe any Python version down to 3.6 should probably work.

Do I need to configure my autoformatters to flake8 with arguments which you provide in GitHub Actions, or probably I can get already configured file for autoformatter somewhere?

I'd say that

pip install flake8 .flake8 yournewcode

should be enough. There are a few extra arguments in the workflow but they are there for different legacy reasons. I actually want to set up black some time to take care of the formatting #911

Those are all really good points! I wish days have more than 24h. So many little things that need improvement :)

labdmitriy commented 2 years ago

Thank you for the answer! Probably 48 hours per day will be better :) Then I will use environment.yml with pip. I can try black for this feature development, if you want to check it as an example. I saw you use single and double quotes in code, so what me stops to use black usually is strict rules like double quotes, trailing commas etc. :)

rasbt commented 2 years ago

I'd say don't worry too much about the style. Basic PEP8 compliance is ok, and I am happy to help with these on the PR. I have a PEP8 plugin on GitHub that would also highlight issues if you submit a PR.

Regarding black, I want to set it up globally so that it takes care of the style in each PR. I would then also apply it retrospectively so that it will then make the single and double-quote thing more consistent. For now, I wouldn't worry about it too much. I think that unit tests and coming up with good examples are the main points.

I am also happy to help with the doc building since I am super familiar with this specific process/workflow. If you set up a basic PR, I can add to it by adding doc templates and so forth.

rasbt commented 2 years ago

Oh on a side-note, I just remember why I am installing the packages into the conda base env via workflows. Was trying

     conda config --add channels conda-forge
     conda create -n testing python=3.8
     conda activate testing
     conda env update --file environment.yml

but then it complaints about the missing conda init. Adding

conda init zsh

after creating the env still results in an error saying that the shell needs to be restarted, which is not really feasible with Gh worflows.

labdmitriy commented 2 years ago

I think that if you want to use conda for GitHub Actions, you can use solution which is recommended by GitLab, and also there is nice explanation about that. Did you try poetry? I switched to it last year and now use it in all projects except for the cases where there are some restrictions (then I use pip or conda as a last resort).

rasbt commented 2 years ago

Ah thanks, sourcing the shell works! Regarding poetry, I tried it once and it's nice. However, there is nothing wrong with conda, so I am hesitant about switching it in the CI. I am also more familiar with it as I use it on my Mac. Can't use poetry there because I have a M1-based Mac and PyPI libraries are a bit unreliable for that.

labdmitriy commented 2 years ago

Hi Sebastian,

It took me a while today to figure out how to change my current implementation to follow your style, and write tests (because I haven't used test yet in real projects, only during education), so I implemented the following steps (comparing to original code):

Install mlxtend in editable mode (pip install -e .)
Change docstring format except for private method (_calculate_split_params())
Exclude internal dependency from sklearn (_num_samples() method)
Remove verbose comments
Exclude typing annotations
Include section Yields for split method instead of section Returns in class docstring (like in sklearn TimeSeriesSplit - https://github.com/scikit-learn/scikit-learn/blob/baf828ca1/sklearn/model_selection/_split.py#L944), but if we need Returns section for documentation I will change it.
Yapf with default parameters is used for autoformatting.
Add new module to mlxtend/evaluate/__init__.py, but I have a warning with flake8 (described below in the Notes section).
Add tests with 100% coverage and fixtures to reuse common datasets

Notes

Flake8 doesn't raise any warnings for the final version of the cross-validator. I ran flake8 without arguments. However if I try to run it without arguments for ./mlxtend, I have multiple warning in multiple modules, including mlxtend/evaluate/init.py which I also changed. I decided to try arguments from your GitHub Actions configuration:

flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

The first line does not raise any warning for ./mlxtend, and the second line is forced to finish successfully because of --exit-zero argument. So it seems that contribution guide do not mention these details and these lines are more helpful in code checking. One of the examples is max length line, which probably is not checked strictly here (?), so I've used max line length as 79 and therefore maybe some formatting can seem a little weird. Update: I saw you recently started to prepare the code to the black formatter, probably then I should use it?

Also there is not import sorting and dividing to different types of imports in mlxtend/evaluate/init.py, so I've just added required lines to the end.
Because test_size argument is required, test functions naming do not include this parameter.
Tests in sklearn use split(X, y, groups) in either case, I decided to test the same approach.
I found that tensorflow is not mentioned in environment.yml, but it is required in several modules including tests (which failed without this package installed).
I still have one dependency from sklearn (indexable() method) which is frequently used in estimators, do I need to remove it?
I saw that you BSD 3 clause license - will I have any restrictions in using this modified code in my own original library (ml-lab)?
I didn’t specify the first line to the modules description - with your name and period range - what should I specify there?
Could you please provide information about further steps for documentation/examples/changelog etc.?
Where will be better to discuss the following steps - in the issue or in pull request?

Thank you.

rasbt / mlxtend