rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/
Other
4.83k stars 855 forks source link

Add group time series validation #910

Closed labdmitriy closed 2 years ago

labdmitriy commented 2 years ago

Describe the workflow you want to enable

Add support for cross-validation for sequences with continuous groups, especially for time series data.

Describe your proposed solution

Initially I wrote it for my personal tasks, but after that I decided to publish it on GitHub (https://github.com/labdmitriy/ml-lab), wrote an article about that (https://medium.com/@labdmitriy/advanced-group-time-series-validation-bb00d4a74bcc), and provided Jupyter Notebook with examples (https://github.com/labdmitriy/ml-lab/blob/master/notebooks/GroupTimeSeriesSplit.ipynb). Also @alexandrnikitin provided additional feature to support group names in addition to group numbers.

Describe alternatives you've considered, if relevant

Additional context

This is the feature which we discussed here

labdmitriy commented 2 years ago

Hi Sebastian,

I'm exploring contributing process based on your guide and trying to prepare environment for feature development. I found that you have requirements.txt for pip and environment.yml for conda with different list and versions of the packages (for example, pandas>=1.3.4 in environment.yml and pandas>=0.24.2 in requirements.txt). Also I check GitHub Actions, and check that you use python version 3.8 and conda environment based on "base" environment, environment.yml file and following conda/pip installations during the test (and some packages are the same as in environment.yml but with different version). Could you please provide additional information about the following:

Thank you.

rasbt commented 2 years ago

Good call, thanks for the note. I think this discrepancy is from the fact that in the original CI setup, I was testing the minimal versions and the latest versions. However, when Travis-CI stopped being free for open source, I switched to Appveyor and then to Gh Workflows, and this is I think where the discrepancy between the requirements.txt and environment.yml come from.

This needs to be updated some time #912. I think one difficulty is that the unit tests depend on newer sklearn versions due to some numerical rounding differences, but someone could also totally use an older sklearn version with mlxtend, that's not a problem at all, so I don't want to make the newest sklearn version a hard requirement.

For now, I would recommend using the versions in environment.yml (which you could install via pip). I will make a note to look into somehow restructure this (via #912).

Can I use any minor version of python 3.8 (e.g., 3.8.5) for feature development?

The Python 3.8 came from the fact that I remember having issues downloading either the miniforge installer or some packages for newer versions. I think using any minor version should be fine. In fact, I believe any Python version down to 3.6 should probably work.

Do I need to configure my autoformatters to flake8 with arguments which you provide in GitHub Actions, or probably I can get already configured file for autoformatter somewhere?

I'd say that

pip install flake8 .flake8 yournewcode

should be enough. There are a few extra arguments in the workflow but they are there for different legacy reasons. I actually want to set up black some time to take care of the formatting #911

Those are all really good points! I wish days have more than 24h. So many little things that need improvement :)

labdmitriy commented 2 years ago

Thank you for the answer! Probably 48 hours per day will be better :) Then I will use environment.yml with pip. I can try black for this feature development, if you want to check it as an example. I saw you use single and double quotes in code, so what me stops to use black usually is strict rules like double quotes, trailing commas etc. :)

rasbt commented 2 years ago

I'd say don't worry too much about the style. Basic PEP8 compliance is ok, and I am happy to help with these on the PR. I have a PEP8 plugin on GitHub that would also highlight issues if you submit a PR.

Regarding black, I want to set it up globally so that it takes care of the style in each PR. I would then also apply it retrospectively so that it will then make the single and double-quote thing more consistent. For now, I wouldn't worry about it too much. I think that unit tests and coming up with good examples are the main points.

I am also happy to help with the doc building since I am super familiar with this specific process/workflow. If you set up a basic PR, I can add to it by adding doc templates and so forth.

rasbt commented 2 years ago

Oh on a side-note, I just remember why I am installing the packages into the conda base env via workflows. Was trying

     conda config --add channels conda-forge
     conda create -n testing python=3.8
     conda activate testing
     conda env update --file environment.yml

but then it complaints about the missing conda init. Adding

conda init zsh

after creating the env still results in an error saying that the shell needs to be restarted, which is not really feasible with Gh worflows.

labdmitriy commented 2 years ago

I think that if you want to use conda for GitHub Actions, you can use solution which is recommended by GitLab, and also there is nice explanation about that. Did you try poetry? I switched to it last year and now use it in all projects except for the cases where there are some restrictions (then I use pip or conda as a last resort).

rasbt commented 2 years ago

Ah thanks, sourcing the shell works! Regarding poetry, I tried it once and it's nice. However, there is nothing wrong with conda, so I am hesitant about switching it in the CI. I am also more familiar with it as I use it on my Mac. Can't use poetry there because I have a M1-based Mac and PyPI libraries are a bit unreliable for that.

labdmitriy commented 2 years ago

Hi Sebastian,

It took me a while today to figure out how to change my current implementation to follow your style, and write tests (because I haven't used test yet in real projects, only during education), so I implemented the following steps (comparing to original code):

Notes

flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

The first line does not raise any warning for ./mlxtend, and the second line is forced to finish successfully because of --exit-zero argument. So it seems that contribution guide do not mention these details and these lines are more helpful in code checking. One of the examples is max length line, which probably is not checked strictly here (?), so I've used max line length as 79 and therefore maybe some formatting can seem a little weird. Update: I saw you recently started to prepare the code to the black formatter, probably then I should use it?

Thank you.