gotec / git2net

An Open Source Python package for the extraction of fine-grained and time-stamped co-editing networks from git repositories.
https://git2net.readthedocs.io
GNU Affero General Public License v3.0

Pandas (soon to be) deprecated methods, errors when resuming repo #41

Closed wschuell closed 2 months ago

wschuell commented 4 months ago

Hi @gotec, Finally using git2net again!

Describe the bug

pandas is deprecating a number of behaviours and becoming stricter, apparently in preparation for pandas 3. Some of these changes only trigger warnings so far, while others already trigger errors. Most of them are about dtypes that now have to be set explicitly instead of being inferred.

To Reproduce

Running the tests with pytest already triggers some of them:

 gambit/algorithms/gambit.py:262: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
  You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
  A typical example is when you are setting values in a column of a DataFrame, like:

  df["col"][row_indexer] = value

  Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

    authors['author_id'][idx2] = authors.loc[idx1, 'author_id']

 git2net/git2net/visualisation.py:378: FutureWarning: Series.view is deprecated and will be removed in a future version. Use ``astype`` as an alternative to change the dtype.
    zip(pd.to_datetime(data.time, format='%Y-%m-%d %H:%M:%S').view('int64'),

tests/test_functions.py::test_process_commit_merge
tests/test_functions.py::test_process_commit_merge
tests/test_functions.py::test_process_commit_merge2
tests/test_functions.py::test_process_commit_merge2
  git2net/git2net/extraction.py:917: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'accepted' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
    comp.loc[comp['_merge'] == 'both', '_action'] = 'accepted'
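For reference, each of the three warnings above has a fairly mechanical replacement. The following is only a minimal sketch on toy data: the frames `authors`, `data` and `comp` are hypothetical stand-ins that reuse the names from the warning messages, and the real columns in gambit and git2net may differ.

```python
import numpy as np
import pandas as pd

# 1. Chained assignment: select row and column in a single .loc call.
#    (toy frame; idx1/idx2 are assumed to be index labels)
authors = pd.DataFrame({"author_id": [0, 1, 2]})
idx1, idx2 = 0, 2
authors.loc[idx2, "author_id"] = authors.loc[idx1, "author_id"]

# 2. Series.view('int64'): use astype('int64') instead.
#    The integer unit follows the datetime resolution of the Series.
data = pd.DataFrame({"time": ["2024-06-06 13:40:03", "2024-06-06 13:40:04"]})
times = pd.to_datetime(data["time"], format="%Y-%m-%d %H:%M:%S").astype("int64")

# 3. Incompatible dtype: cast the float64 (all-NaN) column to object
#    before writing strings into it.
comp = pd.DataFrame({"_merge": ["both", "left_only"],
                     "_action": [np.nan, np.nan]})
comp["_action"] = comp["_action"].astype(object)
comp.loc[comp["_merge"] == "both", "_action"] = "accepted"
```

Whether casting to object is the right choice for `_action` depends on how the column is used later; creating it with a string-capable dtype from the start would be an alternative.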

There is also an error that only appears as a warning in the tests but is raised when mining the git2net repo manually:

[2024-06-06 13:40:03]  git2net:INFO       Provided folder is not empty.
[2024-06-06 13:40:03]  git2net:INFO          Skipping the cloning and trying to resume.
[2024-06-06 13:40:03]  git2net:INFO       Found no database on provided path. Starting from scratch.
[2024-06-06 13:40:03]  git2net:ERROR      processing error: 40cc53f783aeb835fbec20f4d5e165af4e24fd32                                    
Serial:   0%|                                                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "git2net/blah/blah.py", line 8, in <module>
    git2net.mine_github(github_url, f'cloned_repos/{git_repo_dir}', sqlite_db_file,
  File "git2net/git2net/extraction.py", line 1916, in mine_github
    mine_git_repo(git_repo_dir, sqlite_db_file, **kwargs)
  File "git2net/git2net/extraction.py", line 1856, in mine_git_repo
    _process_repo_serial(git_repo_dir, sqlite_db_file, u_commits,
  File "git2net/git2net/extraction.py", line 1340, in _process_repo_serial
    _log_commit_results(log, exception)
  File "git2net/git2net/extraction.py", line 1308, in _log_commit_results
    raise Exception(exception)
Exception: git2net/git2net/extraction.py:1257: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_edits = pd.concat(

Two possible solutions here: initialise the empty df with explicit dtypes, or test for emptiness (df.empty) to know when the concat is unnecessary.
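Both options can be sketched roughly as follows. This is not the git2net code: `concat_non_empty` is a hypothetical helper and the column name `a` is made up for illustration. Note that it only filters out empty frames; frames with all-NA columns, which the warning also mentions, would need separate handling.

```python
import pandas as pd


def concat_non_empty(frames):
    """Concatenate only the non-empty frames (hypothetical helper)."""
    frames = list(frames)
    non_empty = [f for f in frames if not f.empty]
    if non_empty:
        return pd.concat(non_empty, ignore_index=True)
    # Nothing to concatenate: return a copy of the first frame so that
    # its columns and dtypes are preserved, or a bare frame if none exist.
    return frames[0].copy() if frames else pd.DataFrame()


# Option 1: an empty frame created with an explicit dtype.
typed_empty = pd.DataFrame({"a": pd.Series(dtype="int64")})

# Option 2: skip empty frames before concatenating.
df_edits = concat_non_empty([typed_empty, pd.DataFrame({"a": [1, 2]})])
```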

I could not replicate that in the tests, but I could work around it by explicitly ignoring the warnings (see the relevant lines at the beginning of the script). Even if they are just warnings at the moment, it would probably be a good idea to resolve them before pandas 3.

import warnings

warnings.resetwarnings()
warnings.simplefilter(action='ignore', category=FutureWarning)

import os
import git2net

git_repo_dir = 'git2net'
github_url = f'gotec/{git_repo_dir}'
sqlite_db_file = f'{git_repo_dir}_git2net.db'

if not os.path.exists('cloned_repos'):
    os.makedirs('cloned_repos')

git2net.mine_github(github_url, f'cloned_repos/{git_repo_dir}', sqlite_db_file,
                    no_of_processes=1,
                    commits=[
                        '40cc53f783aeb835fbec20f4d5e165af4e24fd32',
                    ]
                    )

The corresponding test entry (which never fails):

def test_mine_github_prob(github_url_short, github_repo_dir, sqlite_db_file):
    git2net.mine_github(github_url_short, github_repo_dir, sqlite_db_file,
                        no_of_processes=1,
                        commits=[
                            '40cc53f783aeb835fbec20f4d5e165af4e24fd32',
                        ]
                        )
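One possible way to make such a test fail as soon as a FutureWarning is emitted is to escalate that warning category to an exception while the code under test runs. The sketch below uses only the standard library; `run_strict` and `noisy` are hypothetical names, and whether this reproduces the error seen in the script has not been checked.

```python
import warnings


def run_strict(func, *args, **kwargs):
    """Call func with FutureWarning escalated to an exception (hypothetical helper)."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)
        return func(*args, **kwargs)


def noisy():
    # Stand-in for library code that emits a deprecation-style warning.
    warnings.warn("behaviour will change", FutureWarning)
    return "done"


try:
    run_strict(noisy)
    escalated = False
except FutureWarning:
    escalated = True
```

With pytest, the same effect is available through the `-W error::FutureWarning` command-line option or the `@pytest.mark.filterwarnings("error::FutureWarning")` marker on a single test.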

Desktop (please complete the following information):