gotec / git2net

An Open Source Python package for the extraction of fine-grained and time-stamped co-editing networks from git repositories.
https://git2net.readthedocs.io
GNU Affero General Public License v3.0

Lack of sanity checks for commit.author.name crashes application #34

Status: Closed (closed by cmtg 1 year ago)

cmtg commented 1 year ago

PROBLEM:

While running the extraction on https://github.com/jekyll/jekyll.git, I got the following error message:

-----------------------8< snip------------------------------------


GitCommandError Traceback (most recent call last) in
----> 1 git.Git('/tmp/git2net/jekyll').check_mailmap('--global fen@nice.lgbt')

2 frames

/usr/local/lib/python3.8/dist-packages/git/cmd.py in <lambda>(*args, **kwargs)
    637         if name[0] == '_':
    638             return LazyMixin.__getattr__(self, name)
--> 639         return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
    640
    641     def set_persistent_git_options(self, **kwargs: Any) -> None:

/usr/local/lib/python3.8/dist-packages/git/cmd.py in _call_process(self, method, *args, **kwargs)
   1182         call.extend(args_list)
   1183
-> 1184         return self.execute(call, **exec_kwargs)
   1185
   1186     def _parse_object_header(self, header_line: str) -> Tuple[str, str, int]:

/usr/local/lib/python3.8/dist-packages/git/cmd.py in execute(self, command, istream, with_extended_output, with_exceptions, as_process, output_stream, stdout_as_string, kill_after_timeout, with_stdout, universal_newlines, shell, env, max_chunk_size, **subprocess_kwargs)
    982
    983         if with_exceptions and status != 0:
--> 984             raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
    985
    986         if isinstance(stdout_value, bytes) and stdout_as_string:  # could also be output_stream

GitCommandError: Cmd('git') failed due to: exit code(129)
  cmdline: git check-mailmap --global fen@nice.lgbt
  stderr: 'error: unknown option `global fen@nice.lgbt'
usage: git check-mailmap [<options>] <contact>...

    --stdin               also read contacts from stdin

-----------------------8< snap------------------------------------

POSSIBLE ROOT CAUSE:

As I understand it, the problem is a bogus author name in the repository: the author with the email "fen@nice.lgbt" used "--global" as a name.

A quick check with pydriller confirmed that suspicion:

from pydriller import Repository
repo = Repository("https://github.com/jekyll/jekyll.git")
set([c.author.name+' # '+c.author.email for c in repo.traverse_commits() if c.author.email.endswith('lgbt')])

Output: {'--global # fen@nice.lgbt', 'fen # fen@nice.lgbt', 'jona # fen@nice.lgbt', 'penny # penny@penny.lgbt'}

POSSIBLE SOLUTION:

A quick-and-dirty fix might be to check the name for leading dashes and then either escape them, strip them, or ignore the name entirely.
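The check suggested above could look roughly like the sketch below. It covers only the "strip" and "ignore" variants; escaping is left out because it depends on how the name is later handed to git. The function name `sanitize_author_name` and its `mode` parameter are hypothetical and not part of git2net.

```python
from typing import Optional


def sanitize_author_name(name: Optional[str], mode: str = "strip") -> Optional[str]:
    """Return an author name that cannot be mistaken for a command-line option.

    Hypothetical helper, not part of git2net.
    mode="strip"  removes all leading dashes from the name.
    mode="ignore" drops the name entirely and returns None.
    """
    # Names without a leading dash are passed through unchanged.
    if name is None or not name.startswith("-"):
        return name
    if mode == "strip":
        cleaned = name.lstrip("-").strip()
        # A name that consisted only of dashes carries no information.
        return cleaned or None
    if mode == "ignore":
        return None
    raise ValueError(f"unknown mode: {mode!r}")
```

With the two names reported in this thread, `sanitize_author_name("--global")` yields `"global"`, and `sanitize_author_name("--author=Satoshi Nakamoto", mode="ignore")` yields `None`.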

gotec commented 1 year ago

Hi Christian

Thanks for your message. I was able to replicate the bug based on your example and have developed a fix.

Essentially, there is no way to check names that start with '--' using the git check-mailmap command, as it will always interpret the name as an option. Therefore, for these rare cases, the next best option is to check the mailmap based on the email alone, which should give the same result.
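For illustration, the email-only fallback described above might be sketched as follows. This is not the code shipped in git2net; the function names are hypothetical, and the sketch calls git through `subprocess` rather than GitPython. It relies on `git check-mailmap` accepting a contact either as "Name <email>" or as "<email>" alone, and it treats any name with a leading dash (not just '--') as unsafe, which is an assumption on top of what is described here.

```python
import subprocess
from typing import List, Optional


def build_mailmap_contact(name: Optional[str], email: str) -> str:
    """Build the contact argument for `git check-mailmap` (hypothetical helper).

    A name starting with a dash would be parsed as an option, so in that
    case (or if there is no name) only the email is used.
    """
    if name and not name.startswith("-"):
        return f"{name} <{email}>"
    return f"<{email}>"


def check_mailmap(repo_path: str, name: Optional[str], email: str) -> str:
    """Run `git check-mailmap` inside repo_path and return its output."""
    cmd: List[str] = [
        "git", "-C", repo_path, "check-mailmap",
        build_mailmap_contact(name, email),
    ]
    # check=True raises CalledProcessError if git exits with a non-zero status.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For the commit that triggered this issue, `build_mailmap_contact("--global", "fen@nice.lgbt")` produces `"<fen@nice.lgbt>"`, so nothing that looks like an option ever reaches git.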

My test of this solution, mining the entire Jekyll repo, is still running. I will release a new version of git2net with the fix as soon as this is completed.

Cheers, Christoph

cmtg commented 1 year ago

Hello Christoph,

Thanks for the quick answer!

I found a similar problem in the Litecoin repository (https://github.com/litecoin-project/litecoin.git). In that case the bogus author name is "--author=Satoshi Nakamoto".

Just in case you want to test the new code with more than one repository.

Thanks again

Christian

gotec commented 1 year ago

Hi Christian

I’ll test it with litecoin too.

Out of curiosity: At this point I have mined over 20,000 repositories with git2net but never encountered this case. Are you using any specific options during mining? Which version of git and which OS are you using?

I'm asking because, as I said, my test on real data is still running, so I'm curious whether I will actually be able to replicate the error there too.

Cheers, Christoph

cmtg commented 1 year ago

Hello Christoph,

this is (essentially) the code I am running:

import os
import git2net
os.system("cd /tmp; git clone https://github.com/litecoin-project/litecoin.git")
git2net.mine_git_repo("/tmp/litecoin", "/tmp/litecoin-project__litecoin.db")

I am using Google Colab Notebooks, so maybe versioning is part of the root cause.

Please find below the output of pip3 freeze:

absl-py==1.3.0 aeppl==0.0.33 aesara==2.7.9 aiohttp==3.8.3 aiosignal==1.3.1 alabaster==0.7.12 albumentations==1.2.1 altair==4.2.0 appdirs==1.4.4 arviz==0.12.1 astor==0.8.1 astropy==4.3.1 astunparse==1.6.3 async-timeout==4.0.2 atari-py==0.2.9 atomicwrites==1.4.1 attrs==22.2.0 audioread==3.0.0 autograd==1.5 Babel==2.11.0 backcall==0.2.0 beautifulsoup4==4.6.3 bleach==5.0.1 blis==0.7.9 bokeh==2.3.3 branca==0.6.0 bs4==0.0.1 CacheControl==0.12.11 cachetools==5.2.1 catalogue==2.0.8 certifi==2022.12.7 cffi==1.15.1 cftime==1.6.2 chardet==4.0.0 charset-normalizer==2.1.1 click==7.1.2 clikit==0.6.2 cloudpickle==2.2.0 cmake==3.22.6 cmdstanpy==1.0.8 colorcet==3.0.1 colorlover==0.3.0 community==1.0.0b1 confection==0.0.3 cons==0.4.5 contextlib2==0.5.5 convertdate==2.4.0 crashtest==0.3.1 crcmod==1.7 cufflinks==0.17.3 cvxopt==1.3.0 cvxpy==1.2.3 cycler==0.11.0 cymem==2.0.7 Cython==0.29.33 daft==0.0.4 dask==2022.2.1 datascience==0.17.5 db-dtypes==1.0.5 dbus-python==1.2.16 debugpy==1.0.0 decorator==4.4.2 defusedxml==0.7.1 descartes==1.1.0 dill==0.3.6 distributed==2022.2.1 dlib==19.24.0 dm-tree==0.1.8 dnspython==2.2.1 docutils==0.16 dopamine-rl==1.0.5 earthengine-api==0.1.335 easydict==1.10 ecos==2.0.12 editdistance==0.5.3 en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl entrypoints==0.4 ephem==4.1.4 et-xmlfile==1.1.0 etils==1.0.0 etuples==0.3.8 fa2==0.3.5 fastai==2.7.10 fastcore==1.5.27 fastdownload==0.0.7 fastdtw==0.3.4 fastjsonschema==2.16.2 fastprogress==1.0.3 fastrlock==0.8.1 feather-format==0.4.1 filelock==3.9.0 firebase-admin==5.3.0 fix-yahoo-finance==0.0.22 Flask==1.1.4 flatbuffers==1.12 folium==0.12.1.post1 frozenlist==1.3.3 fsspec==2022.11.0 future==0.16.0 gambit-disambig==1.0.3 gast==0.4.0 GDAL==3.0.4 gdown==4.4.0 gensim==3.6.0 geographiclib==1.52 geopy==1.17.0 gin-config==0.5.0 git2net==1.6.1 gitdb==4.0.10 GitPython==3.1.27 glob2==0.7 google==2.0.3 google-api-core==2.11.0 
google-api-python-client==2.70.0 google-auth==2.16.0 google-auth-httplib2==0.1.0 google-auth-oauthlib==0.4.6 google-cloud-bigquery==3.4.1 google-cloud-bigquery-storage==2.17.0 google-cloud-core==2.3.2 google-cloud-datastore==2.11.1 google-cloud-firestore==2.7.3 google-cloud-language==2.6.1 google-cloud-storage==2.7.0 google-cloud-translate==3.8.4 google-colab @ file:///colabtools/dist/google-colab-1.0.0.tar.gz google-crc32c==1.5.0 google-pasta==0.2.0 google-resumable-media==2.4.0 googleapis-common-protos==1.58.0 googledrivedownloader==0.4 graphviz==0.10.1 greenlet==2.0.1 grpcio==1.51.1 grpcio-status==1.48.2 gspread==3.4.2 gspread-dataframe==3.0.8 gym==0.25.2 gym-notices==0.0.8 h5py==3.1.0 HeapDict==1.0.1 hijri-converter==2.2.4 holidays==0.18 holoviews==1.14.9 html5lib==1.0.1 httpimport==0.5.18 httplib2==0.17.4 httpstan==4.6.1 humanize==0.5.1 hyperopt==0.1.2 idna==2.10 imageio==2.9.0 imagesize==1.4.1 imbalanced-learn==0.8.1 imblearn==0.0 imgaug==0.4.0 importlib-metadata==6.0.0 importlib-resources==5.10.2 imutils==0.5.4 inflect==2.1.0 intel-openmp==2023.0.0 intervaltree==2.1.0 ipykernel==5.3.4 ipython==7.9.0 ipython-genutils==0.2.0 ipython-sql==0.3.9 ipywidgets==7.7.1 itsdangerous==1.1.0 jax==0.3.25 jaxlib @ https://storage.googleapis.com/jax-releases/cuda11/jaxlib-0.3.25+cuda11.cudnn805-cp38-cp38-manylinux2014_x86_64.whl jieba==0.42.1 Jinja2==2.11.3 joblib==1.2.0 jpeg4py==0.1.4 jsonschema==4.3.3 jupyter-client==6.1.12 jupyter-console==6.1.0 jupyter_core==5.1.3 jupyterlab-widgets==3.0.5 kaggle==1.5.12 kapre==0.3.7 keras==2.9.0 Keras-Preprocessing==1.1.2 keras-vis==0.4.1 kiwisolver==1.4.4 korean-lunar-calendar==0.3.1 langcodes==3.3.0 Levenshtein==0.20.9 libclang==15.0.6.1 librosa==0.8.1 lightgbm==2.2.3 lizard==1.17.10 llvmlite==0.39.1 lmdb==0.99 locket==1.0.0 logical-unification==0.4.5 LunarCalendar==0.0.9 lxml==4.9.2 Markdown==3.4.1 MarkupSafe==2.0.1 marshmallow==3.19.0 matplotlib==3.2.2 matplotlib-venn==0.11.7 miniKanren==1.0.3 missingno==0.5.1 mistune==0.8.4 
mizani==0.7.3 mkl==2019.0 mlxtend==0.14.0 more-itertools==9.0.0 moviepy==0.2.3.5 mpmath==1.2.1 msgpack==1.0.4 multidict==6.0.4 multipledispatch==0.6.0 multitasking==0.0.11 murmurhash==1.0.9 music21==5.5.0 natsort==5.5.0 nbconvert==5.6.1 nbformat==5.7.1 netCDF4==1.6.2 networkx==3.0 nibabel==3.0.2 nltk==3.7 notebook==5.7.16 numba==0.56.4 numexpr==2.8.4 numpy==1.21.6 oauth2client==4.1.3 oauthlib==3.2.2 okgrade==0.4.3 opencv-contrib-python==4.6.0.66 opencv-python==4.6.0.66 opencv-python-headless==4.7.0.68 openpyxl==3.0.10 opt-einsum==3.3.0 osqp==0.6.2.post0 packaging==21.3 palettable==3.3.0 pandas==1.3.5 pandas-datareader==0.9.0 pandas-gbq==0.17.9 pandas-profiling==1.4.1 pandocfilters==1.5.0 panel==0.12.1 param==1.12.3 parso==0.8.3 partd==1.3.0 pastel==0.2.1 pathlib==1.0.1 pathpy2==2.2.0 pathy==0.10.1 patsy==0.5.3 pep517==0.13.0 pexpect==4.8.0 pickleshare==0.7.5 Pillow==7.1.2 pip-tools==6.6.2 platformdirs==2.6.2 plotly==5.5.0 plotnine==0.8.0 pluggy==0.7.1 pooch==1.6.0 portpicker==1.3.9 prefetch-generator==1.0.3 preshed==3.0.8 prettytable==3.6.0 progressbar2==3.38.0 prometheus-client==0.15.0 promise==2.3 prompt-toolkit==2.0.10 prophet==1.1.1 proto-plus==1.22.2 protobuf==3.19.6 psutil==5.4.8 psycopg2==2.9.5 ptyprocess==0.7.0 py==1.11.0 pyarrow==9.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycocotools==2.0.6 pycparser==2.21 pyct==0.4.8 pydantic==1.10.4 pydata-google-auth==1.5.0 pydot==1.3.0 pydot-ng==2.0.0 pydotplus==2.0.2 PyDriller==2.1 PyDrive==1.3.1 pyemd==0.5.1 pyerfa==2.0.0.1 Pygments==2.6.1 PyGObject==3.36.0 pyjarowinkler==1.8 pylev==1.4.0 pymc==4.1.4 PyMeeus==0.5.12 pymongo==4.3.3 pymystem3==0.2.0 PyOpenGL==3.1.6 pyparsing==3.0.9 pyrsistent==0.19.3 pysimdjson==3.2.0 PySocks==1.7.1 pystan==3.3.0 pytest==3.6.4 python-apt==2.0.1 python-dateutil==2.8.2 python-Levenshtein==0.20.9 python-louvain==0.16 python-slugify==7.0.0 python-utils==3.4.5 pytz==2022.7 pyviz-comms==2.2.1 PyWavelets==1.4.1 PyYAML==6.0 pyzmq==23.2.1 qdldl==0.1.5.post2 qudida==0.0.4 rapidfuzz==2.13.7 
regex==2022.6.2 requests==2.25.1 requests-oauthlib==1.3.1 requests-unixsocket==0.2.0 resampy==0.4.2 rpy2==3.5.5 rsa==4.9 scikit-image==0.18.3 scikit-learn==1.0.2 scipy==1.7.3 screen-resolution-extra==0.0.0 scs==3.2.2 seaborn==0.11.2 Send2Trash==1.8.0 setuptools-git==1.2 shapely==2.0.0 six==1.15.0 sklearn-pandas==1.8.0 smart-open==6.3.0 smmap==5.0.0 snowballstemmer==2.2.0 sortedcontainers==2.4.0 soundfile==0.11.0 spacy==3.4.4 spacy-legacy==3.0.11 spacy-loggers==1.0.4 Sphinx==3.5.4 sphinxcontrib-devhelp==1.0.2 sphinxcontrib-htmlhelp==2.0.0 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.3 sphinxcontrib-serializinghtml==1.1.5 sphinxcontrib.applehelp==1.0.3 SQLAlchemy==1.4.46 sqlparse==0.4.3 srsly==2.4.5 statsmodels==0.12.2 sympy==1.7.1 tables==3.7.0 tabulate==0.8.10 tblib==1.7.0 tenacity==8.1.0 tensorboard==2.9.1 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorflow==2.9.2 tensorflow-datasets==4.8.1 tensorflow-estimator==2.9.0 tensorflow-gcs-config==2.9.1 tensorflow-hub==0.12.0 tensorflow-io-gcs-filesystem==0.29.0 tensorflow-metadata==1.12.0 tensorflow-probability==0.17.0 termcolor==2.2.0 terminado==0.13.3 testpath==0.6.0 text-unidecode==1.3 textblob==0.15.3 thinc==8.1.6 threadpoolctl==3.1.0 tifffile==2022.10.10 toml==0.10.2 tomli==2.0.1 toolz==0.12.0 torch @ https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl torchaudio @ https://download.pytorch.org/whl/cu116/torchaudio-0.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl torchsummary==1.5.1 torchtext==0.14.1 torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp38-cp38-linux_x86_64.whl tornado==6.0.4 tqdm==4.64.1 traitlets==5.7.1 tweepy==3.10.0 typeguard==2.7.1 typer==0.7.0 types-pytz==2022.7.1.0 typing_extensions==4.4.0 tzlocal==1.5.1 Unidecode==1.3.6 uritemplate==4.1.1 urllib3==1.24.3 vega-datasets==0.9.0 wasabi==0.10.1 wcwidth==0.2.5 webargs==8.2.0 webencodings==0.5.1 Werkzeug==1.0.1 widgetsnbextension==3.6.1 wordcloud==1.8.2.2 
wrapt==1.14.1 xarray==2022.12.0 xarray-einstats==0.4.0 xgboost==0.90 xkit==0.0.0 xlrd==1.2.0 xlwt==1.3.0 yarl==1.8.2 yellowbrick==1.5 zict==2.2.0 zipp==3.11.0
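The freeze output above does not include the git version or the OS. A shorter report covering only the pieces asked about could be gathered with a sketch like the one below. The function name `environment_report` is hypothetical, and the default package list is simply the three packages from this thread.

```python
import platform
import subprocess
from importlib import metadata
from typing import Dict, Sequence


def environment_report(
    packages: Sequence[str] = ("git2net", "GitPython", "PyDriller"),
) -> Dict[str, str]:
    """Collect OS, Python, git, and selected package versions (hypothetical helper)."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    try:
        out = subprocess.run(
            ["git", "--version"], capture_output=True, text=True, check=True
        )
        report["git"] = out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is missing from PATH or exited with an error.
        report["git"] = "not found"
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```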

gotec commented 1 year ago

Hi Christian

I've just released git2net 1.6.2 which includes a fix for this.

Interestingly, on my machine the respective commits did not throw an error but simply never finished. Due to the multicore processing, the commits were ultimately just never processed but did not cause the rest of the mining process to break.

With the fix, all commits for both jekyll and litecoin were mined for me.

If it works for you too, please feel free to close this issue. Otherwise, I will do so in a couple of days.

Cheers, Christoph

cmtg commented 1 year ago

I have already included the new version in my notebook. Thanks for the quick release!