Closed cmtg closed 1 year ago
Hi Christian
Thanks for your message. I was able to replicate the bug based on your example and have developed a fix.
Essentially, there is no way to check names that start with '--' using the git check-mailmap command, as it will always interpret the name as an option. Therefore, for these rare cases, the next best option is to check the mailmap based on the email only, which should give the same result.
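A minimal sketch of that fallback, assuming contacts are handed to git check-mailmap in the usual "Name <email>" form. The helper name build_mailmap_contact is hypothetical, and this is not the actual git2net code:

```python
def build_mailmap_contact(name: str, email: str) -> str:
    """Build a contact string for `git check-mailmap`.

    Hypothetical helper for illustration; not part of git2net or GitPython.
    """
    name = name.strip()
    # A name with a leading dash would be parsed by git as a command-line
    # option, so fall back to an email-only lookup in that case.
    if not name or name.startswith("-"):
        return f"<{email}>"
    return f"{name} <{email}>"
```

With this, a bogus name such as "--global" is never passed to git; only the email is looked up.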
My test of this solution, mining the entire Jekyll repo, is still running. I will release a new version of git2net with the fix as soon as this is completed.
Cheers, Christoph
Hello Christoph,
Thanks for the quick answer!
I found a similar problem in the Litecoin repository (https://github.com/litecoin-project/litecoin.git). In that case the bogus author name is "--author=Satoshi Nakamoto".
Just in case you want to test the new code with more than one repository.
Thanks again
Christian
Hi Christian
I’ll test it with litecoin too.
Out of curiosity: At this point I have mined over 20,000 repositories with git2net but never encountered this case. Are you using any specific options during mining? Which version of git and which OS are you using?
I'm asking because, as I said, my test on real data is still running, so I'm curious whether I will actually be able to replicate the error there too.
Cheers, Christoph
Hello Christoph,
this is (essentially) the code I am running:
import os
import git2net

os.system("cd /tmp; git clone https://github.com/litecoin-project/litecoin.git")
git2net.mine_git_repo("/tmp/litecoin", "/tmp/litecoin-project__litecoin.db")
I am using Google Colab Notebooks, so maybe versioning is part of the root cause.
Please find below the output of pip3 freeze:
absl-py==1.3.0 aeppl==0.0.33 aesara==2.7.9 aiohttp==3.8.3 aiosignal==1.3.1 alabaster==0.7.12 albumentations==1.2.1 altair==4.2.0 appdirs==1.4.4 arviz==0.12.1 astor==0.8.1 astropy==4.3.1 astunparse==1.6.3 async-timeout==4.0.2 atari-py==0.2.9 atomicwrites==1.4.1 attrs==22.2.0 audioread==3.0.0 autograd==1.5 Babel==2.11.0 backcall==0.2.0 beautifulsoup4==4.6.3 bleach==5.0.1 blis==0.7.9 bokeh==2.3.3 branca==0.6.0 bs4==0.0.1 CacheControl==0.12.11 cachetools==5.2.1 catalogue==2.0.8 certifi==2022.12.7 cffi==1.15.1 cftime==1.6.2 chardet==4.0.0 charset-normalizer==2.1.1 click==7.1.2 clikit==0.6.2 cloudpickle==2.2.0 cmake==3.22.6 cmdstanpy==1.0.8 colorcet==3.0.1 colorlover==0.3.0 community==1.0.0b1 confection==0.0.3 cons==0.4.5 contextlib2==0.5.5 convertdate==2.4.0 crashtest==0.3.1 crcmod==1.7 cufflinks==0.17.3 cvxopt==1.3.0 cvxpy==1.2.3 cycler==0.11.0 cymem==2.0.7 Cython==0.29.33 daft==0.0.4 dask==2022.2.1 datascience==0.17.5 db-dtypes==1.0.5 dbus-python==1.2.16 debugpy==1.0.0 decorator==4.4.2 defusedxml==0.7.1 descartes==1.1.0 dill==0.3.6 distributed==2022.2.1 dlib==19.24.0 dm-tree==0.1.8 dnspython==2.2.1 docutils==0.16 dopamine-rl==1.0.5 earthengine-api==0.1.335 easydict==1.10 ecos==2.0.12 editdistance==0.5.3 en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl entrypoints==0.4 ephem==4.1.4 et-xmlfile==1.1.0 etils==1.0.0 etuples==0.3.8 fa2==0.3.5 fastai==2.7.10 fastcore==1.5.27 fastdownload==0.0.7 fastdtw==0.3.4 fastjsonschema==2.16.2 fastprogress==1.0.3 fastrlock==0.8.1 feather-format==0.4.1 filelock==3.9.0 firebase-admin==5.3.0 fix-yahoo-finance==0.0.22 Flask==1.1.4 flatbuffers==1.12 folium==0.12.1.post1 frozenlist==1.3.3 fsspec==2022.11.0 future==0.16.0 gambit-disambig==1.0.3 gast==0.4.0 GDAL==3.0.4 gdown==4.4.0 gensim==3.6.0 geographiclib==1.52 geopy==1.17.0 gin-config==0.5.0 git2net==1.6.1 gitdb==4.0.10 GitPython==3.1.27 glob2==0.7 google==2.0.3 google-api-core==2.11.0 
google-api-python-client==2.70.0 google-auth==2.16.0 google-auth-httplib2==0.1.0 google-auth-oauthlib==0.4.6 google-cloud-bigquery==3.4.1 google-cloud-bigquery-storage==2.17.0 google-cloud-core==2.3.2 google-cloud-datastore==2.11.1 google-cloud-firestore==2.7.3 google-cloud-language==2.6.1 google-cloud-storage==2.7.0 google-cloud-translate==3.8.4 google-colab @ file:///colabtools/dist/google-colab-1.0.0.tar.gz google-crc32c==1.5.0 google-pasta==0.2.0 google-resumable-media==2.4.0 googleapis-common-protos==1.58.0 googledrivedownloader==0.4 graphviz==0.10.1 greenlet==2.0.1 grpcio==1.51.1 grpcio-status==1.48.2 gspread==3.4.2 gspread-dataframe==3.0.8 gym==0.25.2 gym-notices==0.0.8 h5py==3.1.0 HeapDict==1.0.1 hijri-converter==2.2.4 holidays==0.18 holoviews==1.14.9 html5lib==1.0.1 httpimport==0.5.18 httplib2==0.17.4 httpstan==4.6.1 humanize==0.5.1 hyperopt==0.1.2 idna==2.10 imageio==2.9.0 imagesize==1.4.1 imbalanced-learn==0.8.1 imblearn==0.0 imgaug==0.4.0 importlib-metadata==6.0.0 importlib-resources==5.10.2 imutils==0.5.4 inflect==2.1.0 intel-openmp==2023.0.0 intervaltree==2.1.0 ipykernel==5.3.4 ipython==7.9.0 ipython-genutils==0.2.0 ipython-sql==0.3.9 ipywidgets==7.7.1 itsdangerous==1.1.0 jax==0.3.25 jaxlib @ https://storage.googleapis.com/jax-releases/cuda11/jaxlib-0.3.25+cuda11.cudnn805-cp38-cp38-manylinux2014_x86_64.whl jieba==0.42.1 Jinja2==2.11.3 joblib==1.2.0 jpeg4py==0.1.4 jsonschema==4.3.3 jupyter-client==6.1.12 jupyter-console==6.1.0 jupyter_core==5.1.3 jupyterlab-widgets==3.0.5 kaggle==1.5.12 kapre==0.3.7 keras==2.9.0 Keras-Preprocessing==1.1.2 keras-vis==0.4.1 kiwisolver==1.4.4 korean-lunar-calendar==0.3.1 langcodes==3.3.0 Levenshtein==0.20.9 libclang==15.0.6.1 librosa==0.8.1 lightgbm==2.2.3 lizard==1.17.10 llvmlite==0.39.1 lmdb==0.99 locket==1.0.0 logical-unification==0.4.5 LunarCalendar==0.0.9 lxml==4.9.2 Markdown==3.4.1 MarkupSafe==2.0.1 marshmallow==3.19.0 matplotlib==3.2.2 matplotlib-venn==0.11.7 miniKanren==1.0.3 missingno==0.5.1 mistune==0.8.4 
mizani==0.7.3 mkl==2019.0 mlxtend==0.14.0 more-itertools==9.0.0 moviepy==0.2.3.5 mpmath==1.2.1 msgpack==1.0.4 multidict==6.0.4 multipledispatch==0.6.0 multitasking==0.0.11 murmurhash==1.0.9 music21==5.5.0 natsort==5.5.0 nbconvert==5.6.1 nbformat==5.7.1 netCDF4==1.6.2 networkx==3.0 nibabel==3.0.2 nltk==3.7 notebook==5.7.16 numba==0.56.4 numexpr==2.8.4 numpy==1.21.6 oauth2client==4.1.3 oauthlib==3.2.2 okgrade==0.4.3 opencv-contrib-python==4.6.0.66 opencv-python==4.6.0.66 opencv-python-headless==4.7.0.68 openpyxl==3.0.10 opt-einsum==3.3.0 osqp==0.6.2.post0 packaging==21.3 palettable==3.3.0 pandas==1.3.5 pandas-datareader==0.9.0 pandas-gbq==0.17.9 pandas-profiling==1.4.1 pandocfilters==1.5.0 panel==0.12.1 param==1.12.3 parso==0.8.3 partd==1.3.0 pastel==0.2.1 pathlib==1.0.1 pathpy2==2.2.0 pathy==0.10.1 patsy==0.5.3 pep517==0.13.0 pexpect==4.8.0 pickleshare==0.7.5 Pillow==7.1.2 pip-tools==6.6.2 platformdirs==2.6.2 plotly==5.5.0 plotnine==0.8.0 pluggy==0.7.1 pooch==1.6.0 portpicker==1.3.9 prefetch-generator==1.0.3 preshed==3.0.8 prettytable==3.6.0 progressbar2==3.38.0 prometheus-client==0.15.0 promise==2.3 prompt-toolkit==2.0.10 prophet==1.1.1 proto-plus==1.22.2 protobuf==3.19.6 psutil==5.4.8 psycopg2==2.9.5 ptyprocess==0.7.0 py==1.11.0 pyarrow==9.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycocotools==2.0.6 pycparser==2.21 pyct==0.4.8 pydantic==1.10.4 pydata-google-auth==1.5.0 pydot==1.3.0 pydot-ng==2.0.0 pydotplus==2.0.2 PyDriller==2.1 PyDrive==1.3.1 pyemd==0.5.1 pyerfa==2.0.0.1 Pygments==2.6.1 PyGObject==3.36.0 pyjarowinkler==1.8 pylev==1.4.0 pymc==4.1.4 PyMeeus==0.5.12 pymongo==4.3.3 pymystem3==0.2.0 PyOpenGL==3.1.6 pyparsing==3.0.9 pyrsistent==0.19.3 pysimdjson==3.2.0 PySocks==1.7.1 pystan==3.3.0 pytest==3.6.4 python-apt==2.0.1 python-dateutil==2.8.2 python-Levenshtein==0.20.9 python-louvain==0.16 python-slugify==7.0.0 python-utils==3.4.5 pytz==2022.7 pyviz-comms==2.2.1 PyWavelets==1.4.1 PyYAML==6.0 pyzmq==23.2.1 qdldl==0.1.5.post2 qudida==0.0.4 rapidfuzz==2.13.7 
regex==2022.6.2 requests==2.25.1 requests-oauthlib==1.3.1 requests-unixsocket==0.2.0 resampy==0.4.2 rpy2==3.5.5 rsa==4.9 scikit-image==0.18.3 scikit-learn==1.0.2 scipy==1.7.3 screen-resolution-extra==0.0.0 scs==3.2.2 seaborn==0.11.2 Send2Trash==1.8.0 setuptools-git==1.2 shapely==2.0.0 six==1.15.0 sklearn-pandas==1.8.0 smart-open==6.3.0 smmap==5.0.0 snowballstemmer==2.2.0 sortedcontainers==2.4.0 soundfile==0.11.0 spacy==3.4.4 spacy-legacy==3.0.11 spacy-loggers==1.0.4 Sphinx==3.5.4 sphinxcontrib-devhelp==1.0.2 sphinxcontrib-htmlhelp==2.0.0 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.3 sphinxcontrib-serializinghtml==1.1.5 sphinxcontrib.applehelp==1.0.3 SQLAlchemy==1.4.46 sqlparse==0.4.3 srsly==2.4.5 statsmodels==0.12.2 sympy==1.7.1 tables==3.7.0 tabulate==0.8.10 tblib==1.7.0 tenacity==8.1.0 tensorboard==2.9.1 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorflow==2.9.2 tensorflow-datasets==4.8.1 tensorflow-estimator==2.9.0 tensorflow-gcs-config==2.9.1 tensorflow-hub==0.12.0 tensorflow-io-gcs-filesystem==0.29.0 tensorflow-metadata==1.12.0 tensorflow-probability==0.17.0 termcolor==2.2.0 terminado==0.13.3 testpath==0.6.0 text-unidecode==1.3 textblob==0.15.3 thinc==8.1.6 threadpoolctl==3.1.0 tifffile==2022.10.10 toml==0.10.2 tomli==2.0.1 toolz==0.12.0 torch @ https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl torchaudio @ https://download.pytorch.org/whl/cu116/torchaudio-0.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl torchsummary==1.5.1 torchtext==0.14.1 torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp38-cp38-linux_x86_64.whl tornado==6.0.4 tqdm==4.64.1 traitlets==5.7.1 tweepy==3.10.0 typeguard==2.7.1 typer==0.7.0 types-pytz==2022.7.1.0 typing_extensions==4.4.0 tzlocal==1.5.1 Unidecode==1.3.6 uritemplate==4.1.1 urllib3==1.24.3 vega-datasets==0.9.0 wasabi==0.10.1 wcwidth==0.2.5 webargs==8.2.0 webencodings==0.5.1 Werkzeug==1.0.1 widgetsnbextension==3.6.1 wordcloud==1.8.2.2 
wrapt==1.14.1 xarray==2022.12.0 xarray-einstats==0.4.0 xgboost==0.90 xkit==0.0.0 xlrd==1.2.0 xlwt==1.3.0 yarl==1.8.2 yellowbrick==1.5 zict==2.2.0 zipp==3.11.0
Hi Christian
I've just released git2net 1.6.2, which includes a fix for this.
Interestingly, on my machine the respective commits did not throw an error but simply never finished. Due to the multicore processing, the commits were ultimately just never processed but did not cause the rest of the mining process to break.
With the fix, all commits for both jekyll and litecoin were mined for me.
If it works for you too, please feel free to close this issue. Otherwise, I will do so in a couple of days.
Cheers, Christoph
I have already included the new version in my notebook. Thanks for the quick release!
PROBLEM:
While running the extraction on https://github.com/jekyll/jekyll.git, I got the following error message:
-----------------------8< snip------------------------------------
GitCommandError Traceback (most recent call last) in
----> 1 git.Git('/tmp/git2net/jekyll').check_mailmap('--global fen@nice.lgbt')

2 frames
/usr/local/lib/python3.8/dist-packages/git/cmd.py in <lambda>(*args, **kwargs)
637 if name[0] == '_':
638 return LazyMixin.__getattr__(self, name)
--> 639 return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
640
641 def set_persistent_git_options(self, **kwargs: Any) -> None:

/usr/local/lib/python3.8/dist-packages/git/cmd.py in _call_process(self, method, *args, **kwargs)
1182 call.extend(args_list)
1183
-> 1184 return self.execute(call, **exec_kwargs)
1185
1186 def _parse_object_header(self, header_line: str) -> Tuple[str, str, int]:

/usr/local/lib/python3.8/dist-packages/git/cmd.py in execute(self, command, istream, with_extended_output, with_exceptions, as_process, output_stream, stdout_as_string, kill_after_timeout, with_stdout, universal_newlines, shell, env, max_chunk_size, **subprocess_kwargs)
982
983 if with_exceptions and status != 0:
--> 984 raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
985
986 if isinstance(stdout_value, bytes) and stdout_as_string: # could also be output_stream

GitCommandError: Cmd('git') failed due to: exit code(129)
cmdline: git check-mailmap --global fen@nice.lgbt
stderr: 'error: unknown option `global fen@nice.lgbt'
usage: git check-mailmap [] ...
-----------------------8< snap------------------------------------
POSSIBLE ROOT CAUSE:
My understanding is that the problem is caused by a bogus author name in the repository: the author with the email "fen@nice.lgbt" used "--global" as a name.
A quick check with pydriller confirmed that suspicion:
from pydriller import Repository

repo = Repository("https://github.com/jekyll/jekyll.git")
# Collect the distinct "name # email" pairs of all authors whose email ends in 'lgbt'
set([c.author.name+' # '+c.author.email for c in repo.traverse_commits() if c.author.email.endswith('lgbt')])
Output: {'--global # fen@nice.lgbt', 'fen # fen@nice.lgbt', 'jona # fen@nice.lgbt', 'penny # penny@penny.lgbt'}
POSSIBLE SOLUTION:
A quick and dirty fix might be to check the name for leading dashes and then either escape the dashes, remove them, or ignore the name entirely.
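As a rough illustration of two of these options (removing the dashes or ignoring the name), here is a small sketch. The function name sanitize_author_name and its mode parameter are hypothetical, and escaping is left out because I am not aware of a documented way to escape such names for git:

```python
def sanitize_author_name(name: str, mode: str = "strip") -> str:
    """Neutralise author names that git would mistake for options.

    Hypothetical helper for illustration only.
    mode="strip"  removes the leading dashes from the name
    mode="ignore" drops the name entirely (returns an empty string)
    """
    if not name.startswith("-"):
        return name  # ordinary names pass through unchanged
    if mode == "strip":
        return name.lstrip("-")
    if mode == "ignore":
        return ""
    raise ValueError(f"unknown mode: {mode!r}")
```

Note that stripping changes the name, so a mailmap entry keyed on the original name would no longer match; ignoring the name and relying on the email avoids that.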