Add typos preprocessing notebook - Githubissues

src-d / style-analyzer

Lookout Style Analyzer: fixing code formatting and typos during code reviews

GNU Affero General Public License v3.0

32 stars 21 forks

Add typos preprocessing notebook #728

Closed EgorBu closed 5 years ago

EgorBu commented 5 years ago

a notebook
an initial dataset with typos

Signed-off-by: egor <egor@sourced.tech>

zurk commented 5 years ago

@vmarkovtsev can you tell us how this dataset was generated? I think it is important to add a note here. And do we pin the commit hash to the one where the typo was introduced?

EgorBu commented 5 years ago

That is definitely not in this dataset. It has only 1 commit hash.

vmarkovtsev commented 5 years ago

@zurk Generating this dataset was a bloody hell which I am ashamed to even mention here :D I will upload some scripts and docs to research once I have time

vmarkovtsev commented 5 years ago

Commit hashes should be added by somebody; there are no resources for it currently.

zurk commented 5 years ago

@EgorBu can I ask you to add the commit hashes where each typo was introduced? It should be possible with git blame.

vmarkovtsev commented 5 years ago

There is a problem though: the line which is saved in the dataset is a line in the new commit.

vmarkovtsev commented 5 years ago

However, due to a lucky bug, I think it should match the line in the old commit. We should check it.
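A minimal sketch of the git blame lookup discussed above, not the project's actual tooling. It assumes the dataset stores a file path, a line number, and the hash of the commit that fixed the typo, and that the stored line number also holds in the parent of that commit (the point that still needs checking). The function names `blame_command`, `parse_blame_hash`, and `introducing_commit` are hypothetical. Blaming the parent revision (`<fix_commit>^`) matters because blaming the fixing commit itself would just return the fix.

```python
import subprocess
from typing import List

HEX = set("0123456789abcdef")


def blame_command(path: str, line: int, rev: str) -> List[str]:
    """Build a `git blame` call for a single line of `path` as of `rev`."""
    return ["git", "blame", "--porcelain", "-L", f"{line},{line}", rev, "--", path]


def parse_blame_hash(porcelain: str) -> str:
    """Return the commit hash from the first header line of porcelain output.

    The first line has the form "<sha> <orig_line> <final_line> <count>".
    """
    lines = porcelain.splitlines()
    if not lines:
        raise ValueError("empty blame output")
    sha = lines[0].split(" ", 1)[0]
    # 40 hex digits for SHA-1 repositories, 64 for SHA-256 ones.
    if len(sha) not in (40, 64) or not set(sha) <= HEX:
        raise ValueError(f"unexpected blame header: {lines[0]!r}")
    return sha


def introducing_commit(repo: str, path: str, line: int, fix_commit: str) -> str:
    """Blame the parent of the fixing commit to find who wrote the typo line.

    Assumption: `line` is valid in the parent revision as well.
    """
    result = subprocess.run(
        blame_command(path, line, fix_commit + "^"),
        cwd=repo,
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_blame_hash(result.stdout)
```

If the line numbers turn out not to match between the two commits, the old line number would have to be recovered from the diff of the fixing commit before blaming.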

vmarkovtsev commented 5 years ago

@EgorBu CI must pass; most likely we need another exclusion for research.

EgorBu commented 5 years ago

======================================================================
ERROR: test_train_from_scratch (lookout.style.typos.tests.test_preparation.TrainingTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/style-analyzer/lookout/style/typos/tests/test_preparation.py", line 132, in test_train_from_scratch
    model = train_from_scratch(config)
  File "/style-analyzer/lookout/style/typos/preparation.py", line 268, in train_from_scratch
    prepared_data = prepare_data(config["preparation"])
  File "/style-analyzer/lookout/style/typos/preparation.py", line 82, in prepare_data
    data = pandas.read_csv(raw_data_path, index_col=0, keep_default_na=False)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 449, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 818, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1049, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1695, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 562, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 760, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2197, in pandas._libs.parsers.raise_parser_error
  File "/usr/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.6/_compression.py", line 103, in read
    data = self._decompressor.decompress(rawblock, size)
_lzma.LZMAError: Input format not supported by decoder

It failed on only one Python version and passed on all the others :suspect:

zurk commented 5 years ago

@EgorBu I had the same problem. Rerunning helped me solve it: https://github.com/src-d/style-analyzer/pull/721#issuecomment-477525516
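The thread does not establish why the decoder rejected the file. One plausible explanation, consistent with a rerun fixing it, is that the `.xz` dataset was fetched incompletely or replaced by non-xz content on that one CI job. Under that assumption, a small check like the sketch below could run before `pandas.read_csv` to make such a failure easier to diagnose. The helper names `looks_like_xz` and `is_readable_xz` are hypothetical and use only the standard library.

```python
import lzma

# Every .xz stream starts with these six bytes.
XZ_MAGIC = b"\xfd7zXZ\x00"


def looks_like_xz(path: str) -> bool:
    """Cheap check: does the file begin with the xz magic bytes?"""
    with open(path, "rb") as fh:
        return fh.read(len(XZ_MAGIC)) == XZ_MAGIC


def is_readable_xz(path: str) -> bool:
    """Full check: decompress the whole file and report whether it succeeds."""
    if not looks_like_xz(path):
        return False
    try:
        with lzma.open(path, "rb") as fh:
            # Read in 1 MiB chunks so large datasets are not held in memory.
            while fh.read(1 << 20):
                pass
    except (lzma.LZMAError, EOFError):
        # LZMAError: corrupt data; EOFError: the stream was cut short.
        return False
    return True
```

A caller could delete and re-download the file when `is_readable_xz` returns False, instead of relying on a manual CI rerun.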