vmarkovtsev commented 5 years ago

Due to a bug in the typos dataset extraction code, I detected a small number of typos in 120k random repositories. I need to apply our post-filtering to the results to measure how many of them are real typos. Then we can estimate the probability of finding a typo by chance.

Data: typos_random.zip

EgorBu commented 5 years ago

I will launch https://github.com/src-d/style-analyzer/blob/master/lookout/style/typos/research/eval_dataset/prepare_dataset.py (with small modifications to read the new type of data) on this dataset to better understand the problems.

zurk commented 5 years ago

Also, @irinakhismatullina added more advanced filtering to solve https://github.com/src-d/ml-backlog/issues/57 in https://github.com/src-d/style-analyzer/blob/master/lookout/style/typos/research/eval_dataset/filter_dataset.py. I think it is worth running too.

EgorBu commented 5 years ago

Yep, this step is worth adding.

EgorBu commented 5 years ago

Summary:

The initial number of samples is 1364 (each sample contains exactly one typo).

Deduplication by wrong/correct field

Number of good samples after deduplication: 1354. Not a huge difference, probably because of subsampling.
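For illustration, deduplication by the wrong/correct pair could look like the minimal sketch below. The dictionary keys `wrong` and `correct`, the `repo` field, and the sample rows are assumptions made up for this example, not the actual dataset schema or data.

```python
def deduplicate(samples):
    """Keep only the first sample seen for each (wrong, correct) pair."""
    seen = set()
    unique = []
    for sample in samples:
        key = (sample["wrong"], sample["correct"])
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


# Hypothetical rows, not taken from typos_random.zip.
samples = [
    {"wrong": "recieve", "correct": "receive", "repo": "a"},
    {"wrong": "recieve", "correct": "receive", "repo": "b"},
    {"wrong": "lenght", "correct": "length", "repo": "a"},
]
unique = deduplicate(samples)
print(len(samples), "->", len(unique))
```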

Important note: each check below is applied independently to the deduplicated dataset, to show how much each type of check matters on its own.

Check number of subtokens (bad example: `OneTwoThree` & `OneTwo`)

Number of good samples after checking the number of subtokens: 950. Bad examples:

getSafeNode | getOrSetNode
start | realtimeRecognizerStart
ReadAllLines | ReadLines
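A rough sketch of the subtoken-count check, assuming identifiers are split on underscores and camelCase boundaries. The splitting regex and the function names are illustrative choices, not necessarily what `filter_dataset.py` does.

```python
import re

# Matches an acronym run, a capitalized or lowercase word, or a digit run.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_subtokens(identifier):
    """Split an identifier into lowercase subtokens."""
    subtokens = []
    for part in re.split(r"[_\W]+", identifier):
        subtokens.extend(_SUBTOKEN.findall(part))
    return [token.lower() for token in subtokens]


def same_subtoken_count(wrong, correct):
    """A pair passes the check only if both sides have equally many subtokens."""
    return len(split_subtokens(wrong)) == len(split_subtokens(correct))


for wrong, correct in [("getSafeNode", "getOrSetNode"), ("ReadAllLines", "ReadLines")]:
    print(wrong, correct, same_subtoken_count(wrong, correct))
```

With this splitting, all three bad examples above are rejected because the two sides have different subtoken counts.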

Big Damerau-Levenshtein distance (3+ edits needed to match subtokens)

Number of good samples with a small Damerau-Levenshtein distance: 817

Bad examples:

storyMenuY | tabMenuY
exitGate | exitLatch
availClaims | totalClaims
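To make the distance check concrete, here is a small sketch using the optimal string alignment variant of Damerau-Levenshtein (adjacent transpositions count as one edit). Comparing whole lowercased identifiers rather than aligned subtoken pairs is an assumption of this sketch, and `max_edits=2` simply mirrors the "3+ edits is bad" rule above.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ie" <-> "ei".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_close(wrong, correct, max_edits=2):
    """Assumed rule: a pair is kept only if it needs fewer than 3 edits."""
    return osa_distance(wrong.lower(), correct.lower()) <= max_edits


print(is_close("recieve", "receive"), is_close("exitGate", "exitLatch"))
```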

Different identifiers because of style (example: `Interchange_level` & `InterchangeLevel`)

Number of good samples where the identifiers differ for reasons other than style: 876
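One simple way to detect style-only differences is to compare identifiers after dropping separators and case. This is a sketch under that assumption; the function names are hypothetical and the real filter may normalize differently.

```python
import re


def normalize_style(identifier):
    """Drop everything except letters and digits, then lowercase."""
    return re.sub(r"[^0-9a-zA-Z]", "", identifier).lower()


def differs_only_by_style(wrong, correct):
    """True when the two spellings differ only in case or separators."""
    return wrong != correct and normalize_style(wrong) == normalize_style(correct)


print(differs_only_by_style("Interchange_level", "InterchangeLevel"))
```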

Identifiers that are equal after lemmatization

Number of good samples that are not equal after lemmatization: 834

Bad examples:

getTrigger | getTriggers
record | records
deleteResource | deleteResources
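The idea behind this check can be sketched with a deliberately naive stand-in for a lemmatizer that only strips a plural "s". A real run would use a proper lemmatizer; `naive_lemma` and the splitting regex are assumptions made up for this example.

```python
import re

_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def naive_lemma(token):
    """Toy lemmatizer: strip one trailing 's' from longer words (assumption)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def lemmatized(identifier):
    """Lowercase subtokens of the identifier, each reduced by naive_lemma."""
    return [naive_lemma(t.lower()) for t in _SUBTOKEN.findall(identifier)]


def equal_after_lemmatization(wrong, correct):
    return lemmatized(wrong) == lemmatized(correct)


for pair in [("getTrigger", "getTriggers"), ("record", "records")]:
    print(pair, equal_after_lemmatization(*pair))
```

Even this crude version flags the three bad examples above, since they differ only in singular versus plural form.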

Number of good samples after all checks: 165 out of the 1364 initial samples.
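To put the counts in proportion, the short script below computes the share of samples each step keeps, using only the numbers reported above. The dictionary labels are shorthand introduced for this example.

```python
INITIAL = 1364
DEDUPLICATED = 1354
FINAL = 165

# Good-sample counts for each check, applied independently to the deduplicated set.
independent_checks = {
    "subtoken count": 950,
    "Damerau-Levenshtein distance": 817,
    "style": 876,
    "lemmatization": 834,
}

kept_share = {name: good / DEDUPLICATED for name, good in independent_checks.items()}
final_share = FINAL / INITIAL

print(f"deduplication keeps {DEDUPLICATED / INITIAL:.1%} of the initial samples")
for name, share in kept_share.items():
    print(f"{name} check keeps {share:.1%} of the deduplicated samples")
print(f"all checks together keep {final_share:.1%} of the initial samples")
```

Because the checks are measured independently, their individual shares do not multiply to the final share; the combined figure is simply 165 divided by 1364.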

src-d / ml-backlog

[research] Check how many "buggy" typos are real #71
