Closed — vmarkovtsev closed this issue 5 years ago
I will launch https://github.com/src-d/style-analyzer/blob/master/lookout/style/typos/research/eval_dataset/prepare_dataset.py (with small modifications to read the new data format) on this dataset to better understand the problems.
Also, @irinakhismatullina added additional advanced filtering in https://github.com/src-d/style-analyzer/blob/master/lookout/style/typos/research/eval_dataset/filter_dataset.py to solve https://github.com/src-d/ml-backlog/issues/57. I think it is worth running it too.
yep, worth adding this step
The initial number of samples is 1364 (one sample contains exactly one typo).
Number of good samples after deduplication: 1354. Not a huge difference, probably because of subsampling.
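The deduplication step can be sketched as follows; a minimal order-preserving version, assuming each sample is a (typo, correction) identifier pair (the real schema in prepare_dataset.py may differ):

```python
def deduplicate(samples):
    """Keep only the first occurrence of each (typo, correction) pair."""
    seen = set()
    unique = []
    for typo, correction in samples:
        if (typo, correction) not in seen:
            seen.add((typo, correction))
            unique.append((typo, correction))
    return unique

samples = [("recieve", "receive"), ("recieve", "receive"), ("lenght", "length")]
print(deduplicate(samples))  # [('recieve', 'receive'), ('lenght', 'length')]
```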
Number of good samples where both identifiers split into the same number of tokens (pairs like OneTwoThree & OneTwo are filtered out): 950. Bad examples:
getSafeNode | getOrSetNode
start | realtimeRecognizerStart
ReadAllLines | ReadLines
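The token-count check above can be sketched like this; the splitting regex is an assumption about how identifiers are tokenized (the project uses its own identifier splitter), but it illustrates why getSafeNode | getOrSetNode is rejected (3 vs. 4 tokens):

```python
import re

def split_identifier(name):
    """Split a camelCase / snake_case identifier into lowercase tokens."""
    tokens = []
    for part in name.split("_"):
        # ALLCAPS runs, Capitalized/lowercase words, and digit runs.
        tokens.extend(m.group(0).lower()
                      for m in re.finditer(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return tokens

def same_token_count(a, b):
    return len(split_identifier(a)) == len(split_identifier(b))

print(split_identifier("getOrSetNode"))                 # ['get', 'or', 'set', 'node']
print(same_token_count("getSafeNode", "getOrSetNode"))  # False: 3 vs 4 tokens
```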
Number of good samples with small Damerau-Levenshtein distance: 817
Bad examples:
storyMenuY | tabMenuY
exitGate | exitLatch
availClaims | totalClaims
Number of good samples where the identifiers differ not only in naming style (pairs like Interchange_level & InterchangeLevel are filtered out): 876
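The naming-style check can be sketched by normalizing away case and underscores and comparing the results; a minimal version (the real filter may normalize more aggressively):

```python
def style_normalize(name):
    """Collapse naming-style differences: drop underscores and lowercase,
    so Interchange_level and InterchangeLevel map to the same string."""
    return name.replace("_", "").lower()

def differ_only_in_style(a, b):
    """True for pairs that should be filtered out at this step."""
    return a != b and style_normalize(a) == style_normalize(b)

print(differ_only_in_style("Interchange_level", "InterchangeLevel"))  # True
print(differ_only_in_style("exitGate", "exitLatch"))                  # False
```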
Number of good samples that are not equal after lemmatization is 834
Bad examples:
getTrigger | getTriggers
record | records
deleteResource | deleteResources
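The lemmatization check rejects pairs that become identical once inflection is removed, e.g. record | records. As a self-contained illustration only, here is a naive plural-stripping stand-in for a real lemmatizer (the research code would use a proper NLP lemmatizer such as NLTK's, not this heuristic):

```python
def naive_lemma(token):
    """Crude stand-in for a lemmatizer: strip a trailing plural 's'."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def equal_after_lemmatization(a_tokens, b_tokens):
    """True for pairs that should be filtered out at this step."""
    return [naive_lemma(t) for t in a_tokens] == [naive_lemma(t) for t in b_tokens]

print(equal_after_lemmatization(["get", "trigger"], ["get", "triggers"]))   # True
print(equal_after_lemmatization(["delete", "resource"], ["exit", "gate"]))  # False
```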
Due to a bug in the typos dataset extraction code, only a small number of typos was detected in 120k random repositories. I need to apply our post-filtering to the results to measure how many of them are real typos. Then we can estimate the chance of finding a typo in a random repository.
Data: typos_random.zip