zurk closed this issue 5 years ago
I haven't checked the exact numbers; I only know that a large part of the wrong corrections comes from such cases.
Here are the first several wrong corrections:
pos | wrong | sugg 0 | correct |
---|---|---|---|
4 | RightSingleQoutationMark | RightSingleRotationMark | RightSingleQuotationMark |
5 | specifiyHeaders | specificHeaders | specifyHeaders |
6 | getFeatures | getFeatures | getFeature |
15 | SupportTypes | SupportTypes | SupportedTypes |
16 | ReportedDisabledException | ReportedDisabledException | ReporterDisabledException |
19 | LiferyPageTopPhaseListenerCompat | LifePageTopPhaseListenerCompat | LiferayPageTopPhaseListenerCompat |
20 | fuelguage | fuelguage | fuelgauge |
21 | Issue150RegressionAsync | Issue150RegressionAsync | Issue1507RegressionAsync |
24 | deliver_method | deliver_method | delivery_method |
27 | insertingAEntityWillSetItsIdentifier | insertingAEntityWillSetItsIdentifier | insertingAnEntityWillSetItsIdentifier |
30 | RunAlbumGrouWorkflows | RunAlbumGroupWorkflow | RunAlbumGroupWorkflows |
32 | colors_initialized | colors_initialized | colorsinitialized |
33 | getBlacklistedAlgorithmsURIs | getBlacklistedAlgorithmsURIs | getBlacklistedAlgorithmURIs |
35 | AcknowledgementCode | AcknowledgementCode | AcknowledgmentCode |
36 | oldToNewCfgNameMappingImpl | oldToNewCfgNameMappingImpl | OldToNewCfgNameMappingImpl |
So, the problems I see here:
And that's only the first 15 lines, so there may be more further on. I suggest we look into it when the time comes to work on the issue.
@irinakhismatullina I think this one is done in https://github.com/src-d/style-analyzer/pull/763, right?
Most of it, yes. I've removed as many bad examples as possible without removing any real typos. In the future we may need more filtering, but for now, imo, it's done.
After we have the final dataset collected we should improve our filtering tool: https://github.com/src-d/style-analyzer/blob/master/lookout/style/typos/research/eval_dataset/prepare_dataset.py
There are still many "typos" where only the `-ed` or `-s` ending was changed. One option is to apply a stemmer to the original and changed tokens (identifier parts) and check whether there is any real difference between them. But we should make sure that real typos are not filtered out.

@irinakhismatullina if you have an understanding of how many typos in our dataset are bad, please tell.
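
A minimal sketch of the stemmer idea above, just to make it concrete. Everything here is an assumption for illustration: `split_identifier`, `crude_stem` and `differs_only_by_ending` are hypothetical names and are not part of `prepare_dataset.py`, and the suffix stripper is a deliberately crude stand-in for a real stemmer, handling only the `-ed` and `-s` endings mentioned here.

```python
import re

# Hypothetical suffix list: only the two endings discussed in this thread.
SUFFIXES = ("ed", "s")


def split_identifier(identifier):
    """Split a camelCase / snake_case identifier into lowercase parts."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def crude_stem(token):
    """Strip one known suffix, keeping at least 3 characters of the base.

    This is NOT a real stemmer, just a placeholder for one.
    """
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def differs_only_by_ending(wrong, correct):
    """Return True if the pair differs only by -ed / -s endings of its parts.

    Such pairs are candidates for removal from the typo dataset.
    """
    wrong_parts = split_identifier(wrong)
    correct_parts = split_identifier(correct)
    if wrong_parts == correct_parts:
        # Case-only or separator-only difference: not an ending change.
        return False
    if len(wrong_parts) != len(correct_parts):
        return False
    return all(
        crude_stem(a) == crude_stem(b)
        for a, b in zip(wrong_parts, correct_parts)
    )


# Pairs taken from the table above.
candidates = [
    ("getFeatures", "getFeature"),      # True: only the -s ending differs
    ("SupportTypes", "SupportedTypes"),  # True: only the -ed ending differs
    ("fuelguage", "fuelgauge"),         # False: a real typo, must be kept
]
to_review = [pair for pair in candidates if differs_only_by_ending(*pair)]
```

Pairs flagged this way would probably still need a manual look before being dropped, since a crude suffix rule can both miss cases and over-match.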