src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

[dataset] Improve typos dataset quality #57

Closed zurk closed 5 years ago

zurk commented 5 years ago

Once the final dataset is collected, we should improve our filtering tool: https://github.com/src-d/style-analyzer/blob/master/lookout/style/typos/research/eval_dataset/prepare_dataset.py

There are still many "typos" where only an -ed or -s ending was changed. One option is to apply a stemmer to the base and changed tokens (identifier parts) and check whether there is any real difference between them. But we should make sure that real typos are not filtered out.
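A minimal sketch of that stemming check, for illustration only. It is not the code in `prepare_dataset.py`: the helper names are hypothetical, and `crude_stem` is a deliberately naive stand-in for a real stemmer that strips only the -ed and -s endings mentioned above.

```python
import re

# Only the endings mentioned above; a real stemmer would handle far more.
SUFFIXES = ("ed", "s")


def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into its parts."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)


def crude_stem(token):
    """Strip a trailing -ed or -s if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def same_after_stemming(wrong, correct):
    """True if the identifiers differ, but their stemmed parts are equal."""
    stem_parts = lambda name: [crude_stem(p) for p in split_identifier(name)]
    return wrong != correct and stem_parts(wrong) == stem_parts(correct)


# Pairs like these would be dropped from the dataset:
print(same_after_stemming("getFeatures", "getFeature"))
print(same_after_stemming("SupportTypes", "SupportedTypes"))
# A real typo has different stems and would be kept:
print(same_after_stemming("RightSingleQoutationMark", "RightSingleQuotationMark"))
```

A pair is filtered only when every stemmed part matches, so a real typo survives as long as it changes more than the ending.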

@irinakhismatullina if you have an understanding of how many typos in our dataset are bad, please tell.

irinakhismatullina commented 5 years ago

I haven't checked the exact numbers; I only know that a large share of the wrong corrections comes from such cases.

Here are the first several wrong corrections:

| pos | wrong | sugg 0 | correct |
|---|---|---|---|
| 4 | RightSingleQoutationMark | RightSingleRotationMark | RightSingleQuotationMark |
| 5 | specifiyHeaders | specificHeaders | specifyHeaders |
| 6 | getFeatures | getFeatures | getFeature |
| 15 | SupportTypes | SupportTypes | SupportedTypes |
| 16 | ReportedDisabledException | ReportedDisabledException | ReporterDisabledException |
| 19 | LiferyPageTopPhaseListenerCompat | LifePageTopPhaseListenerCompat | LiferayPageTopPhaseListenerCompat |
| 20 | fuelguage | fuelguage | fuelgauge |
| 21 | Issue150RegressionAsync | Issue150RegressionAsync | Issue1507RegressionAsync |
| 24 | deliver_method | deliver_method | delivery_method |
| 27 | insertingAEntityWillSetItsIdentifier | insertingAEntityWillSetItsIdentifier | insertingAnEntityWillSetItsIdentifier |
| 30 | RunAlbumGrouWorkflows | RunAlbumGroupWorkflow | RunAlbumGroupWorkflows |
| 32 | colors_initialized | colors_initialized | colorsinitialized |
| 33 | getBlacklistedAlgorithmsURIs | getBlacklistedAlgorithmsURIs | getBlacklistedAlgorithmURIs |
| 35 | AcknowledgementCode | AcknowledgementCode | AcknowledgmentCode |
| 36 | oldToNewCfgNameMappingImpl | oldToNewCfgNameMappingImpl | OldToNewCfgNameMappingImpl |

So, the problems I see here:

  1. Plurality/tense/part-of-speech changes - even people would have trouble correcting those.
  2. Changes in non-letter symbols (rows 21 and 32, for example).
  3. Case changes (last row).

And that's only the first 15 rows, so there may be more further down. I suggest we look into it when the time comes to work on the issue.
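Problems 2 and 3 can be detected mechanically. A minimal sketch, assuming each dataset entry is just a (wrong, correct) pair of identifier strings; the function names are hypothetical and this is not the filtering code from the linked tool.

```python
def only_case_differs(wrong, correct):
    """True if the identifiers differ only in letter case (problem 3)."""
    return wrong != correct and wrong.lower() == correct.lower()


def only_non_letters_differ(wrong, correct):
    """True if the identifiers differ only in digits, underscores, etc. (problem 2)."""
    letters = lambda s: "".join(ch for ch in s if ch.isalpha())
    return wrong != correct and letters(wrong) == letters(correct)


# Rows 21, 32 and 36 from the list above:
print(only_non_letters_differ("Issue150RegressionAsync", "Issue1507RegressionAsync"))
print(only_non_letters_differ("colors_initialized", "colorsinitialized"))
print(only_case_differs("oldToNewCfgNameMappingImpl", "OldToNewCfgNameMappingImpl"))
```

Both checks compare the full identifiers, so an entry that also contains a real letter-level typo (such as row 20, fuelguage vs fuelgauge) is not flagged by either.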

zurk commented 5 years ago

@irinakhismatullina I think this one is done in https://github.com/src-d/style-analyzer/pull/763, right?

irinakhismatullina commented 5 years ago

Most of it, yes. I've removed as many bad examples as I could without removing any real typos. We may need more filtering in the future, but for now, imo, it's done.