stringandstickytape / RegulatedNoise

Define a list of Station Name Suffixes, and use this to "intelligently" insert missing spaces into station names? #24

Open stringandstickytape opened 9 years ago

stringandstickytape commented 9 years ago

This might be harder than expected. For instance, Maddavo's stations file lists:

LUYTEN 674-15 => Nobleport

If that's correct, we have no way of knowing that this is a one-word station name and not a name ending in the suffix " Port". But maybe this is a best-effort algorithm that can't get it right every time. Or maybe the correct fix is to keep updating station.csv and hope the problem falls away over time.

This list of station name suffixes was extracted from Maddavo's stations file:

Anderton Andrade Apology Arena Asylum Ayres Base Beacon Camp Centre Chernobyl City Claim Coliseum Colony Co-Operative Cousens Depot Dive Dixon Doc Dock Enterprise Escape Estate Exchange Exile Eyrie Folly Fort Foundation Freeport Gambit Gate Gateway Goose Halt Ham Hanger Hangout Harrison Haven Hideout High Hold Holdings Holm Home Hope Horizons Hospital Hq Hub Inheritance Installation Jao Klarix Lab Laboratory Lambada Landing Lane Legacy Lincoln Lofthus Lucas Manoevre Manwaring Market Masters Matt Mausoleum Memorial Mine Mines Mojo Mortuary Nest Orbital Orbiter Outpost Owl Park Phoenix Plant Platform Point Port Post Pride Principality Progress Prospect Reach Refinery Reformatory Relay Research Reserve Rest Retreat Ring Sanctuary Scott Settlement Shipyard Silo Spaceport Station Stop Survey Terminal Thiemann Town Vision Vista Wart Way Works Yola Young
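
A minimal sketch of how a suffix list like the one above might be used, best-effort only. Everything here is an assumption for illustration and not RegulatedNoise code: the class `StationNameFixer`, the method `InsertMissingSpace`, and the optional `knownNames` set (standing in for one-word names already present in station.csv). It tries longer suffixes first so that "Spaceport" wins over "Port", and it only inserts a space before the final suffix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of RegulatedNoise.
public static class StationNameFixer
{
    // Returns the name with a space inserted before a known suffix when the
    // name ends with that suffix and no space precedes it.
    // knownNames (optional) lists names that must be left untouched,
    // e.g. one-word names already confirmed in station.csv.
    public static string InsertMissingSpace(
        string name,
        IEnumerable<string> suffixes,
        ISet<string> knownNames = null)
    {
        if (knownNames != null && knownNames.Contains(name))
            return name; // confirmed name, do not second-guess it

        // Longest suffix first, so "Spaceport" is tried before "Port".
        foreach (var suffix in suffixes.OrderByDescending(s => s.Length))
        {
            if (string.IsNullOrEmpty(suffix))
                continue;

            if (!name.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
                continue;

            int cut = name.Length - suffix.Length;

            // The name is just the suffix, or it is already spaced correctly.
            if (cut == 0 || name[cut - 1] == ' ')
                return name;

            return name.Substring(0, cut) + " " + name.Substring(cut);
        }

        return name; // no known suffix matched
    }
}
```

With the suffixes "Port", "City" and "Spaceport", a made-up input such as "ExamplePort" becomes "Example Port" and "Example City" is returned unchanged. The failure mode described above shows up as expected: "Nobleport" becomes "Noble port" unless it is passed in `knownNames`, which is why this can only ever be a best-effort pass.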

stringandstickytape commented 9 years ago

Code to extract suffixes and prefixes from Maddavo's data:

        // Requires: using System.Collections.Generic; using System.IO; using System.Linq;
        var suffixes = new List<string>();
        var prefixes = new List<string>();

        using (var reader = File.OpenText("./station.csv"))
        {
            reader.ReadLine(); // skip the header row

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Naive CSV split: assumes the station name is the second
                // column and contains no commas.
                var values = line.Split(',');
                if (values.Length < 2 || values[1].Length < 2)
                    continue;

                // Strip the quote characters surrounding the station name.
                var name = values[1].Substring(1, values[1].Length - 2);

                var lastSpace = name.LastIndexOf(' ');
                if (lastSpace < 0)
                {
                    // One-word name: no suffix, so the whole name is a prefix.
                    prefixes.Add(name);
                }
                else
                {
                    // The last word is the suffix; everything before it is the prefix.
                    prefixes.Add(name.Substring(0, lastSpace));
                    suffixes.Add(name.Substring(lastSpace + 1));
                }
            }
        }

        // Write each list de-duplicated and sorted, one entry per line.
        File.WriteAllLines("./suffixes.txt", suffixes.Distinct().OrderBy(x => x));
        File.WriteAllLines("./prefixes.txt", prefixes.Distinct().OrderBy(x => x));

stringandstickytape commented 9 years ago

Hm. Now that that dumb bug is fixed, we should reassess. This may not be necessary at all: Tesseract is pretty good at getting the spaces right if no one Replaces them back out again...

Lknechtli commented 9 years ago

I've still had some stations missing the spaces, but now it's only about 5-10% of the time rather than 70%.