Importing ascent process

scd commented 12 years ago

I have got some beta test code for you to put through its paces with your Sachsen logbook import. Please use the attached CSV (it's your file plus columns for country and crag, converted to UTF-8). I have also CC'd Campbell, Adam and Brendan in case they are bored and also want to try this in dev.

To start the import, log into dev and use the following URL (the id is for Nicky's account; please replace it with your own account's id if you are not Nicky). Note that you can come back to it at any point.

http://dev.thecrag.com/processmap/importlogbook/190453269

Note that after you have uploaded the file you will be asked to configure the columns by telling the system which column corresponds with which input. You should also disable 'Use country grading' and set your import to 'Saxon'. Your date format is 'dd/mm/yyyy'.

If you cannot finish the process in one sitting it remembers where you were up to.

I have done a trial run through on my system and of your 220 odd ascents it easily imported about 200 of them against a route. There were about 16 cases I deferred because it needed a bit more knowledge to work out which route the ascent should be associated with and 6 cases where a new route was created in the Orphans directory off Sachsen. I think this is pretty cool considering that every second route is called 'Alter Weg'.

It's not particularly fast because it does one route at a time and asks you to confirm each time. Eventually when I'm a lot more comfortable that all is working well we might be able to automate this a bit more. Also remember that this is in dev so this is just to test the functionality - any uploading you do in dev will be lost.

Please note that I am not trying to make this pretty, just functional. There is so much to get right in the function of this import that I don't want to be distracted too much with look and feel issues.

Campbell, Brendan and Adam: the file I have attached is something that Nicky had on a spreadsheet 15 years ago but had lost the electronic copy of. He had a printed copy which he OCR'ed, so this is a pretty cool test.

scd commented 12 years ago

From Nicky...

Hi Simon and all others

Thank you for the first version of the long-desired ascent import. Your implemented search and match algorithm seems to work pretty well. Some points or suggestions I would like to make:

First of all, just for background: why are there so many "Alter Weg" routes in Sächsische Schweiz? By now you have certainly learned that Sächsische Schweiz is THE crag of rules, traditions and history. Alter Weg stands for "Normal route", and it is an unwritten rule that the very first route on each cliff is called "Alter Weg".
general process (setup):
    UTF-8 encoding: good and very important
    possibility to handle multiple import files at the same time: very good, especially if it helps the import process to split the file into chunks of connected ascents (same country, crag or ...)
    pause and resume the import later: very good
    skip or withdraw uncertain lines: very good
fuzzy search for routes: the result of the search looks very good to me.
    It might help (only for the testing and development phase) if you could print out your matching score (Levenshtein similarity, tf-idf, ... whatever you or the full-text index provides) for the candidates found.
    the results for two-word names are sometimes not so good, because the word order of adjectives or descriptive metadata is not standardised. Fictitious examples:
        Southern XYZ Tower <-> XYZ Tower, Southern <-> XYZ Tower (South) ...
        Variation to XYZ Route <-> XYZ Route Variation ...
        Maybe you could add some hidden additional iterations with changed word orders and add the very best result above a certain threshold to the result list
search for areas/sectors:
    in the setup for the import you are asking for a crag and/or cliff column. It might be a more general approach to ask the user to assign all subarea columns (independent of type) and then assume a parent-child relation from left to right. This would also be a design question for the UI later on.
    maybe it would help the entire import process if you first did a "group by with rollup" (with ascent count aggregation) over all area columns (country, region, crag, sector ...) in the import file and then did an "area search" (or even country search) for all candidates (this might sound a bit complicated, but I guess a good UI could combine many of these steps into a single action/view):
        if only a few (2 or fewer) routes per subarea are found, then go ahead as now
        if more routes are found, add the area search:
            "The system found a new country: 'Germany'. Do you mean Select<Germany,...>?"
            "The system found a new area: 'Rathener Gebiet'. Do you mean Select<Rathen,...>?" (only children of the parents already confirmed in a previous step)
    If an area is well matched, the later route search will perform much better due to the pruned search tree
Dealing with new routes (not found in index):
    If an area/sector was found in the index (see above) but no route matches, then I would propose adding the route to an "orphan-node" in the deepest matched subarea.

    For "World -> Europe -> Germany -> Sächsische Schweiz -> Bielatal -> Spannagelturm - Verlassene Wand -> Felix -> Nordostweg" the system will create an "orphan-node" in Bielatal. That is too high in the index - there is almost no way to merge this later.
    If the import file provides several (i.e. > 2) routes/ascents for a subarea (see above), do not add the routes to the orphan-node but directly to the index.
    Also, if the matched area (see splitting up of area and route search) already provides a minimum number of routes, add the new one directly to the index, not to the orphan-node
    Well-adjusted thresholds will help to ensure good index quality but also to expand the index by importing ascents. Maybe the Sächsische Schweiz test import is not the best example due to the comprehensive index in this area, but for example in my 8a test import you will find some crags with more than 20 ascents not even mentioned in theCrag so far. Joining all of them in a (worst case) country-level orphan-node will lead to a very difficult clean/reparent/merge process later on
Performance and usability: you pointed out that it is only a functional prototype so far. My testing clearly shows me how important good performance and usability are. I only tested a few dozen ascents and could not imagine doing that for all 1500 candidates.
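
The "group by with rollup" idea from the area search point can be sketched roughly as follows. This is only an illustration in Python, not the site's code: the column names, the sample rows and the `AREA_SEARCH_THRESHOLD` constant are assumptions, with the threshold taken from the "2 or fewer" rule suggested in the list.

```python
from collections import Counter

# Assumed threshold: subareas with more ascents than this get an explicit area search.
AREA_SEARCH_THRESHOLD = 2

def rollup_counts(rows, area_columns):
    """Count ascents for every area prefix (country, country+region, ...)."""
    counts = Counter()
    for row in rows:
        path = ()
        for col in area_columns:
            value = (row.get(col) or "").strip()
            if not value:
                break  # deeper columns are empty for this line
            path += (value,)
            counts[path] += 1
    return counts

def areas_needing_search(counts):
    """Area paths above the threshold, parents first so they can be confirmed first."""
    paths = [p for p, n in counts.items() if n > AREA_SEARCH_THRESHOLD]
    return sorted(paths, key=lambda p: (len(p), p))

# Hypothetical import rows (only the area columns are shown).
rows = [
    {"country": "Germany", "region": "Sächsische Schweiz", "crag": "Bielatal"},
    {"country": "Germany", "region": "Sächsische Schweiz", "crag": "Bielatal"},
    {"country": "Germany", "region": "Sächsische Schweiz", "crag": "Bielatal"},
    {"country": "Germany", "region": "Sächsische Schweiz", "crag": "Rathener Gebiet"},
]
counts = rollup_counts(rows, ["country", "region", "crag"])
to_confirm = areas_needing_search(counts)
```

With these rows, "Rathener Gebiet" has a single ascent and would go through the normal per-line flow, while the other three area levels would each be confirmed once up front.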

I know that some of the suggestions are not so easy to implement, but I hope this is not the main criterion. I am looking forward to discussing the import process in more detail - now we have a very promising prototype. Nicky

scd commented 12 years ago

Hidden word order iterations: Good idea to change up the word order to see if there are some further matches. We only need to do it when we see specific words such as South/Southern. Maybe we could create a list of these trigger words. Which lines did not work in your test example?
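
For discussion, here is a minimal sketch of what "hidden word order iterations" could look like. It is written in Python with the standard-library `difflib` as a stand-in scorer, so it is not what the site search does; `MAX_TOKENS`, `word_order_variants` and `best_variant_score` are hypothetical names. It tries every ordering of short names rather than relying on trigger words.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations

# Assumed cap: 4 tokens gives at most 24 orderings, so the extra work stays small.
MAX_TOKENS = 4

def tokens_of(name):
    """Lower-case the name and keep only word tokens (drops commas, brackets)."""
    return re.findall(r"\w+", name.lower())

def word_order_variants(name):
    """All distinct word orders of a short name; long names are left as they are."""
    tokens = tokens_of(name)
    if len(tokens) > MAX_TOKENS:
        return [" ".join(tokens)]
    return sorted({" ".join(p) for p in permutations(tokens)})

def best_variant_score(query, candidate):
    """Best similarity between any word order of the query and the candidate."""
    target = " ".join(tokens_of(candidate))
    return max(
        SequenceMatcher(None, variant, target).ratio()
        for variant in word_order_variants(query)
    )
```

Only the best score per candidate would be compared against a threshold before the candidate is added to the result list, so the extra iterations stay hidden from the user.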

Search scores: I just use our standard site search backend which does not easily give me scores. If we really need this I will spend the time, but otherwise I think I can visualise the problems when I see a test case which has gone wrong.

Group by range: I will hold off on this one a bit. I would prefer to know what is the actual country because I know that these are unique and I can do some additional special processing. I also need to know the actual crag because if the system cannot find a crag then it will not let you proceed with the import for that line (ie you cannot create orphans at country level). So there is some contextual knowledge about some of the assignments. There is no contextual knowledge with Sector and Cliff though. I need to prove to myself that my assumptions are wrong here before I change over.

Remembering crag assignments: This is actually already in, but I had to disable it as I found a last-minute problem which required a db change. In the interests of being pragmatic I wanted to get this out to you to test before fixing the issue. I have also got the ability to remember assignments between users for, say, an 8a import (this comes from the 'Source' field when you first upload the file in the import).

Note that we have done something similar for syncing website content imports, where I match generic tree structures. We did this for importing content from ACA, which had a parallel database of Australian routes. It was able to work out that the 'NSW' node on one site matched the 'New South Wales' node on the other site because of the overall tree structure match. It is possible that I could use this for importing ascents; however, I have made the assumption that logbooks will be far less structurally consistent, so I should treat each line fairly independently (other than remembering previous assignments). If we need to I can go down this path, but there is a lot of work which may end up at a dead end.

Orphans: We will not be allowing Orphans to be created at a Country level. The question then becomes whether we proliferate orphans down to sector and cliff level. My gut feeling is to keep the problem contained at the crag level rather than have orphan directories all over the place. Maybe we have one orphan directory but put the cliff and crag in the created route name.

I don't want this tool to be used to create crag directory structures from somebody's logbook. If we don't have content for a crag then I would prefer them to defer the decision to import that line to give us a chance to source the proper crag index. Maybe it might even trigger that user to go and do it themselves. There are a couple of users that create the whole crag when they climb in an unindexed area.

My preference is not to import at all costs but rather make it easy to defer decisions and get the right info in the index. Philosophically this is not an instant gratification tool, but a productivity tool. I guess this has been a bit of a point of difference between us and our competitors - the index structure is everything for us, and I'm not convinced that inheriting logbook structures will be good for us.

I'm probably easily convinced we should not allow this orphans import in the first place. Brendan/Campbell what do you think?

Productivity: Getting through 200 routes using this tool is ok; I did it in about an hour to test the process for your import file. I really wanted to get a feel for how much of a pain it would be, and you are right, it is totally unacceptable for 1500 routes. For now I am going to hold firm with a business requirement that every line must be reviewed and confirmed by the user before the assignment is made.

I am thinking of a slightly different approach:

upload and configure import as before
have a page which asks you if you want to iterate one at a time as before or schedule a batch process. The batch processing may take a while, so it would have to email you when it is done.
the batch process would calculate all the possible results and store them against each line.
go to the report and filter all the batch processed results by number of matches. According to the 80/20 rule 80% of the routes will have a one-for-one match so you could just use checkboxes to review and confirm all these in bulk.
you could review all results with unmatched entries to see what is missing from the index in a single report. This could trigger you to do some index updates and then rerun the batch process for unassigned routes.
At any stage you can swap between iterating and batch processing.
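
A rough sketch of how the batch steps above might hang together, in Python purely for illustration. `ImportLine`, `run_batch`, `report_by_match_count` and `confirm_in_bulk` are hypothetical names, and the `index` dictionary stands in for the real route search. Nothing is assigned until the user ticks a line, which keeps the requirement that every line is confirmed.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ImportLine:
    line_no: int
    route_name: str
    candidates: list = field(default_factory=list)  # matches stored against the line
    confirmed: bool = False

def run_batch(lines, find_candidates):
    """Calculate and store the possible matches for every unconfirmed line."""
    for line in lines:
        if not line.confirmed:
            line.candidates = find_candidates(line.route_name)

def report_by_match_count(lines):
    """Group unconfirmed lines into 'unmatched', 'single' and 'multiple'."""
    buckets = defaultdict(list)
    for line in lines:
        if line.confirmed:
            continue
        n = len(line.candidates)
        key = "unmatched" if n == 0 else "single" if n == 1 else "multiple"
        buckets[key].append(line)
    return buckets

def confirm_in_bulk(lines, ticked_line_nos):
    """Confirm only the ticked lines that have exactly one candidate."""
    for line in lines:
        if line.line_no in ticked_line_nos and len(line.candidates) == 1:
            line.confirmed = True

# Toy stand-in for the index search (hypothetical route ids).
index = {"Alter Weg": ["r1", "r2"], "Nordostweg": ["r3"]}
lines = [
    ImportLine(1, "Alter Weg"),
    ImportLine(2, "Nordostweg"),
    ImportLine(3, "Unknown Route"),
]
run_batch(lines, lambda name: index.get(name, []))
report = report_by_match_count(lines)
```

Rerunning `run_batch` after an index update only touches lines that are still unconfirmed, which matches the idea of rerunning the batch for unassigned routes.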

We are not going to get around the problem that a lot of content will be missing from thecrag. We are working on another initiative which will hopefully encourage people to contribute big time and enjoy contributing (much like yourself). This will be coordinated soon after the app release.

nicHoch commented 12 years ago

Hidden word order iterations: I do not think that it is a good idea to trigger that by content. The management of the keyword list will be a lot of work (especially the multi-language version - here in Europe most of the route/area names are in the native language of the country...). Do you use a third-party full-text engine like Lucene? Then it should not be a problem to disable word order...

An example of the import:

Input line: 09.04.1996,Germany,Sächsische Schweiz,Bielatal,'Dürebielawächter, Vorderer',Alter Weg,II,Lead,Good
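
For anyone wanting to reproduce this line outside the importer, here is a small Python sketch of how it splits into fields. The column names in `COLUMNS` are my guess at the mapping, not the importer's configuration. Note that the cliff name is wrapped in single quotes because it contains a comma, and that this line uses dots in the date rather than the 'dd/mm/yyyy' format mentioned for the Sachsen file.

```python
import csv
from datetime import datetime

# Assumed column mapping for this example line.
COLUMNS = ["date", "country", "crag", "sector", "cliff",
           "route", "grade", "style", "quality"]

# Display format as shown in the import setup -> strptime pattern.
DATE_FORMATS = {"dd/mm/yyyy": "%d/%m/%Y", "dd.mm.yyyy": "%d.%m.%Y"}

def parse_line(line, date_format="dd.mm.yyyy"):
    """Split one logbook line into named fields and parse its date."""
    fields = next(csv.reader([line], quotechar="'"))
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["date"] = datetime.strptime(row["date"], DATE_FORMATS[date_format]).date()
    return row

line = ("09.04.1996,Germany,Sächsische Schweiz,Bielatal,"
        "'Dürebielawächter, Vorderer',Alter Weg,II,Lead,Good")
row = parse_line(line)
```

The cliff field comes out as a single value with the comma intact, which is exactly the "XYZ Tower, Southern" word order case.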

scd commented 12 years ago

I use a Perl module called String::Approx and store a dictionary of descendant area and route names (and words) against each node. This means that I can do the search within the context of a particular area. It shouldn't be too difficult to enhance the search to turn off word order. It's a good idea, and making the enhancement will make it available for general search.
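
To illustrate the idea of a per-node dictionary (not the actual Perl code), here is a small Python sketch. It uses the standard-library `difflib.get_close_matches` where the real system uses String::Approx, and `AreaNode`, the 0.8 cutoff and the "Example Tower" cliff are made up for the example.

```python
from difflib import get_close_matches

class AreaNode:
    """Index node holding a dictionary of all descendant names."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # lower-cased descendant name -> list of paths relative to this node
        self.dictionary = {}

    def add_child(self, name):
        """Create a child and register its name with every ancestor."""
        child = AreaNode(name, parent=self)
        path = [name]
        node = self
        while node is not None:
            node.dictionary.setdefault(name.lower(), []).append(tuple(path))
            path.insert(0, node.name)
            node = node.parent
        return child

    def search(self, query, cutoff=0.8):
        """Approximate search restricted to the descendants of this node."""
        keys = get_close_matches(query.lower(), list(self.dictionary),
                                 n=5, cutoff=cutoff)
        return [path for key in keys for path in self.dictionary[key]]

sachsen = AreaNode("Sächsische Schweiz")
bielatal = sachsen.add_child("Bielatal")
rathen = sachsen.add_child("Rathen")
cliff_a = bielatal.add_child("Spannagelturm")
cliff_b = rathen.add_child("Example Tower")  # hypothetical cliff
cliff_a.add_child("Alter Weg")
cliff_b.add_child("Alter Weg")
```

Searching for "Alter Weg" from `sachsen` returns both routes, while the same search from `cliff_a` returns only one, which is why a well-matched area makes the route search so much easier.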

BTW, we are slowly working on integrating text search with faceted search which may end up in replacing a whole lot of search stuff. But this is a long term issue.

nicHoch commented 12 years ago

The batch idea sounds very useful to me.

I had seen the import of ascents also as a good way to expand the index. But I see that this is not what you have in mind. My fear is that it might put people off if the import of ascents only works well for a few percent of the input (not for technical reasons but due to the sparse index). I absolutely agree that index quality comes first - but I would just like to point out that for Europe the overall route coverage might be less than 10% (personal guess based on some years of climbing in Europe).

I am very interested in the mentioned "contributing initiative" - enhancing the index with distributed power will change a lot.

scd commented 12 years ago

I think that 10% sounds about right. We have 200k routes out of what we estimate to be about 1.5 million routes worldwide. If each climber around the world added just 15 routes then we would have a comprehensive index. Long way to go to let each climber know about us, sign up and then do their part :)

We could load all the ticks in one go really quickly as indexless ticks. The database would handle it, but the UI might barf at some points. The problem with doing this is that the user will lose out on all the cool features of having their ticks linked to a global index.

Maybe we should allow indexless ticks rather than create Orphans directory.

Maybe we should do the import for them first to give us a chance to get the index fixed for the areas they have climbed.

I really don't know what the right answer is here. My plan was to get it working nice for your Sachsen import then look at the problems that come up for your general import.

I will include you on the contributing initiatives sometime soon. It will be nice to get your feedback. It's probably not hold-your-breath stuff, just another piece of a big puzzle.

scd commented 11 years ago

see issue #577

theCrag / website

Importing ascent process #843