Open olliebennett opened 10 years ago
https://github.com/coughee/wikistripper I got this far with stripping Wikipedia for the food list, but there are a lot of shitty results in the txt file it generates. I'm not really sure what the best way is to exclude anything that isn't a food.
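One possible way to cut down the junk, sketched under assumptions: drop lines that look like wiki housekeeping or leftover markup before they reach the food list. The patterns, the `max_words` limit and the function names are all hypothetical, not what wikistripper currently does, and they would need tuning against the real output.

```python
import re

# Hypothetical junk heuristics; these patterns are assumptions, not taken from wikistripper.
JUNK_PATTERNS = [
    re.compile(r"^(list of|category:|template:|file:|portal:)", re.IGNORECASE),
    re.compile(r"\d"),            # years, footnote numbers, ISBNs
    re.compile(r"[\[\]{}|<>=]"),  # leftover wiki markup
]


def looks_like_food_name(line, max_words=4):
    """Return True if the line survives every junk heuristic."""
    text = line.strip()
    if not text or len(text.split()) > max_words:
        return False
    return not any(pattern.search(text) for pattern in JUNK_PATTERNS)


def clean(lines):
    """Keep only the stripped lines that look like plausible food names."""
    return [line.strip() for line in lines if looks_like_food_name(line)]
```

This only removes obvious noise. It cannot tell a food from any other short, clean title, so a hand-checked whitelist may still be needed.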
On 19 June 2014 08:37, Ollie Bennett notifications@github.com wrote:
Load test data into SQLite database to allow faster access to pre-calculated hash values for each string.
— Reply to this email directly or view it on GitHub https://github.com/olliebennett/reporrigory/issues/3.
Also do we really need a database wrapper? I would have thought we'd always be accessing in order.
Yeah, I guess the names would typically be processed in order. Only thought of a database so we could pre-compute the hash values of the inner loop (i.e. band names) once and perhaps do some more interesting or intelligent detection of good matches.
Just brainstorming really - certainly no need to get a database going yet (if at all).
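If a database did turn out to be useful, a rough sketch with Python's built-in `sqlite3` module could look like this. The table layout, the function names and the `phonetic_key` stand-in (a crude vowel-dropping key, not the real hash used in the repo) are all assumptions for illustration.

```python
import sqlite3


def phonetic_key(name):
    """Crude stand-in for the real hash: keep letters, drop vowels after the first."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    return letters[0] + "".join(c for c in letters[1:] if c not in "aeiou")


def build_db(band_names):
    """Store every band name alongside its key, computed exactly once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bands (name TEXT PRIMARY KEY, key TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_bands_key ON bands (key)")
    conn.executemany(
        "INSERT OR IGNORE INTO bands (name, key) VALUES (?, ?)",
        [(name, phonetic_key(name)) for name in band_names],
    )
    conn.commit()
    return conn


def bands_with_key(conn, key):
    """Look up all band names sharing a pre-computed key."""
    rows = conn.execute("SELECT name FROM bands WHERE key = ? ORDER BY name", (key,))
    return [row[0] for row in rows]
```

With this layout, `bands_with_key(conn, phonetic_key("Grain Day"))` is an indexed lookup instead of a scan over every band name.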
You're right, if we have to do any expensive calculation on the band names then it would make sense to compute them only once. However, even those computed values would still be accessed in order, so we could just store them in a separate list. I guess it really depends on how big the database is. I've still had no luck finding a sensible way to scrape Wikipedia lists. Currently there is all sorts of rubbish in there (both for bands and foods).
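The "separate list" idea might look roughly like this: compute the per-word keys for every band once, keep them in list order, and reuse them for every food. `phonetic_key` is again a hypothetical stand-in for the real hash, and exact key equality is a simplification of the actual matching.

```python
def phonetic_key(word):
    """Crude stand-in for the real hash: keep letters, drop vowels after the first."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    return letters[0] + "".join(c for c in letters[1:] if c not in "aeiou")


def precompute(band_names):
    """Compute each band's per-word keys once, preserving list order."""
    return [(band, [phonetic_key(w) for w in band.split()]) for band in band_names]


def find_puns(foods, bands):
    """Swap a food into any band name where a word has the same key."""
    band_keys = precompute(bands)  # done once, outside both loops
    puns = []
    for food in foods:
        food_key = phonetic_key(food)
        for band, keys in band_keys:
            words = band.split()
            for i, key in enumerate(keys):
                if key == food_key and words[i].lower() != food.lower():
                    puns.append(" ".join(words[:i] + [food] + words[i + 1:]))
    return puns
```

Everything is read sequentially here, so a plain list is enough as long as the band list fits in memory.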
Haha, I fiddled with some of the values and also added a Levenshtein check on the metaphones so it will 'fuzzy' match them. Came out with some decent ones... Alice Cooper -> Alice Caper, Green Day -> Green Tea Day, Barry White / Manilow -> Berry Manilow, Bon Jovi -> Cheese Bun Jovi (that one cracked me up wtf), Olivia Newton-John -> Olive Newton John, Green Day -> Grain Day, The Beach Boys -> The Peach Boys, Van Halen -> Coq au Vin Halen, Kaiser Chiefs -> Chives Chiefs, Bryan Adams -> Biryani Adams, Barry White -> Barry White Meat, Mariah Carey -> Mariah Curry, Kris Kross -> Kriss Cress.
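For anyone following along, the general shape of a Levenshtein check on phonetic keys is sketched below. This is not the code on the branch: `phonetic_key` is a crude vowel-dropping stand-in rather than a real Metaphone implementation, and `max_distance=1` is an assumed threshold.

```python
def levenshtein(a, b):
    """Edit distance between two strings, keeping one row of the table at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def phonetic_key(word):
    """Stand-in for Metaphone: keep letters, drop vowels after the first."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    return letters[0] + "".join(c for c in letters[1:] if c not in "aeiou")


def fuzzy_match(a, b, max_distance=1):
    """True if the two words' keys are within max_distance edits of each other."""
    return levenshtein(phonetic_key(a), phonetic_key(b)) <= max_distance
```

Comparing keys instead of raw spellings lets "peach" and "beach" match at distance 1, while raising `max_distance` trades more matches for more junk.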
Oh wait, Kaiser Chiefs was Kaiser Chives.
Awesome dude! Push up your changes so I can have a play too!
I pushed it to a new branch. It outputs a lot of junk because I haven't been able to clean up the food database as well as I would like. Also, in the version I pushed I think I changed the band names to a much shorter list with only famous artists in it. A couple more: Nut King Cole, Elvis Parsley.
Pasta Rhymes, Fall Oat Boy, So Salad Crew, Baking Back Sunday, Crostini Aguillera
I think what I should do is make sure to remove all duplicates from the food database. Also, there are a lot of bands with similar names (lots of Whites, Bryans, etc.), so it will report tons of matches once it finds a food that's a good fit. With the large band database you have to scan through the results and try to find the famous band.
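A minimal sketch of both clean-ups, assuming the foods are a plain list of strings and the matches are `(food, band)` pairs. Both shapes and both function names are assumptions for illustration, not the format the branch actually uses.

```python
from collections import defaultdict


def dedupe(names):
    """Drop repeats, ignoring case and extra whitespace, and keep first-seen order."""
    seen = set()
    unique = []
    for name in names:
        key = " ".join(name.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(" ".join(name.split()))
    return unique


def group_by_food(matches):
    """Collect every band matched by the same food into one entry."""
    grouped = defaultdict(list)
    for food, band in matches:
        grouped[food].append(band)
    return dict(grouped)
```

Grouping by food means a good fit like "white meat" shows up once with all its Whites listed together, which should make it quicker to spot the famous one.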