Bookmarking for datasets - ultra quick, but with connection to a bigger workflow

rufuspollock commented 11 years ago

Just bookmark with a specific tag (e.g. on pinboard, delicious etc)
We crawl these
Analyze the pages to extract info (or links to datasets if this is an HTML index page)
Store and present
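
A rough sketch of the "extract links to datasets" step, for the case where a bookmarked page is an HTML index. This is only an illustration under assumptions that are not in the thread: the names `DatasetLinkParser` and `extract_dataset_links` are made up, and "looks like a dataset" is guessed purely from the file extension in `DATA_EXTENSIONS`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a link points at a dataset if its path ends in one of these.
DATA_EXTENSIONS = (".csv", ".json", ".xls", ".xlsx", ".zip")


class DatasetLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links that look like data files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Ignore any query string when checking the extension.
        path = href.lower().split("?")[0]
        if path.endswith(DATA_EXTENSIONS):
            # Resolve relative links against the bookmarked page's URL.
            self.links.append(urljoin(self.base_url, href))


def extract_dataset_links(html, base_url):
    """Return dataset-looking links found in an HTML index page."""
    parser = DatasetLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

A real version would probably also want to look at content types rather than just extensions, but this is enough to try the idea on a few bookmarked pages.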

gr33ndata commented 11 years ago

Are you still interested in this project? I can give it a try if it is still needed.

rufuspollock commented 11 years ago

@gr33ndata I think so. First thing would be to a) write up a quick couple of user stories b) detail (a bit more) the absolutely minimal thing we could do that would be useful :-)

In terms of a name for it: how about databookmarks (datapins? a la pinboard pinterest etc)?

gr33ndata commented 11 years ago

User Stories:

A user is looking for the government budget of her country. In the search box on the website's homepage, she enters 'budget government egypt'. All datasets related to her search query are retrieved and ordered by relevance.
A user is looking for a dataset of train stations in England, especially geodata for their locations. In the search box on the website's homepage, she enters 'trains england'. She is not pleased with the returned results, so she needs to refine her search query. Underneath each result there is a listing of the tags assigned to that dataset on delicious. She finds the tag 'latlng' underneath one of the results, so she changes her query to 'trains england latlng'.
A data journalist is looking for stories to cover. She enters a generic search query, 'opendata kenya'. There are options to sort results by date. Based on her hypothesis that the most recent datasets are more likely to be useful for building a hot story, she sorts the results by date. Rather than looking at the results themselves, she traverses the tags assigned to each of them, in order to find out what kind of data is currently being released as open data.

Implementation Details:

We agree on a unique hashtag and ask others to use it when bookmarking datasets on delicious. (OpenData is not a good option here, since it is also used to tag news or jobs related to open data, not just real datasets.)
Our system polls delicious' RSS feed (or API) periodically to update its index. (For now this is enough; there is no need for a real spider that follows links from one page to another.)
We build a vector space model (a matrix where rows represent datasets/URLs and columns represent the tags given to them).
When a user enters a new query, it is converted to the same format as the rows of the matrix, and items in the matrix are returned based on how similar they are to the user's query. We may use any basic similarity measure here, such as Euclidean distance or dot product.
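
To make the last two steps concrete, here is a minimal sketch of the tag matrix and the query matching, using a plain dot product over binary tag vectors. The function names (`build_matrix`, `query_vector`, `search`) and the example URLs are hypothetical, and the input is assumed to be a dict of URL to tag list already pulled from the feed.

```python
def build_matrix(bookmarks):
    """bookmarks: dict mapping url -> list of tags.

    Returns (index, rows): index maps each tag to a column number,
    rows maps each url to a binary vector over those columns.
    """
    vocab = sorted({t.lower() for tags in bookmarks.values() for t in tags})
    index = {tag: i for i, tag in enumerate(vocab)}
    rows = {}
    for url, tags in bookmarks.items():
        vec = [0] * len(vocab)
        for tag in tags:
            vec[index[tag.lower()]] = 1
        rows[url] = vec
    return index, rows


def query_vector(query, index):
    """Convert a free-text query into the same format as a matrix row."""
    vec = [0] * len(index)
    for term in query.lower().split():
        if term in index:  # terms that are not known tags are ignored
            vec[index[term]] = 1
    return vec


def search(query, index, rows):
    """Return urls ordered by dot-product similarity, dropping zero scores."""
    q = query_vector(query, index)
    scored = [(sum(a * b for a, b in zip(q, vec)), url)
              for url, vec in rows.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [url for score, url in scored if score > 0]


bookmarks = {
    "http://a.example/stations": ["trains", "england", "latlng"],
    "http://b.example/budget": ["budget", "government", "egypt"],
    "http://c.example/timetables": ["trains", "england"],
}
index, rows = build_matrix(bookmarks)
results = search("trains england latlng", index, rows)
```

With binary vectors the dot product is just the number of query terms that match a dataset's tags, so it is a crude relevance score; weighting tags or normalising the vectors could come later if this turns out to be useful.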

rufuspollock commented 11 years ago

@gr33ndata this is great! Some thoughts:

Is the focus searching these results or collecting what people are digging up? I guess I'd thought it was more about making it super easy for people to submit datasets (though I now wonder: why? Why is that useful?)
Related to that, I'm wondering whether searching is the better thing here (we can easily end up with a lot of datasets). I think the core of the bookmark idea was thinking along the lines of Friedrich's post about how "data catalogs are people" http://pudo.org/blog/2012/09/25/datacatalogues.html - by tapping into bookmarking we could tap into this (in which case preserving who bookmarked what would be really important)

Overall, I have to say having now thought more (thanks to your excellent user stories) I'm not really sure whether this yet merits work without further thought about how it would work. Options:

Do further refinement
Dive in anyway (this is all about fun!)
Pick another item - if you were looking for suggestions, how about a csv editor, a deflator service or hacking on data.okfn.org

rufuspollock commented 11 years ago

@gr33ndata what do you think? very happy for you to forge ahead with this "idea" if that is your preference :-)

gr33ndata commented 11 years ago

Let me send you an email now
