rufuspollock / ideas

Ideas for (tech) stuff to research, build or work on.
https://rufuspollock.com/

Bookmarking for datasets - ultra quick, but with connection to a bigger workflow #49

Open rufuspollock opened 11 years ago

rufuspollock commented 11 years ago
gr33ndata commented 11 years ago

Are you still interested in this project? I can give it a try if it is still needed.

rufuspollock commented 11 years ago

@gr33ndata I think so. The first thing would be to a) write up a couple of quick user stories and b) detail (a bit more) the absolutely minimal thing we could do that would be useful :-)

In terms of a name for it: how about databookmarks (or datapins, a la pinboard, pinterest etc)?

gr33ndata commented 11 years ago

User Stories:

  1. A user is looking for the government budget of her country. In the search box on the website's homepage, she enters 'budget government egypt'. All datasets related to her query are retrieved and ordered by relevance.
  2. A user is looking for a dataset of train stations in England, especially geodata for their locations. In the search box on the website's homepage, she enters 'trains england'. She is not pleased with the results, so she needs to refine her query. Underneath each result there is a list of the tags assigned to that dataset on delicious. She finds the tag 'latlng' underneath one of the results, so she changes her query to 'trains england latlng'.
  3. A data journalist is looking for stories to cover. She enters a generic search query, 'opendata kenya'. There are options to sort the results by date. Based on her hypothesis that the most recent datasets are more likely to lead to a hot story, she sorts the results by date. Rather than looking at the results themselves, she browses the tags assigned to each of them to find out what kind of data is currently being released as open data.

Implementation Details:

  1. We agree on a unique hashtag and ask others to use it when bookmarking datasets on delicious. (OpenData is not a good option here, since it is used to tag news or jobs related to open data as well as real datasets.)
  2. Our system visits delicious' RSS feed (or API) periodically to update its index. (For now this is enough; there is no need for a real spider that follows links from one page to the next.)
  3. We build a vector space model: a matrix where rows represent datasets/urls and columns represent the tags given to them.
  4. When a user enters a new query, it is converted to the same format as the rows of the matrix, then items in the matrix are returned based on how similar they are to the user's query. We may use any basic similarity measure here, such as Euclidean distance or dot products.
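
Steps 3 and 4 above could be sketched roughly as follows. This is only a minimal illustration, not a finished design: the sample URLs and tags are made up, the names `build_index`, `query_vector` and `search` are hypothetical, and the dot product is used as the similarity measure simply because it is the easiest of the options mentioned.

```python
# Hypothetical sample data: bookmarked dataset URL -> tags given to it.
# In the real system these would come from the bookmarking feed.
BOOKMARKS = {
    "https://example.org/egypt-budget": ["budget", "government", "egypt"],
    "https://example.org/uk-stations": ["trains", "england", "latlng"],
    "https://example.org/kenya-health": ["opendata", "kenya", "health"],
}


def build_index(bookmarks):
    """Build the tag matrix: one row per URL, one column per tag."""
    vocab = sorted({tag for tags in bookmarks.values() for tag in tags})
    columns = {tag: i for i, tag in enumerate(vocab)}
    matrix = {}
    for url, tags in bookmarks.items():
        row = [0] * len(vocab)
        for tag in tags:
            row[columns[tag]] += 1
        matrix[url] = row
    return columns, matrix


def query_vector(query, columns):
    """Convert a free-text query into the same format as a matrix row."""
    row = [0] * len(columns)
    for term in query.lower().split():
        if term in columns:  # terms that are not known tags are ignored
            row[columns[term]] += 1
    return row


def search(query, columns, matrix):
    """Return URLs ordered by dot-product similarity, best match first."""
    q = query_vector(query, columns)
    scored = []
    for url, row in matrix.items():
        score = sum(a * b for a, b in zip(q, row))
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


columns, matrix = build_index(BOOKMARKS)
print(search("trains england latlng", columns, matrix))
```

A plain dot product favours datasets with many tags, so a normalised measure such as cosine similarity might be worth trying once there is real data to test against.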

rufuspollock commented 11 years ago

@gr33ndata this is great! Some thoughts:

Overall, I have to say that, having now thought about it more (thanks to your excellent user stories), I'm not really sure this merits work yet without further thought about how it would work. Options:

rufuspollock commented 11 years ago

@gr33ndata what do you think? Very happy for you to forge ahead with this "idea" if that is your preference :-)

gr33ndata commented 11 years ago

Let me send you an email now.